Untitled

"""A simple Markov Decision Process describing a painting robot.

This problem is inspired by the "Planning" lesson from the CS 7637 (Knowledge-
Based Artificial Intelligence) course at Georgia Institue of Technology, with
Prof. Ashok Goyel.  It was created in support of an MDP assignment in CS 7641
(Machine Learning), with Profs. Charles Isbell and Michael Littman.

A robot is tasked with painting a ladder and a ceiling.  The following
independent states are possible, for a total of 8:

- The robot may be positioned on the ladder or on the floor.
- The ceiling may be wet (painted) or dry (unpainted).
- The ladder may be wet (painted) or dry (unpainted).

The initial state is robot: on floor, ceiling: dry, ladder: dry.

The robot may take the following actions:

- Climb the ladder
- Descend the ladder
- Paint the ladder
- Paint the ceiling

The robot's actions are reliable.  There is no stochastic element to this
problem.

Climbing or descending the ladder changes the robot's state.  Painting the
ceiling or ladder changes the ceiling's or ladder's state.  Some action-state
combinations result in no change in state, such as painting the ceiling while
on the floor, or climbing the ladder while already on the ladder.

The reward for each action is -1, with the following exceptions:

-5 if the robot paints the ladder while on it.
-5 if the robot climbs a wet ladder.
+10 (and the problem ends) when the ceiling and ladder are both wet, and the
robot is on the floor.

"""

import numpy as np


def get_matrices():
    """Provide transition and rewards matrices for a Robot Painter MDP.

    Returns (P, R), where P contains the transition probability matrices, and
    R is the rewards matrix.  Both P and R are numpy arrays.  P is a 3-D array
    of shape (Actions, States, States).  R is a 2-D array of shape (State,
    Action).

    The order of states is:
    robot: on floor, ladder: dry, ceiling: dry
    robot: on floor, ladder: dry, ceiling: wet
    robot: on floor, ladder: wet, ceiling: dry
    robot: on floor, ladder: wet, ceiling: wet
    robot: on ladder, ladder: dry, ceiling: dry
    robot: on ladder, ladder: dry, ceiling: wet
    robot: on ladder, ladder: wet, ceiling: dry
    robot: on ladder, ladder: wet, ceiling: wet

    The order of actions is:
    climb ladder
    descend ladder
    paint ladder
    paint ceiling

    """

    P = np.array([
        # Action: climb ladder
        [
            [0., 0., 0., 0., 1., 0., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0., 0.],
            [0., 0., 0., 0., 0., 0., 1., 0.],
            [0., 0., 0., 1., 0., 0., 0., 0.],
            [0., 0., 0., 0., 1., 0., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0., 0.],
            [0., 0., 0., 0., 0., 0., 1., 0.],
            [0., 0., 0., 0., 0., 0., 0., 1.]
        ],
        # Action: descend ladder
        [
            [1., 0., 0., 0., 0., 0., 0., 0.],
            [0., 1., 0., 0., 0., 0., 0., 0.],
            [0., 0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0., 0., 0., 0.],
            [1., 0., 0., 0., 0., 0., 0., 0.],
            [0., 1., 0., 0., 0., 0., 0., 0.],
            [0., 0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0., 0., 0., 0.]
        ],
        # Action: paint ladder
        [
            [0., 0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0., 0., 0., 0.],
            [0., 0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 1., 0.],
            [0., 0., 0., 0., 0., 0., 0., 1.],
            [0., 0., 0., 0., 0., 0., 1., 0.],
            [0., 0., 0., 0., 0., 0., 0., 1.]
        ],
        # Action: paint ceiling
        [
            [1., 0., 0., 0., 0., 0., 0., 0.],
            [0., 1., 0., 0., 0., 0., 0., 0.],
            [0., 0., 1., 0., 0., 0., 0., 0.],
            [0., 0., 0., 1., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0., 0.],
            [0., 0., 0., 0., 0., 1., 0., 0.],
            [0., 0., 0., 0., 0., 0., 0., 1.],
            [0., 0., 0., 0., 0., 0., 0., 1.]
        ]
    ])

    R = np.array([
        [-1., -1., -1., -1.],
        [-1., -1., 10., -1.],
        [-5., -1., -1., -1.],
        [0., 0., 0., 0.],
        [-1., -1., -5., -1.],
        [-1., -1., -5., -1.],
        [-1., -1., -5., -1.],
        [-1., 10., -5., -1.]
    ])

    return (P, R)