
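# Compute the cost-to-go of a fixed policy pi in closed form:
#   J_pi = (I - gamma * P_pi)^(-1) * c_pi
# where gamma is the discount factor, P_pi is the state-to-state transition
# matrix under pi, and c_pi is the cost under pi.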

# First, rewrite the policy as a square matrix whose rows and columns both correspond to the states.

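import numpy as np

# Note: the last row is all zeros, so the last state has no outgoing transitions here.
Ppi = np.array([
    [0, 0.5, 0,   0.5, 0,   0],
    [0, 0,   0.5, 0,   0.5, 0],
    [0, 0,   0,   0,   0,   1],
    [0, 0,   0,   0,   1,   0],
    [0, 0,   0,   0,   0,   1],
    [0, 0,   0,   0,   0,   0],
])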

# Updated cost matrix c_pi, with one row per state.

Cpi = np.array([
    [0, 0.1, 0, 0.1],
    [0, 0.1, 0, 0.1],
    [0, 1,   0, 0],
    [0, 0,   0, 0.2],
    [0, 0,   0, 0.2],
    [0, 0,   0, 0],
])
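
# For reference: a common way to get a square P_pi and a cost vector c_pi is to
# average the per-action transitions and costs with the policy probabilities:
#   P_pi[x, y] = sum_a pi[x, a] * P[a, x, y],   c_pi[x] = sum_a pi[x, a] * c[x, a]
# The sketch below is only an illustration on a hypothetical 2-state, 2-action
# problem. The helper policy_averages and every *_demo array are assumptions,
# not the data or the method used in this script.

```python
import numpy as np

def policy_averages(P, c, pi):
    """Average transitions and costs over actions using the policy.

    P:  (n_actions, n_states, n_states) transition probabilities per action
    c:  (n_states, n_actions) immediate costs
    pi: (n_states, n_actions) action probabilities per state
    """
    # P_pi[x, y] = sum_a pi[x, a] * P[a, x, y]
    P_pi = np.einsum("xa,axy->xy", pi, P)
    # c_pi[x] = sum_a pi[x, a] * c[x, a]
    c_pi = np.sum(pi * c, axis=1)
    return P_pi, c_pi

# Hypothetical data, chosen only to make the shapes concrete.
P_demo = np.array([
    [[1.0, 0.0], [0.5, 0.5]],  # transitions under action 0
    [[0.0, 1.0], [0.0, 1.0]],  # transitions under action 1
])
c_demo = np.array([[0.0, 1.0], [0.5, 0.0]])
pi_demo = np.array([[0.5, 0.5], [1.0, 0.0]])

P_pi_demo, c_pi_demo = policy_averages(P_demo, c_demo, pi_demo)
print("P_pi (square, one row per state):\n", P_pi_demo)
print("c_pi (one entry per state):", c_pi_demo)
```

# With this construction P_pi is always n_states x n_states, so it can be
# subtracted from the identity, and c_pi is a vector with one entry per state.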

I = np.eye(6)

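# QUESTION: is this the formula we are supposed to use? The policy is not an n x n matrix,
# so it cannot be subtracted from the identity; that is why we changed Ppi above.
gamma = 0.99  # discount factor
# Solve (I - gamma * Ppi) J = Cpi rather than forming the inverse explicitly.
# Note: Cpi is 6 x 4, so Jpi is also 6 x 4 (one column per column of Cpi).
Jpi = np.linalg.solve(I - gamma * Ppi, Cpi)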

print("Cost-to-go: \n", Jpi)
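
# A quick sanity check for the closed-form expression: the result should satisfy
# the fixed-point equation J = c + gamma * P * J. The sketch below tests this on
# a hypothetical 2-state chain. The function cost_to_go and the *_check names are
# assumptions introduced for illustration, not part of the original script.

```python
import numpy as np

def cost_to_go(P_pi, c_pi, gamma):
    """Solve (I - gamma * P_pi) J = c_pi for J."""
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

# Hypothetical chain: state 0 stays or moves to state 1 with equal
# probability; state 1 is absorbing and costs nothing.
P_check = np.array([[0.5, 0.5],
                    [0.0, 1.0]])
c_check = np.array([1.0, 0.0])
gamma_check = 0.99

J_check = cost_to_go(P_check, c_check, gamma_check)

# Largest violation of J = c + gamma * P @ J; should be close to zero.
residual = np.max(np.abs(J_check - (c_check + gamma_check * P_check @ J_check)))
print("Cost-to-go:", J_check)
print("Bellman residual:", residual)
```

# The same residual can be computed for Jpi above by replacing the *_check
# arrays with Ppi, Cpi and the discount factor 0.99.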