Untitled

a guest
May 23rd, 2019
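import numpy as np


def evaluate_policy_return(T, behavioral_policy, target_policy):
    # Ordinary importance-sampling estimate of the target policy's expected
    # return, computed from trajectories T collected under the behavioral policy.
    # Both policies are called as policy(state, action) and return a probability.
    returns = []
    for trajectory in T:
        importance_weight = 1.0
        trajectory_return = 0.0
        for transition in trajectory:
            state, action, reward = transition[:3]
            action_prob_b = behavioral_policy(state, action)
            action_prob_t = target_policy(state, action)

            # Running product of per-step likelihood ratios; assumes the
            # behavioral policy gives nonzero probability to each logged action.
            importance_weight *= action_prob_t / action_prob_b
            trajectory_return += reward  # undiscounted sum of rewards

        # Reweight the whole trajectory's return by its importance weight.
        returns.append(trajectory_return * importance_weight)

    return np.mean(returns)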
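A small usage sketch may help check the estimator by hand. Everything below the function is hypothetical and not part of the paste: the two toy policies, the single state `0`, and the two one-step trajectories are made up for illustration. The function is repeated in compact form only so the snippet runs on its own.

```python
import numpy as np


def evaluate_policy_return(T, behavioral_policy, target_policy):
    # Compact copy of the estimator from the paste, repeated so this runs alone.
    returns = []
    for trajectory in T:
        importance_weight = 1.0
        trajectory_return = 0.0
        for transition in trajectory:
            state, action, reward = transition[:3]
            importance_weight *= (
                target_policy(state, action) / behavioral_policy(state, action)
            )
            trajectory_return += reward
        returns.append(trajectory_return * importance_weight)
    return np.mean(returns)


# Hypothetical behavioral policy: uniform over two actions.
def behavioral_policy(state, action):
    return 0.5


# Hypothetical target policy: prefers action 1.
def target_policy(state, action):
    return 0.75 if action == 1 else 0.25


# Two hypothetical one-step trajectories of (state, action, reward) tuples.
T = [
    [(0, 1, 1.0)],  # weight 0.75 / 0.5 = 1.5, return 1.0
    [(0, 0, 0.0)],  # weight 0.25 / 0.5 = 0.5, return 0.0
]

estimate = evaluate_policy_return(T, behavioral_policy, target_policy)
print(estimate)  # (1.5 * 1.0 + 0.5 * 0.0) / 2 = 0.75
```

In this toy setup action 1 pays 1 and action 0 pays 0, so the target policy's true expected reward is 0.75, which the estimate matches here because the two logged actions happen to mirror the uniform behavioral policy.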