a guest, Mar 25th, 2019
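import tensorflow as tf  # assumption: reduce_mean/log in the paste are TensorFlow ops

def update(self, states, actions, rewards, values):
    # Values (or advantages) are calculated outside of the update process.
    advantages = rewards - values
    # Assumption: the right-hand side was left blank in the paste;
    # self.policy is a hypothetical model mapping states to action probabilities.
    action_probs = self.policy(states)
    # Keep only the probability of the action taken in each row (one-hot mask).
    one_hot_actions = self.to_one_hot(actions)
    selected_action_probs = tf.reduce_sum(action_probs * one_hot_actions, axis=1)
    neg_logs = -tf.math.log(selected_action_probs)
    policy_loss = tf.reduce_mean(neg_logs * advantages)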
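
To check the arithmetic of the update step above by hand, here is a minimal pure-Python sketch of the same loss on a tiny batch. The function name policy_loss_sketch and the sample numbers are made up for illustration and are not part of the original paste. It assumes action_probs is a list of per-state probability rows and actions holds integer action indices.

```python
import math

def policy_loss_sketch(action_probs, actions, rewards, values):
    """Mean of -log(pi(a|s)) * advantage over a batch (illustrative only)."""
    # Advantage is reward minus the value estimate, as in the paste.
    advantages = [r - v for r, v in zip(rewards, values)]
    # Probability the policy assigned to the action actually taken.
    selected = [probs[a] for probs, a in zip(action_probs, actions)]
    neg_logs = [-math.log(p) for p in selected]
    weighted = [n * adv for n, adv in zip(neg_logs, advantages)]
    return sum(weighted) / len(weighted)

# Hypothetical batch of two states with two actions each.
action_probs = [[0.5, 0.5], [0.25, 0.75]]
actions = [0, 1]
rewards = [1.0, 2.0]
values = [0.0, 1.0]
loss = policy_loss_sketch(action_probs, actions, rewards, values)
```

Indexing each row by the action index selects the same entries as multiplying by a one-hot mask and summing, which is what the update step does in tensor form.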