Simple Reinforcement learning tutorials
Updated May 29, 2020 - Python
I was surprised to see this loss function, because cross-entropy is generally used when the target is a distribution (i.e. it sums to 1). That is not the case for the advantage estimate. However, I worked out the math, and it does appear to be doing the right thing, which is neat!
I think this trick deserves a comment in the code.
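As a sketch of why the trick works (my own illustration, not code from the repo): scaling the cross-entropy between the taken action (a one-hot "label") and the policy by the advantage gives a gradient of exactly `A * (probs - one_hot(action))`, the REINFORCE policy-gradient term, so the target never needs to sum to 1. The names below (`logits`, `advantage`) are assumed for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])   # unnormalized action scores (assumed values)
action = 1                            # action actually taken
advantage = 2.3                       # advantage estimate: any real number, not a distribution

probs = softmax(logits)
cross_entropy = -np.log(probs[action])   # ordinary CE against the one-hot action
loss = advantage * cross_entropy         # advantage-weighted surrogate loss

# Analytic gradient of the loss w.r.t. the logits:
# advantage * (probs - one_hot), i.e. the policy-gradient estimator.
one_hot = np.eye(len(logits))[action]
grad = advantage * (probs - one_hot)
```

Because `probs - one_hot` sums to zero, the advantage only rescales the update direction; it never has to behave like a probability target.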