I have a DQN algorithm that learns (the loss converges to 0), but unfortunately it learns a Q-value function in which the Q values for the 2 possible actions are very similar. It is also worth noting that the Q values change very little from one observation to the next.
Details:
The algorithm plays CartPole-v1 from OpenAI Gym, but uses the screen pixels as the observation rather than the 4 state values the environment provides
The reward function I use gives a reward of 0.1 on every step while the game is not over and -1 on game over
The discount factor (gamma) is 0.95
Epsilon is 1 for the first 3,200 actions (to populate some of the replay memory) and is then annealed to 0.01 over 100,000 steps (a sketch of this schedule is given after the details)
The replay memory has a size of 10,000
The architecture of the conv net is:
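To make the exploration settings above concrete, here is a minimal sketch of the epsilon schedule as described; the linear decay, constant names, and function name are my assumptions rather than details from the question:

```python
# Sketch of the described exploration schedule (assumed to be a linear anneal):
# epsilon stays at 1.0 for the first 3,200 steps while the replay memory is seeded,
# then decreases linearly to 0.01 over the next 100,000 steps.

WARMUP_STEPS = 3_200      # pure-random phase that populates the replay memory
ANNEAL_STEPS = 100_000    # length of the linear decay
EPS_START, EPS_END = 1.0, 0.01

def epsilon_at(step: int) -> float:
    """Return the exploration rate for a given global step count."""
    if step < WARMUP_STEPS:
        return EPS_START
    progress = min((step - WARMUP_STEPS) / ANNEAL_STEPS, 1.0)
    return EPS_START + progress * (EPS_END - EPS_START)
```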
Providing a positive reward of 0.1 on every step as long as the game is not over may make the -1 game-over punishment almost irrelevant, particularly considering the discount factor you are using: with gamma = 0.95, a long run of 0.1 rewards is worth roughly 0.1 / (1 - 0.95) = 2, while the terminal -1 is heavily discounted from most states, so the Q values of both actions end up nearly identical.
It is difficult to judge without looking at your source code, but I would initially suggest providing only a negative reward at the end of an episode and removing the positive per-step reward.
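As a rough illustration of that suggestion, here is a minimal sketch of the reward scheme as a Gym wrapper. The wrapper class name is hypothetical, and I am assuming the classic Gym API where step returns a 4-tuple:

```python
import gym

# Sketch of the suggested reward scheme: no per-step reward, only a -1 penalty
# when the episode terminates. Names below are illustrative, not from the question.
class TerminalPenaltyReward(gym.Wrapper):
    def step(self, action):
        obs, _, done, info = self.env.step(action)   # ignore Gym's own reward
        reward = -1.0 if done else 0.0               # punish only at game over
        return obs, reward, done, info

env = TerminalPenaltyReward(gym.make("CartPole-v1"))
```

With this scheme, the only signal the agent receives is the discounted -1 at termination, so states (and actions) that lead to an earlier failure get visibly lower Q values than those that keep the pole balanced.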