
Why would a DQN give similar values to all actions in the action space (2) for all observations

I have a DQN algorithm that learns (the loss converges to 0), but unfortunately it learns a Q-value function in which the Q-values for the two possible actions are very similar. It is also worth noting that the Q-values barely change from one observation to the next.

Details:

  • The algorithm plays CartPole-v1 from OpenAI Gym but uses the screen pixels as an observation rather than the 4 values provided

  • The reward function I use gives a reward of 0.1 on every step where the game is not over and -1 on game over

  • The discount factor (gamma) is 0.95

  • Epsilon is 1 for the first 3,200 actions (to populate some of the replay memory) and is then annealed to 0.01 over 100,000 steps

  • the replay memory is of size 10,000

  • The architecture of the conv net is (a Keras sketch follows this list):

    • input layer of size screen_pixels
    • conv layer 1 with 32 filters with kernel (8,8) and stride (4,4), relu activation function and is padded to be the same size on output as input
    • conv layer 2 with 64 filters with kernel (4,4) and stride (2,2), relu activation function and is padded to be the same size on output as input
    • conv layer 3 with 64 filters with kernel (3,3) and stride (1,1), relu activation function and is padded to be the same size on output as input
    • a flatten layer (this is to change the shape of the data to allow it to then feed into a fully connected layer)
    • Fully connected layer with 512 nodes and relu activation function
    • An output fully connected layer with 2 nodes (the action space)
  • The learning rate of the convolutional neural network is 0.0001
  • The code has been developed in Keras and uses experience replay and double deep Q-learning
  • The original image is reduced from (400, 600, 3) to (60, 84, 4) by greyscaling, resizing, cropping and then stacking 4 images together before providing this to the conv net
  • The target network is updated every 2 online network updates.
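
For reference, here is a minimal Keras sketch of the network described above. Only the layer sizes, strides, padding, activations, input shape and learning rate come from the list; the optimizer, loss and the function name build_q_network are my assumptions, not the asker's actual code.

```python
from tensorflow.keras import layers, models, optimizers

def build_q_network(input_shape=(60, 84, 4), n_actions=2, lr=0.0001):
    """Sketch of the described DQN: three 'same'-padded conv layers,
    a flatten, a 512-unit dense layer and a linear 2-unit output."""
    model = models.Sequential([
        layers.Conv2D(32, (8, 8), strides=(4, 4), padding="same",
                      activation="relu", input_shape=input_shape),  # 4 stacked 60x84 frames
        layers.Conv2D(64, (4, 4), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), strides=(1, 1), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dense(n_actions),  # one Q-value per action, linear output
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=lr), loss="mse")
    return model
```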
Asked by MichaelAndroidNewbie


1 Answer

Providing a positive reward of 0.1 on every step as long as the game is not over may make the -1 game-over punishment almost irrelevant, particularly considering the discount factor you are using.
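
To make that concrete, here is a rough back-of-the-envelope calculation (mine, not part of the original answer): with gamma = 0.95, the discounted value of surviving forever at +0.1 per step is 0.1 / (1 - 0.95) = 2, while a -1 received k steps in the future only subtracts 0.95^k from that total.

```python
gamma = 0.95

def discounted_return(steps_until_game_over, gamma=0.95):
    """+0.1 for every surviving step, then -1 at game over."""
    survive = sum(0.1 * gamma**t for t in range(steps_until_game_over))
    return survive - gamma**steps_until_game_over

print(0.1 / (1 - gamma))      # 2.0   -> value of never failing
print(discounted_return(60))  # ~1.86 -> failing in 60 steps looks almost as good
print(discounted_return(10))  # ~0.20 -> only imminent failure stands out
```

Since in most states both actions keep the pole up for many more steps, the Q-values for both actions are dominated by that common baseline near 2 and end up almost identical.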

It is difficult to judge without looking at your source code, but I would initially suggest providing only a negative reward at the end of the game and removing the positive per-step reward.
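
As an illustration of that suggestion (a hypothetical wrapper written against the classic 4-tuple Gym API, not code from the question), the reward seen by the agent could be reshaped like this:

```python
import gym

class TerminalPenaltyWrapper(gym.Wrapper):
    """Drop the per-step bonus; only give -1 when the episode ends."""
    def step(self, action):
        obs, _, done, info = self.env.step(action)
        reward = -1.0 if done else 0.0
        return obs, reward, done, info

env = TerminalPenaltyWrapper(gym.make("CartPole-v1"))
```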

Answered by Juan Leni


