I'm reading through the original PPO paper and trying to match this up to the input parameters of the stable-baselines PPO2 model.
One thing I do not understand is the total_timesteps parameter in the learn method.
The paper mentions
One style of policy gradient implementation... runs the policy for T timesteps (where T is much less than the episode length)
While the stable-baselines documentation describes the total_timesteps parameter as
(int) The total number of samples to train on
Therefore I would think that T in the paper and total_timesteps in the documentation are the same parameter.
What I do not understand is the following:
Does total_timesteps always need to be less than or equal to the total number of available "frames" (samples) in an environment (say, if I had a finite number of frames like 1,000,000)? If so, why?
By setting total_timesteps to a number less than the number of available frames, what portion of the training data does the agent see? For example, if total_timesteps=1000, does the agent only ever see the first 1000 frames?
Is an episode defined as the total number of available frames, or is it defined as when the agent first "loses" / "dies"? If the latter, then how can you know in advance when the agent will die, to be able to set total_timesteps to a lesser value?
I'm still learning the terminology behind RL, so I hope I've been able to explain my question clearly above. Any help / tips would be very much welcomed.
total_timesteps is the total number of steps the agent will take during training, whatever the environment. These steps can span several episodes, so the value is not bounded by the length of a single episode. Let's say you have an environment with more than 1000 timesteps.
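To make the point about episodes concrete, here is a minimal sketch in plain Python. It does not use stable-baselines at all: `run_steps`, `max_episode_len`, and the 1% random termination chance are hypothetical, chosen only to illustrate that a fixed step budget can cover many episodes, with the environment being reset each time one ends.

```python
import random


def run_steps(total_timesteps, max_episode_len=200, seed=0):
    """Spend a fixed step budget and count how many episodes finish.

    Hypothetical toy loop: an episode ends either at random (the agent
    "dies") or once it reaches max_episode_len steps. It is then reset.
    """
    rng = random.Random(seed)
    steps_done = 0
    steps_in_episode = 0
    episodes_finished = 0

    while steps_done < total_timesteps:
        steps_done += 1
        steps_in_episode += 1
        # Assumed termination rule, for illustration only.
        done = rng.random() < 0.01 or steps_in_episode >= max_episode_len
        if done:
            episodes_finished += 1
            steps_in_episode = 0  # stands in for env.reset()

    return steps_done, episodes_finished


steps, episodes = run_steps(total_timesteps=1000)
print(f"{steps} steps spread over {episodes} finished episodes")
```

The step budget is fully used regardless of when individual episodes end, so there is no need to know in advance when the agent will "die".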
n_steps – (int) The number of steps to run for each environment per update (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
ent_coef – (float) Entropy coefficient for the loss calculation.
DummyVecEnv(env_fns) Creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process.
Stable Baselines is a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.
According to the stable-baselines source code, total_timesteps is used together with n_steps: the number of policy updates is calculated as follows:
n_updates = total_timesteps // self.n_batch
where n_batch is n_steps times the number of vectorised environments.
This means that if you have 1 environment running with n_steps set to 32 and total_timesteps = 25000, you will do 781 updates to your policy during the learn call (excluding epochs, as PPO can do several updates on a single batch).
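The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the formula quoted from the source code, not the library's own implementation; `count_updates` and the `n_envs` argument name are hypothetical helpers introduced here.

```python
def count_updates(total_timesteps, n_steps, n_envs=1):
    """Return (n_batch, n_updates) using the formula quoted above."""
    # Samples collected per update: n_steps from each parallel environment.
    n_batch = n_steps * n_envs
    # Integer division: a leftover smaller than n_batch gives no extra update.
    n_updates = total_timesteps // n_batch
    return n_batch, n_updates


# The example from the answer: 1 environment, n_steps = 32.
print(count_updates(25000, 32))            # (32, 781)

# The same budget with 4 parallel environments.
print(count_updates(25000, 32, n_envs=4))  # (128, 195)
```

In the first case 781 * 32 = 24992, so under this formula the remaining 8 timesteps do not make up a full batch.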
The lesson is:
Hope this helps!