Add the Bootstrapped Dual Policy Iteration algorithm for discrete action spaces #35
steckdenis wants to merge 2 commits into Stable-Baselines-Team:master
Conversation
Large experience buffer sizes lead to warnings about memory allocations, and on the Github CI, to memory allocation failures. So, small experience buffers are important.
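A minimal sketch of keeping the buffer small in a test, assuming BDPI exposes the usual SB3 off-policy arguments (the import path, `buffer_size`, and `learning_starts` here are assumptions, not taken from the PR diff):

```python
from sb3_contrib import BDPI  # import path assumed from this PR

# Keep the replay buffer small so CI does not run into memory-allocation failures.
model = BDPI(
    "MlpPolicy",
    "CartPole-v1",
    buffer_size=1_000,    # assumed SB3-style argument, deliberately small
    learning_starts=100,  # assumed SB3-style argument
    verbose=0,
)
model.learn(total_timesteps=500)
```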
.. note::
  Non-array spaces such as ``Dict`` or ``Tuple`` are not currently supported by any algorithm.
It seems that we forgot to update the contrib doc when adding support for Dict obs; the correct formulation should be the one from https://github.com/DLR-RM/stable-baselines3/blob/master/docs/guide/algos.rst
============ =========== ============ ================= =============== ================
TQC          ✔️           ❌            ❌                 ❌               ❌
QR-DQN       ❌            ✔️           ❌                 ❌               ❌
BDPI         ❌            ✔️           ❌                 ❌               ✔️
it says "multiprocessing" but I only see tests with one environment in the code...
and you probably need DLR-RM/stable-baselines3#439 to make it work with multiple envs.
I may have misunderstood what "Multiprocessing" in this document means, and what PPO in stable-baselines is doing.
BDPI distributes its training updates on several processes, even if only one environment is used. To me, this is multi-processing, comparable to PPO that uses MPI to distribute compute. But if PPO uses multiple environments to be able to do multiprocessing, then I understand that "Multiprocessing" in the documentation means "compatible with multiple envs", not just "fast because it uses several processes".
Should I add a note, or a second column to distinguish "multiple envs" from "multi-processing with one env"?
> To me, this is multi-processing, comparable to PPO that uses MPI to distribute compute.

PPO with MPI distributes both env and training compute (and is currently not implemented in SB3).

> then I understand that "Multiprocessing" in the documentation means "compatible with multiple envs"

Yes, that's the meaning (because we can use `SubprocVecEnv` to distribute data collection).

> Should I add a note, or a second column to distinguish "multiple envs" from "multi-processing with one env"?

No, I think we already have enough columns in this table.
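For reference, a minimal sketch of what that multiprocessing column refers to, namely parallel data collection with several environments through `SubprocVecEnv` (PPO from SB3 is used here as the example; running BDPI with multiple envs would additionally need DLR-RM/stable-baselines3#439, as noted above):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":
    # Four environments, each in its own worker process: this parallel data
    # collection is what "multiprocessing" means in the algorithms table.
    vec_env = make_vec_env("CartPole-v1", n_envs=4, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=10_000)
```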
LunarLander
^^^^^^^^^^^

Results for BDPI are available in `this Github issue <https://github.com/DLR-RM/stable-baselines3/issues/499>`_.
As you are aiming for sample efficiency, I would prefer a comparison to DQN and QR-DQN (with tuned hyperparameters; results are already linked in the documentation: #13).
Regarding which envs to compare to, please do the classic control ones + at least 2 Atari games (Pong, Breakout) using the zoo, so we can compare the results with QR-DQN and DQN.
I would also like a comparison of the trade-off between sample efficiency and training time (how much longer does it take to train?).
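One way to report that compromise would be to record wall-clock time alongside the learning curves; a rough sketch (DQN is used here as a stand-in, and the same pattern would apply to BDPI):

```python
import time

from stable_baselines3 import DQN

# Train for a fixed number of timesteps and report the elapsed wall-clock time,
# so sample efficiency can be weighed against training time.
model = DQN("MlpPolicy", "LunarLander-v2", verbose=0)
start = time.time()
model.learn(total_timesteps=100_000)
print(f"100K steps took {time.time() - start:.0f} s")
```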
# Update the critic (code taken from DQN)
with th.no_grad():
    qvA = criticA(replay_data.next_observations)
please use meaningful variable names
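For illustration, a self-contained sketch of how those lines might read with descriptive names; `critic_a`, the dummy batch, and the DQN-style target below are stand-ins, not code from the PR:

```python
import torch as th
import torch.nn as nn

# Toy stand-ins so the snippet runs on its own.
critic_a = nn.Linear(8, 4)            # maps an 8-dim observation to 4 action values
next_observations = th.randn(32, 8)   # dummy batch of next observations
rewards = th.randn(32, 1)
dones = th.zeros(32, 1)
gamma = 0.99

with th.no_grad():
    # Descriptive name instead of `qvA`: Q-values of the next observations.
    next_q_values = critic_a(next_observations)
    next_q_values, _ = next_q_values.max(dim=1, keepdim=True)
    # DQN-style bootstrap target, matching the "code taken from DQN" comment.
    target_q_values = rewards + (1 - dones) * gamma * next_q_values
```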
This sb3_contrib pull request follows an issue opened in stable-baselines3.
Description
This pull request adds the Bootstrapped Dual Policy Iteration algorithm to stable-baselines3-contrib, with documentation and updated unit tests (I was able to make them work by replacing logger.record, as in SB3 algos, with self.logger.record, as in the other SB3-contrib algos).
The original BDPI paper is https://arxiv.org/abs/1903.04193. The main reason I propose to have BDPI in stable-baselines3-contrib is that it is quite different from other algorithms: it heavily focuses on sample efficiency at the cost of compute efficiency (which is nice for slowly-sampled robotic tasks). The main results in the paper show that BDPI outperforms several state-of-the-art RL algorithms on many environments (Table is an environment that is difficult to explore, and Hallway comes from gym-miniworld and is a 3D environment).
I have reproduced these results with the BDPI algorithm I propose in this PR, on LunarLander. PPO, DQN and A2C were run with the default hyper-parameters used by rl-baselines3-zoo (I suppose the tuned ones?), for 8 random seeds each. The BDPI curve is also the result of 8 random seeds. I apologize for the truncated runs: BDPI and DQN only ran for 100K time-steps (for BDPI, this was due to time constraints, as it takes about an hour per run to perform 100K time-steps).
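For completeness, a minimal usage sketch, assuming BDPI follows the standard SB3 training API (the import path and constructor arguments are assumptions based on how other SB3-contrib algorithms are exposed):

```python
from sb3_contrib import BDPI  # import path assumed from this PR

model = BDPI("MlpPolicy", "LunarLander-v2", verbose=1)
model.learn(total_timesteps=100_000)  # roughly the 100K steps used in the runs above
model.save("bdpi_lunarlander")
```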
Types of changes
Checklist:
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass (required)