The current implementation of `policy_gradient_loss` is:

```python
log_pi_a_t = distributions.softmax().logprob(a_t, logits_t)
adv_t = jax.lax.select(use_stop_gradient, jax.lax.stop_gradient(adv_t), adv_t)
loss_per_timestep = -log_pi_a_t * adv_t
```
It's good that the gradients are already stopped around the advantages, but they should also be stopped around the actions to ensure an unbiased gradient estimator.
This is important when the actions are sampled as part of the training graph (MPO-style algorithms, imagination training with world models) rather than coming from a replay buffer, and the actor distribution implements a gradient for `sample()` (e.g. a reparameterised gaussian, or a straight-through categorical).
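A minimal sketch of the proposed fix, reusing the existing `use_stop_gradient` guard. The function signature, import path, and final mean reduction are assumptions for illustration, not taken from the snippet above:

```python
import jax
import jax.numpy as jnp
from rlax._src import distributions  # assumed import path for the snippet's `distributions`


def policy_gradient_loss(logits_t, a_t, adv_t, use_stop_gradient=True):
  # Proposed: also stop gradients through the sampled actions, so a
  # differentiable sample() cannot leak gradients into the estimator.
  a_t = jax.lax.select(use_stop_gradient, jax.lax.stop_gradient(a_t), a_t)
  log_pi_a_t = distributions.softmax().logprob(a_t, logits_t)
  # Unchanged: gradients are already stopped around the advantages.
  adv_t = jax.lax.select(use_stop_gradient, jax.lax.stop_gradient(adv_t), adv_t)
  loss_per_timestep = -log_pi_a_t * adv_t
  return jnp.mean(loss_per_timestep)  # assumed reduction
```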
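A quick way to see the bias: with a reparameterised gaussian sample, the score term cancels in the backward pass, and in this one-dimensional example the policy gradient collapses to exactly zero unless the sample is treated as data. All names below are hypothetical, for illustration only:

```python
import jax


def loss_leaky(mu, eps, adv):
  a_t = mu + eps                   # reparameterised sample: carries d(a_t)/d(mu) = 1
  log_pi = -0.5 * (a_t - mu) ** 2  # unit-variance gaussian log-prob (up to a constant)
  return -log_pi * adv


def loss_stopped(mu, eps, adv):
  a_t = jax.lax.stop_gradient(mu + eps)  # treat the sampled action as data
  log_pi = -0.5 * (a_t - mu) ** 2
  return -log_pi * adv


mu, eps, adv = 0.0, 0.1, 1.0
print(jax.grad(loss_leaky)(mu, eps, adv))    # 0.0  -- (a_t - mu) is constant in mu
print(jax.grad(loss_stopped)(mu, eps, adv))  # -0.1 -- the REINFORCE gradient -(a_t - mu) * adv
```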