Add support for DGPO (ICLR 2026) to GRPO #5102
YanqiDai wants to merge 20 commits into huggingface:main from
Conversation
a few comments.
also: `_generate_and_score_completions` is getting too dense with this PR. The DGAE/DQW + valid-token balancing logic and the existing multi-objective aggregation both add substantial branching/state, and it's becoming hard to follow and validate each transformation in isolation.
I think it makes sense to pull most of these out into separate helpers?
trl/trainer/grpo_trainer.py
Outdated
```python
        num_questions, device=advantages.device, dtype=advantages.dtype
    )
    if num_zero_variance_questions < num_questions:
        # For mean accuracy 0 (all wrong) or NaN, set difficulty to -1 so they get less weight
```
Suggested change:
```diff
- # For mean accuracy 0 (all wrong) or NaN, set difficulty to -1 so they get less weight
+ # mean accuracy == 0 (all wrong) or NaN are remapped to 1.0 before softmax so they get less weight
```
style nit
but also, doesn't this imply rewards have to be >0?
Thanks. This check and remapping only apply to the accuracy reward, whose range we assume to be [0, 1] by default.
Updated the DGPO section to clarify its mechanisms and usage in TRL, including details on DGAE and DQW.
Removed DGPO section and its related details from the documentation.
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Updated comment to clarify handling of mean accuracy for questions.
We have adopted all the suggestions (with some modifications). In particular, we rewrote the code logic and moved the main functional pieces into separate helper functions: `_compute_advantages_with_dgae`, `_compute_valid_token_balancing_ratios`, and `_compute_dqw_weights`. The final code also passes our tests.
trl/trainer/grpo_trainer.py
Outdated
```python
if self.use_dgpo_dgae:
    advantages = self._compute_advantages_with_dgae(
        rewards, num_generations
    )
```
this split feels a bit asymmetric: the DGAE path goes into `_compute_advantages_with_dgae` while the standard advantage computation (center by mean, divide by std) stays inline. When reading the code you now have to jump to a separate method for one path but not the other, even though they're doing the same conceptual thing.
I think there's motivation for a larger refactoring of the advantage calculations (lines 1844-1890), but I'd like a maintainer's thoughts on this. My suggestion would be to at least pull the `std_rewards` calculation out into a helper, but I'm open to a larger refactor as well.
Thank you for your suggestion. We found that the implementation of `_compute_advantages_with_dgae` could be simplified with some minor adjustments. After making these corrections, we implemented it directly in `_generate_and_score_completions` (it only takes 3 lines of code, just as concise as using `std_rewards`).
Co-authored-by: LeonEricsson <70749762+LeonEricsson@users.noreply.github.com>
Removed the repeated use_bias_correction_kl configuration option.
…nerate_and_score_completions
LeonEricsson
left a comment
lgtm now. the only thing I would consider is some refactoring of the advantage calculation, as discussed here
needs a maintainer's approval before merging
What does this PR do?
Add DGPO (Difficulty-Aware Group Policy Optimization) support to GRPO.
References: MathForge (ICLR 2026) GitHub, Paper.
Before submitting

- Did you read the contributor guideline, Pull Request section? Yes
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. No
- Did you make sure to update the documentation with your changes? Updated `docs/source/grpo_trainer.md` and `docs/source/paper_index.md`.
- Did you write any new necessary tests? Added `test_training_dgpo` in `tests/test_grpo_trainer.py`.

Motivation
This PR integrates DGPO (Difficulty-Aware Group Policy Optimization) from MathForge (ICLR 2026, paper) into the GRPO trainer. DGPO improves group-based RL by:

- DGAE: normalizing advantages by the mean absolute deviation (MAD) rather than the standard deviation.
- DQW: reweighting advantages by per-question difficulty so that harder questions contribute more.
- Valid token-level loss averaging: (a) only valid samples (those with `std_rewards != 0`) contribute to the effective normalizer, and (b) multi-GPU training is balanced by accounting for per-process valid token counts. This yields a proper token-level average loss across valid data and devices.

These options are useful for mathematical reasoning and other settings where question difficulty is heterogeneous and reward variance can be zero for some groups.
Changes
Configuration (`grpo_config.py`)

- `use_dgpo_dgae` (bool, default `False`): When True and `scale_rewards != "none"`, advantages are normalized by MAD instead of standard deviation (DGAE).
- `use_dgpo_dqw` (bool, default `False`): When True, advantages are multiplied by per-question difficulty weights (DQW). Zero-variance questions get weight 1; others are weighted by a softmax over negative mean accuracy so that harder questions get higher weight; weights sum to `num_questions`.
- `dgpo_dqw_temp` (float, default `2.0`): Temperature for the DQW softmax.
- `dgpo_dqw_acc_reward_index` (int, default `0`): Index of the accuracy reward in `reward_funcs` used to compute per-question mean accuracy for DQW.

All new parameters are documented with a reference to the MathForge paper (ICLR 2026).
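With this PR's branch installed, enabling all four options could look like the following sketch. The option names come from this PR and are not part of released TRL; `output_dir` is an arbitrary placeholder.

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="dgpo-run",
    use_dgpo_dgae=True,           # DGAE: MAD-normalized advantages
    use_dgpo_dqw=True,            # DQW: difficulty-aware question weights
    dgpo_dqw_temp=2.0,            # softmax temperature for DQW
    dgpo_dqw_acc_reward_index=0,  # position of the accuracy reward in reward_funcs
)
```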
Trainer (`grpo_trainer.py`)

DGAE

In both the `sum_then_normalize` and `normalize_then_sum` branches, when `use_dgpo_dgae` is True and rewards are scaled, the advantage denominator uses MAD instead of std: `advantage = (reward - mean) / (MAD + eps)`.
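A minimal sketch of the DGAE normalization described above; the function name, variable names, and `eps` value here are illustrative, not the trainer's actual helpers:

```python
import torch


def dgae_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Difficulty-aware advantage estimation: divide centered rewards by the
    per-question mean absolute deviation (MAD) instead of the std.

    rewards: shape (num_questions, num_generations).
    """
    mean = rewards.mean(dim=1, keepdim=True)
    centered = rewards - mean
    mad = centered.abs().mean(dim=1, keepdim=True)  # mean absolute deviation
    return centered / (mad + eps)


# one question with four rollouts and binary accuracy rewards
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
adv = dgae_advantages(rewards)
```

For binary rewards like this, MAD and std happen to coincide; they diverge on skewed reward distributions, where MAD is less sensitive to outliers.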
DQW

After advantage computation (and after the valid-token scaling described below), when `use_dgpo_dqw` is True:

- Per-question mean accuracies of the reward at index `dgpo_dqw_acc_reward_index` are computed.
- Questions with non-zero reward variance are weighted by `(num_questions - num_zero_variance) * softmax(-mean_acc / dgpo_dqw_temp)`; questions with mean accuracy 0 or NaN are treated as "easiest" (mean set to 1 in the softmax) so they receive less weight.
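The weighting step above can be sketched as follows (hypothetical function and variable names, assuming accuracy rewards in [0, 1] as stated earlier in the thread):

```python
import torch


def dqw_weights(mean_acc: torch.Tensor, is_zero_var: torch.Tensor,
                temp: float = 2.0) -> torch.Tensor:
    """Difficulty-aware question weights.

    mean_acc: per-question mean accuracy, shape (num_questions,).
    is_zero_var: True where the question's reward variance is zero.
    Zero-variance questions keep weight 1; the rest share a softmax over
    negative accuracy, so the total sums to num_questions.
    """
    num_questions = mean_acc.numel()
    num_zero_var = int(is_zero_var.sum())
    weights = torch.ones(num_questions)
    valid = ~is_zero_var
    if num_zero_var < num_questions:
        acc = mean_acc[valid].clone()
        # all-wrong (mean 0) or NaN questions are remapped to 1.0 ("easiest")
        acc[(acc == 0) | torch.isnan(acc)] = 1.0
        soft = torch.softmax(-acc / temp, dim=0)
        weights[valid] = (num_questions - num_zero_var) * soft
    return weights


mean_acc = torch.tensor([0.25, 0.75, 1.0])
is_zero_var = torch.tensor([False, False, True])  # third question: zero variance
w = dqw_weights(mean_acc, is_zero_var)  # harder questions get larger weights
```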
Valid token-level loss averaging (when `use_dgpo_dgae` or `use_dgpo_dqw` is True)

- A sample is valid when `std_rewards != 0` (i.e. `~is_std_zero`).
- `completion_length = completion_mask.sum(dim=1)` is gathered across processes to get `gathered_completion_length`.
- `global_balancing_ratio = num_processes * local_completion_length_sum / global_completion_length_sum` is computed (used later).
- If any sample is invalid, `zero_mask_ratio = global_completion_length_sum / valid_completion_length_sum`, where `valid_completion_length_sum` is the sum of completion lengths over valid samples only; otherwise `zero_mask_ratio = 1.0`.
- Advantages are scaled by `zero_mask_ratio` so that the effective normalizer ignores invalid (zero-variance) samples.
- Advantages are also scaled by `global_balancing_ratio` so that the loss is balanced across processes by valid token count (valid token-level averaging across devices).
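A toy, single-process sketch of the two ratios. In the real trainer the lengths are gathered across processes; here a flat tensor stands in for the gathered result, and all names are hypothetical:

```python
import torch


def token_balancing_ratios(gathered_lengths: torch.Tensor,
                           gathered_is_std_zero: torch.Tensor,
                           local_slice: slice,
                           num_processes: int):
    """Compute the per-process balancing ratio and the zero-mask ratio.

    gathered_lengths: completion lengths from all processes, concatenated.
    gathered_is_std_zero: True where the sample's reward std is zero (invalid).
    local_slice: which entries belong to the current process.
    """
    global_sum = gathered_lengths.sum()
    local_sum = gathered_lengths[local_slice].sum()
    # balance each process's loss by its share of the global token count
    global_balancing_ratio = num_processes * local_sum / global_sum
    if bool(gathered_is_std_zero.any()):
        # inflate the normalizer so invalid samples are effectively ignored
        valid_sum = gathered_lengths[~gathered_is_std_zero].sum()
        zero_mask_ratio = global_sum / valid_sum
    else:
        zero_mask_ratio = torch.tensor(1.0)
    return global_balancing_ratio, zero_mask_ratio


# two processes with two completions each; the second sample is invalid
lengths = torch.tensor([10.0, 30.0, 20.0, 40.0])
is_std_zero = torch.tensor([False, True, False, False])
g, z = token_balancing_ratios(lengths, is_std_zero, slice(0, 2), num_processes=2)
```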
Tests

- `test_training_dgpo` in `tests/test_grpo_trainer.py`: runs a short training with `use_dgpo_dgae=True`, `use_dgpo_dqw=True`, `dgpo_dqw_temp=2.0`, and `dgpo_dqw_acc_reward_index=0`, and checks that `train_loss` is recorded.