
Add support for MaxRL #5026

Open
catherinelee274 wants to merge 17 commits into huggingface:main from catherinelee274:clee_maxrl

Conversation


@catherinelee274 catherinelee274 commented Feb 9, 2026

What does this PR do?

Adds MaxRL, a variant of GRPO with p-normalization.
Fixes #5025

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Changes core advantage normalization logic in GRPOTrainer, which can affect training stability/behavior; however it is gated behind an explicit new (experimental) config option and covered by new tests.

Overview
Adds an experimental MaxRL variant to GRPO by introducing scale_rewards="mean", which normalizes group advantages by the group mean reward (supported only when multi_objective_aggregation="sum_then_normalize").

Updates GRPOConfig documentation/help text, extends GRPOTrainer’s advantage-scaling logic and validation to accept the new mode, and adds a new examples/scripts/maxrl.py training script plus tests covering the new scaling option and a dedicated MaxRL training smoke test.

Written by Cursor Bugbot for commit ec6721e. This will update automatically on new commits.

@catherinelee274 catherinelee274 changed the title Add support for MaxRL [WIP] Add support for MaxRL Feb 17, 2026
@catherinelee274 catherinelee274 marked this pull request as ready for review February 17, 2026 06:25
@LeonEricsson
Collaborator

Doesn't MaxRL reduce to simply changing the advantage normalization denominator from std(r) to mean(r)?

```
# GRPO
A_i = (r_i - mean(r)) / (std(r) + eps)

# MaxRL
A_i = (r_i - mean(r)) / (mean(r) + eps)
```

If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
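As a quick sanity check of the two normalizations above, here is a NumPy sketch; the function names are illustrative, not TRL's API:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """GRPO: center by the group mean, scale by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def maxrl_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """MaxRL: center by the group mean, scale by the group mean instead."""
    return (rewards - rewards.mean()) / (rewards.mean() + eps)

# Binary rewards for a group of 4 completions, one correct.
r = np.array([1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(r))
print(maxrl_advantages(r))
```

With a low group mean (hard question), MaxRL amplifies the positive advantage of the rare correct completion much more than std-scaling does.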


@LeonEricsson LeonEricsson left a comment


Would you mind writing a paper index section for MaxRL as well?

- Remove `test_test_maxrl_advantage_normalization` and `test_maxrl_advantage_zero_mean`, as they do not test TRL code.
- `test_maxrl_training_conversational`

@LeonEricsson LeonEricsson left a comment


Final comments, then I'm satisfied. This needs a maintainer's approval before merging.

- Remove the `# MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps)` comment.

- Remove the comment in `grpo_trainer`, since we already have a paper index:

```python
advantages = rewards - mean_grouped_rewards
if self.scale_rewards != "none":
    if self.scale_rewards == "mean":
        advantages = advantages / (mean_grouped_rewards + 1e-4)
```


Negative mean reward silently inverts advantage signs

Medium Severity

When scale_rewards="mean", advantages are divided by mean_grouped_rewards + 1e-4. Unlike std_rewards (always ≥ 0), mean_grouped_rewards can be negative when rewards are not binary (e.g., from a reward model). If the group mean is below -1e-4, the denominator becomes negative, silently flipping all advantage signs — the model would then reinforce bad completions and penalize good ones. The docs say this is for binary rewards, but no runtime validation enforces that constraint. Using abs(mean_grouped_rewards) or clamping to non-negative in the denominator would prevent this silent inversion.
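One possible guard, sketched with NumPy; `scale_by_mean` is a hypothetical helper, not the PR's code, and clamping (rather than `abs`) is just one of the two suggested options:

```python
import numpy as np

def scale_by_mean(advantages, mean_grouped_rewards, eps=1e-4):
    """Divide by a clamped, non-negative denominator so a negative group
    mean cannot silently flip the sign of every advantage."""
    denom = np.maximum(mean_grouped_rewards, 0.0) + eps
    return advantages / denom

# A reward-model group with a negative mean: naive division by (mean + eps)
# would flip both signs; the clamped version preserves them.
rewards = np.array([-1.0, -2.0])
advantages = rewards - rewards.mean()  # [0.5, -0.5]
print(scale_by_mean(advantages, rewards.mean()))
```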


```python
def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} content from text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None
```


Regex fails to extract boxed content with nested braces

High Severity

The extract_boxed regex r"\\boxed\{([^}]*)\}" uses [^}]* which stops at the first closing brace. Math answers almost always contain nested braces (e.g. \boxed{\frac{1}{3}}), and this regex would extract only \frac{1 instead of \frac{1}{3}. Since the accuracy_reward function relies on extract_boxed for answer matching, correct answers with fractions, exponents, or any nested LaTeX will never receive a reward of 1.0, effectively making the training signal broken for the majority of math problems.



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


```python
"scaling by the standard deviation introduces a question-level difficulty bias. "
"`'mean'` (experimental): rewards are scaled by the group mean, as in MaxRL "
"(https://arxiv.org/abs/2602.02710). Only supported with multi_objective_aggregation='sum_then_normalize'."
```


Missing validation for unsupported scale_rewards and aggregation combination

Medium Severity

The documentation and config metadata explicitly state scale_rewards="mean" is "only supported with multi_objective_aggregation='sum_then_normalize'", but no validation enforces this constraint. When normalize_then_sum is used, the scale_rewards parameter is completely ignored — the mean-scaling logic only exists in the sum_then_normalize branch. A user who sets scale_rewards="mean" with normalize_then_sum gets silent fallback to standard std-normalization, with no error or warning.
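A minimal sketch of the missing check; in TRL it would presumably live in `GRPOConfig.__post_init__`, but the standalone function and its name here are hypothetical:

```python
# Hypothetical standalone version of the missing validation
# (field names follow the PR's config options, not the actual TRL source).
def validate_scale_rewards(scale_rewards: str, multi_objective_aggregation: str) -> None:
    if scale_rewards == "mean" and multi_objective_aggregation != "sum_then_normalize":
        raise ValueError(
            "scale_rewards='mean' is only supported with "
            "multi_objective_aggregation='sum_then_normalize', got "
            f"{multi_objective_aggregation!r}."
        )

validate_scale_rewards("mean", "sum_then_normalize")  # ok, no error
```

Raising at config time turns the silent fallback described above into an immediate, actionable error.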



```python
# dependencies = [
#     "trl",
#     "math-verify",
#     "latex2sympy2_extended",
```


Unused dependencies declared in script metadata

Low Severity

The inline script metadata declares math-verify and latex2sympy2_extended as dependencies, but neither package is imported or used anywhere in the script. The script uses a simple regex-based extract_boxed function instead. These were likely copied from another example script. Anyone running via uv run would unnecessarily install these packages.



Development

Successfully merging this pull request may close these issues.

Maximum Likelihood Reinforcement Learning
