
Add support for MaxRL #5026

Open
catherinelee274 wants to merge 17 commits into huggingface:main from catherinelee274:clee_maxrl

Conversation


@catherinelee274 catherinelee274 commented Feb 9, 2026

What does this PR do?

Adds MaxRL, a variant of GRPO with p-normalization.
Fixes #5025

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Changes core advantage normalization logic in GRPOTrainer, which can affect training stability/behavior; however it is gated behind an explicit new (experimental) config option and covered by new tests.

Overview
Adds an experimental MaxRL variant to GRPO by introducing scale_rewards="mean", which normalizes group advantages by the group mean reward (supported only when multi_objective_aggregation="sum_then_normalize").

Updates GRPOConfig documentation/help text, extends GRPOTrainer’s advantage-scaling logic and validation to accept the new mode, and adds a new examples/scripts/maxrl.py training script plus tests covering the new scaling option and a dedicated MaxRL training smoke test.

Written by Cursor Bugbot for commit ec6721e. This will update automatically on new commits.

@catherinelee274 catherinelee274 changed the title Add support for MaxRL [WIP] Add support for MaxRL Feb 17, 2026
@catherinelee274 catherinelee274 marked this pull request as ready for review February 17, 2026 06:25
@LeonEricsson
Collaborator

Doesn't MaxRL reduce to simply changing the advantage normalization denominator from std(r) to mean(r)?

```
# GRPO
A_i = (r_i - mean(r)) / (std(r) + eps)

# MaxRL
A_i = (r_i - mean(r)) / (mean(r) + eps)
```

If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
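As a quick sanity check of the two normalizations above, here is a NumPy sketch; the function names are illustrative, not TRL's API:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """GRPO: center by the group mean, scale by the group std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def maxrl_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """MaxRL: center by the group mean, scale by the group mean instead."""
    return (rewards - rewards.mean()) / (rewards.mean() + eps)

# Binary rewards for a group of 4 completions, one correct.
r = np.array([1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(r))
print(maxrl_advantages(r))
```

With a low group mean (hard question), MaxRL amplifies the positive advantage of the rare correct completion much more than std-scaling does.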


@LeonEricsson LeonEricsson left a comment


Would you mind writing a paper index section for MaxRL as well?

- Remove `test_test_maxrl_advantage_normalization` and `test_maxrl_advantage_zero_mean`, as they do not test TRL code.
- `test_maxrl_training_conversational`

@LeonEricsson LeonEricsson left a comment


Final comments, then I'm satisfied. This needs a maintainer's approval before merging.

- Remove the `# MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps)` comment.

- Remove the comment in `grpo_trainer`, since we already have a paper index:

```python
advantages = rewards - mean_grouped_rewards
if self.scale_rewards != "none":
    if self.scale_rewards == "mean":
        advantages = advantages / (mean_grouped_rewards + 1e-4)
```


Negative mean reward silently inverts advantage signs

Medium Severity

When scale_rewards="mean", advantages are divided by mean_grouped_rewards + 1e-4. Unlike std_rewards (always ≥ 0), mean_grouped_rewards can be negative when rewards are not binary (e.g., from a reward model). If the group mean is below -1e-4, the denominator becomes negative, silently flipping all advantage signs — the model would then reinforce bad completions and penalize good ones. The docs say this is for binary rewards, but no runtime validation enforces that constraint. Using abs(mean_grouped_rewards) or clamping to non-negative in the denominator would prevent this silent inversion.
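One possible guard, sketched with NumPy; `scale_by_mean` is a hypothetical helper, not the PR's code, and clamping (rather than `abs`) is just one of the two suggested options:

```python
import numpy as np

def scale_by_mean(advantages, mean_grouped_rewards, eps=1e-4):
    """Divide by a clamped, non-negative denominator so a negative group
    mean cannot silently flip the sign of every advantage."""
    denom = np.maximum(mean_grouped_rewards, 0.0) + eps
    return advantages / denom

# A reward-model group with a negative mean: naive division by (mean + eps)
# would flip both signs; the clamped version preserves them.
rewards = np.array([-1.0, -2.0])
advantages = rewards - rewards.mean()  # [0.5, -0.5]
print(scale_by_mean(advantages, rewards.mean()))
```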


```python
def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} content from text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None
```


Regex fails to extract boxed content with nested braces

High Severity

The extract_boxed regex r"\\boxed\{([^}]*)\}" uses [^}]* which stops at the first closing brace. Math answers almost always contain nested braces (e.g. \boxed{\frac{1}{3}}), and this regex would extract only \frac{1 instead of \frac{1}{3}. Since the accuracy_reward function relies on extract_boxed for answer matching, correct answers with fractions, exponents, or any nested LaTeX will never receive a reward of 1.0, effectively making the training signal broken for the majority of math problems.



@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


```python
"scaling by the standard deviation introduces a question-level difficulty bias. "
"`'mean'` (experimental): rewards are scaled by the group mean, as in MaxRL "
"(https://arxiv.org/abs/2602.02710). Only supported with multi_objective_aggregation='sum_then_normalize'."
```


Missing validation for unsupported scale_rewards and aggregation combination

Medium Severity

The documentation and config metadata explicitly state scale_rewards="mean" is "only supported with multi_objective_aggregation='sum_then_normalize'", but no validation enforces this constraint. When normalize_then_sum is used, the scale_rewards parameter is completely ignored — the mean-scaling logic only exists in the sum_then_normalize branch. A user who sets scale_rewards="mean" with normalize_then_sum gets silent fallback to standard std-normalization, with no error or warning.
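A minimal sketch of the missing check; in TRL it would presumably live in `GRPOConfig.__post_init__`, but the standalone function and its name here are hypothetical:

```python
# Hypothetical standalone version of the missing validation
# (field names follow the PR's config options, not the actual TRL source).
def validate_scale_rewards(scale_rewards: str, multi_objective_aggregation: str) -> None:
    if scale_rewards == "mean" and multi_objective_aggregation != "sum_then_normalize":
        raise ValueError(
            "scale_rewards='mean' is only supported with "
            "multi_objective_aggregation='sum_then_normalize', got "
            f"{multi_objective_aggregation!r}."
        )

validate_scale_rewards("mean", "sum_then_normalize")  # ok, no error
```

Raising at config time turns the silent fallback described above into an immediate, actionable error.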



```python
# dependencies = [
#     "trl",
#     "math-verify",
#     "latex2sympy2_extended",
```


Unused dependencies declared in script metadata

Low Severity

The inline script metadata declares math-verify and latex2sympy2_extended as dependencies, but neither package is imported or used anywhere in the script. The script uses a simple regex-based extract_boxed function instead. These were likely copied from another example script. Anyone running via uv run would unnecessarily install these packages.



Development

Successfully merging this pull request may close these issues.

Maximum Likelihood Reinforcement Learning
