Conversation
Doesn't MaxRL reduce to simply changing the advantage normalization denominator from the group standard deviation to the group mean? If so, this fits naturally as a flag in the existing GRPO trainer rather than a dedicated experimental module.
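The equivalence this comment points at can be shown with a toy sketch — illustrative binary rewards (not from the PR), assuming the same `1e-4` epsilon used in the diff below:

```python
from statistics import mean, stdev

# Toy group of binary rewards (illustrative values only)
rewards = [1.0, 0.0, 1.0, 1.0]
m, s = mean(rewards), stdev(rewards)
centered = [r - m for r in rewards]

grpo_adv = [a / (s + 1e-4) for a in centered]   # standard GRPO: divide by group std
maxrl_adv = [a / (m + 1e-4) for a in centered]  # MaxRL: divide by group mean instead
```

Only the denominator differs, which is what makes a `scale_rewards` flag a natural home for the change.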
LeonEricsson
left a comment
Would you mind writing a paper index section for MaxRL as well?
Remove `test_maxrl_advantage_normalization`, `test_maxrl_advantage_zero_mean`, and `test_maxrl_training_conversational`, as they do not test TRL code.
LeonEricsson
left a comment
Final comments, then I'm satisfied. Needs a maintainer's approval before merging.
- Remove the `# MaxRL: A_i = (r_i - mean(r)) / (mean(r) + eps)` comment in grpo_trainer, since we already have a paper index entry for it.
```python
advantages = rewards - mean_grouped_rewards
if self.scale_rewards != "none":
    if self.scale_rewards == "mean":
        advantages = advantages / (mean_grouped_rewards + 1e-4)
```
Negative mean reward silently inverts advantage signs
Medium Severity
When scale_rewards="mean", advantages are divided by mean_grouped_rewards + 1e-4. Unlike std_rewards (always ≥ 0), mean_grouped_rewards can be negative when rewards are not binary (e.g., from a reward model). If the group mean is below -1e-4, the denominator becomes negative, silently flipping all advantage signs — the model would then reinforce bad completions and penalize good ones. The docs say this is for binary rewards, but no runtime validation enforces that constraint. Using abs(mean_grouped_rewards) or clamping to non-negative in the denominator would prevent this silent inversion.
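The inversion is easy to reproduce with plain numbers — a minimal sketch with hypothetical reward-model scores, comparing the current denominator against the clamped alternative suggested above:

```python
from statistics import mean

# Hypothetical reward-model scores whose group mean is negative
rewards = [-1.0, -0.5, -2.0]
m = mean(rewards)                     # ~ -1.1667, so m + 1e-4 is negative
centered = [r - m for r in rewards]

naive = [a / (m + 1e-4) for a in centered]        # current path: every sign flips
safe = [a / max(abs(m), 1e-4) for a in centered]  # clamped denominator: signs kept
```

With the naive denominator, the best completion in the group receives a negative advantage and the worst a positive one, which is exactly the silent inversion described.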
```python
def extract_boxed(text: str) -> str | None:
    """Return the last \\boxed{...} content from text, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None
```
Regex fails to extract boxed content with nested braces
High Severity
The extract_boxed regex r"\\boxed\{([^}]*)\}" uses [^}]* which stops at the first closing brace. Math answers almost always contain nested braces (e.g. \boxed{\frac{1}{3}}), and this regex would extract only \frac{1 instead of \frac{1}{3}. Since the accuracy_reward function relies on extract_boxed for answer matching, correct answers with fractions, exponents, or any nested LaTeX will never receive a reward of 1.0, effectively making the training signal broken for the majority of math problems.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```diff
-"scaling by the standard deviation introduces a question-level difficulty bias."
+"scaling by the standard deviation introduces a question-level difficulty bias. "
+"`'mean'` (experimental): rewards are scaled by the group mean, as in MaxRL "
+"(https://arxiv.org/abs/2602.02710). Only supported with multi_objective_aggregation='sum_then_normalize'."
```
Missing validation for unsupported scale_rewards and aggregation combination
Medium Severity
The documentation and config metadata explicitly state scale_rewards="mean" is "only supported with multi_objective_aggregation='sum_then_normalize'", but no validation enforces this constraint. When normalize_then_sum is used, the scale_rewards parameter is completely ignored — the mean-scaling logic only exists in the sum_then_normalize branch. A user who sets scale_rewards="mean" with normalize_then_sum gets silent fallback to standard std-normalization, with no error or warning.
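The missing check could live in the config's validation step — a sketch under the assumption that the field names match the PR (`validate_scale_rewards` itself is a hypothetical helper):

```python
def validate_scale_rewards(scale_rewards: str, multi_objective_aggregation: str) -> None:
    """Reject the documented-but-unenforced combination at config time."""
    if scale_rewards == "mean" and multi_objective_aggregation != "sum_then_normalize":
        raise ValueError(
            "scale_rewards='mean' is only supported with "
            "multi_objective_aggregation='sum_then_normalize'; got "
            f"{multi_objective_aggregation!r}."
        )
```

Raising at construction time turns the current silent fallback into an explicit error the user can act on.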
Additional Locations (1)
```python
# dependencies = [
#     "trl",
#     "math-verify",
#     "latex2sympy2_extended",
```
Unused dependencies declared in script metadata
Low Severity
The inline script metadata declares math-verify and latex2sympy2_extended as dependencies, but neither package is imported or used anywhere in the script. The script uses a simple regex-based extract_boxed function instead. These were likely copied from another example script. Anyone running via uv run would unnecessarily install these packages.


What does this PR do?
Adds MaxRL, a variant of GRPO that normalizes group advantages by the group mean reward instead of the standard deviation.
Fixes #5025
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Note
Medium Risk
Changes core advantage normalization logic in GRPOTrainer, which can affect training stability/behavior; however, it is gated behind an explicit new (experimental) config option and covered by new tests.

Overview
Adds an experimental MaxRL variant to GRPO by introducing scale_rewards="mean", which normalizes group advantages by the group mean reward (supported only when multi_objective_aggregation="sum_then_normalize"). Updates GRPOConfig documentation/help text, extends GRPOTrainer's advantage-scaling logic and validation to accept the new mode, and adds a new examples/scripts/maxrl.py training script plus tests covering the new scaling option and a dedicated MaxRL training smoke test.

Written by Cursor Bugbot for commit ec6721e. This will update automatically on new commits.