Fix double prepare_grads / loss-scaler-double-update in train_one_step#1842
Open
jthomy wants to merge 1 commit into THUDM:main from
Conversation
When args.check_for_nan_in_loss_and_grad=False, train_one_step called
optimizer.prepare_grads() and then optimizer.step(). Megatron's
MixedPrecisionOptimizer.step() calls prepare_grads() internally, so
prepare_grads ran twice per step. With fp16 + a grad scaler, that:
- advanced grad_scaler.update() twice per step, breaking dynamic loss
scaling cadence;
- on configurations where model_param.main_grad persists (typical DDP),
re-copied scaled grads into main grads and unscaled them again.
Additionally, when optimizer.step() returned (False, None, None) on a
grad-scaler overflow, the subsequent assert update_successful fired even
though that's a legitimate skipped-step signal, not a programming error.
Call optimizer.step() exactly once and derive the overflow signal from
its return values, skipping the LR scheduler advance and emitting a
warning when an overflow occurs. Move the MTP CI gradient check before
optimizer.step() since step() modifies gradients.
When training with fp16 (and therefore args.check_for_nan_in_loss_and_grad=False), train_one_step calls optimizer.prepare_grads() and then optimizer.step(). Megatron's MixedPrecisionOptimizer.step() calls prepare_grads() internally, so prepare_grads ran twice per step, unscaling the gradients twice and producing extremely small fp16 gradients.
Since optimizer.step() automatically skips the update when an inf is detected, we can run optimizer.step() first and check afterwards whether the step was valid. The behavior for non-fp16 is unchanged.
Results with GLM 4.7 30B Flash:
BF16 (for comparison): [loss plot]
FP16 without fix (the loss scale dropping to zero means the grad scaler was too high and is being reduced): [loss plot]
FP16 with fix: [loss plot]