Fix Muon optimizer conflict with gradient clipping in ZeRO 1/2 by fy817 · Pull Request #7776 · deepspeedai/DeepSpeed

fy817 · 2026-01-12T11:44:09Z

Summary

This PR fixes a critical logic error in how gradient clipping interacts with the Muon optimizer within the ZeRO Stage 1 and 2 implementations.

Currently, the muon_update (orthogonalization) is applied before gradient clipping. Since Muon's orthogonalization naturally changes the gradient magnitude (often resulting in norms significantly larger than the default clip_grad of 1.0), the subsequent standard clipping mechanism incorrectly identifies these valid updates as exploding gradients. This results in aggressive, erroneous clipping that severely degrades or breaks training.

This fix applies gradient clipping to the raw gradients before the Muon update, so that the orthogonalization step receives properly clipped inputs, and it skips the second clipping pass during the optimizer step to avoid double-clipping.

Root Cause

In the current implementation, muon_update is invoked inside get_flat_partition to leverage the shaped gradients required by the algorithm. However, get_flat_partition is called before the standard global gradient clipping phase.

Muon Characteristics: The Muon optimizer performs an orthogonalization step (via Newton-Schulz iteration), which typically results in gradient tensors with Frobenius norms much larger than 1.0.
Clipping Conflict: DeepSpeed defaults to a clip_grad value of 1.0. Because the gradients have already been processed by Muon by the time clipping occurs, their norms are naturally high.
Result: The global clipper perceives these high norms as gradient explosion and scales the gradients down by a massive factor (e.g., dividing by 100+), effectively neutralizing the Muon update and causing training failure.
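
To make the scale of the problem concrete, here is a rough back-of-the-envelope sketch. It assumes an idealized orthogonalization in which an m x n output has all min(m, n) singular values equal to 1, giving a Frobenius norm of sqrt(min(m, n)). Real Newton-Schulz output is only approximately orthogonal, and a Muon implementation may apply further shape-dependent scaling. The layer shapes and helper names are hypothetical.

```python
import math

def ideal_orthogonal_fro_norm(rows: int, cols: int) -> float:
    """Frobenius norm of an idealized semi-orthogonal rows x cols matrix.

    Assumes all min(rows, cols) singular values are exactly 1.
    """
    return math.sqrt(min(rows, cols))

def clip_coefficient(total_norm: float, clip_grad: float, eps: float = 1e-6) -> float:
    """Factor that standard norm clipping multiplies gradients by (capped at 1)."""
    return min(1.0, clip_grad / (total_norm + eps))

# Hypothetical model: 24 hidden weight matrices of shape 1024 x 4096.
shapes = [(1024, 4096)] * 24
global_norm = math.sqrt(sum(ideal_orthogonal_fro_norm(r, c) ** 2 for r, c in shapes))
coef = clip_coefficient(global_norm, clip_grad=1.0)

print(f"global norm after orthogonalization: {global_norm:.1f}")
print(f"clip coefficient with clip_grad=1.0: {coef:.5f}")
```

Under these assumptions the clipper shrinks every update by more than two orders of magnitude, even though nothing is exploding.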

Fix

The fix rearranges the clipping logic for parameters utilizing the Muon optimizer:

Pre-computation: In independent_gradient_partition_epilogue, we now pre-calculate the unscaled global gradient norm (or the specific norm required for clipping) before the flattening process begins.
Pre-Update Clipping: This calculated norm is passed into get_flat_partition. Inside this method, we calculate the clip_factor and apply it (grad = grad / clip_factor) explicitly to the gradients before passing them to muon_update. This ensures Muon operates on clipped, stable gradients.
Skip Redundant Clipping: In the step method, we modified the call to unscale_and_clip_grads to accept a skip_clipping=True flag. For Muon parameter groups, we set this flag to true, ensuring that we only perform unscaling (handling loss scale) without re-applying gradient clipping, as it was already handled in the previous step. Standard AdamW parameters retain the original behavior.
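
The ordering described in the three steps above can be sketched in plain Python. This is not the DeepSpeed code: `clip_factor`, `fake_muon_update`, and `unscale_and_clip` are hypothetical stand-ins, the "Muon" step is replaced by a simple rescale to unit norm, and the loss scale is fixed at 1.0 to keep the arithmetic readable.

```python
from typing import List

def clip_factor(total_norm: float, clip_grad: float, eps: float = 1e-6) -> float:
    """Divisor that brings total_norm down to clip_grad; 1.0 means no clipping."""
    if clip_grad <= 0:
        return 1.0  # clipping disabled
    return max(1.0, (total_norm + eps) / clip_grad)

def fake_muon_update(grad: List[float]) -> List[float]:
    """Stand-in for muon_update: output has unit norm regardless of input scale."""
    norm = sum(g * g for g in grad) ** 0.5
    return [g / (norm + 1e-7) for g in grad]

def unscale_and_clip(grad, total_norm, loss_scale, clip_grad, skip_clipping=False):
    """Undo loss scaling and, unless skipped, apply norm clipping as well."""
    divisor = loss_scale
    if not skip_clipping:
        divisor *= clip_factor(total_norm / loss_scale, clip_grad)
    return [g / divisor for g in grad]

raw_grad = [3.0, 4.0]   # raw gradient with norm 5.0
raw_norm = 5.0          # norm pre-computed before flattening
factor = clip_factor(raw_norm, clip_grad=1.0)

# 1) Clip the raw gradient, 2) run the Muon-style update on the clipped input.
clipped = [g / factor for g in raw_grad]
update = fake_muon_update(clipped)

# 3) At step time, only undo the loss scale; clipping has already happened.
final = unscale_and_clip(update, total_norm=raw_norm, loss_scale=1.0,
                         clip_grad=1.0, skip_clipping=True)
```

With `skip_clipping=True` the update leaves the step unchanged apart from loss-scale removal, which is the behavior the PR describes for Muon parameter groups.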

Signed-off-by: fy817 <277645218@qq.com>

sfc-gh-truwase · 2026-01-13T16:12:47Z

@PKUWZP, please help review

fy817 · 2026-01-19T03:37:56Z

Hi @PKUWZP, could you please take a look when you have a moment? Let me know if you need anything else from my side. Thanks!

fy817 · 2026-01-25T11:44:45Z

Hi @sfc-gh-truwase, it seems @PKUWZP might be busy at the moment. Would it be possible for you or another maintainer to take a look? Thanks for your help!

sfc-gh-truwase · 2026-01-25T16:13:27Z

@fy817 can you please resolve the conflict?

fy817 · 2026-01-25T16:26:59Z

@sfc-gh-truwase Thank you for the feedback. The conflicts have been resolved. Please refer to my comment above for details on how I handled the conflicts. Looking forward to your review.

sfc-gh-truwase · 2026-01-25T16:29:53Z

"The conflicts have been resolved. Please refer to my comment above for details on how I handled the conflicts. Looking forward to your review."

Can you please clarify how the conflicts were resolved? GitHub is still reporting a conflict.

fy817 · 2026-01-25T16:38:54Z

@sfc-gh-truwase Sorry about that! I misunderstood the context and missed the merge conflict. It's been fixed now. Thanks for the heads-up.

fy817 · 2026-01-29T11:26:23Z

@sfc-gh-truwase Hi, I noticed that this PR has been linked with #7808. Please let me know if there is anything else needed from my side. I am happy to help get this merged.

sfc-gh-truwase · 2026-01-29T17:39:09Z

deepspeed/runtime/zero/stage_1_and_2.py

+            grad_norm_for_muon = None
+            if self.is_gradient_accumulation_boundary:
+                # Check if any parameter group uses Muon
+                uses_muon = False


The main connection to #7808 is that use_muon should be an object field set in the constructor and defined per-param_group.

PKUWZP · 2026-01-31T04:03:40Z

@fy817 Thanks so much for submitting the PR. I have reviewed the PR and left a few comments. One fundamental question I have is: do we actually need "naive gradient clipping" for the Muon optimizer? I feel that it might not even be needed for Muon. Here are my thoughts:

In Muon Optimizer Implementation (shown below), the Newton-Schulz iteration already normalizes by spectral norm internally. The output is approximately an orthogonal matrix whose norm is determined by the matrix shape, not the input gradient magnitude.

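def zeropower_via_newtonschulz5(G, steps: int):
    # Quintic iteration coefficients used by the reference Muon implementation
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    # Ensure spectral norm is at most 1 (the Frobenius norm bounds it from above)
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    # Newton-Schulz iterations
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X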

This means:

Exploding gradients are already handled - the spectral norm division bounds the input
The output norm is shape-dependent, not gradient-dependent - no need to clip the output
Performing pre-clipping changes gradient direction unnecessarily - and the magnitude will be normalized anyway
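
The scale-invariance point can be checked with a small experiment. The sketch below re-implements the iteration quoted above with plain Python lists in float64 (no bfloat16, and only for wide or square inputs), so it is an illustration rather than the optimizer's actual code path. The helper names and the test matrix are made up.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz5(G, steps=5):
    """List-based version of the quoted iteration (wide or square G only)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / (fro + 1e-7) for x in row] for row in G]   # normalize the input
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        n = len(A)
        B = [[b * A[i][j] + c * AA[i][j] for j in range(n)] for i in range(n)]
        BX = matmul(B, X)
        X = [[a * X[i][j] + BX[i][j] for j in range(len(X[0]))] for i in range(len(X))]
    return X

G = [[0.5, -1.2, 0.3], [2.0, 0.1, -0.7]]
G_big = [[1000.0 * x for x in row] for row in G]   # "exploding" copy of G

out = newton_schulz5(G)
out_big = newton_schulz5(G_big)
max_diff = max(abs(x - y) for r1, r2 in zip(out, out_big) for x, y in zip(r1, r2))
print(f"largest entrywise difference: {max_diff:.2e}")
```

Because the input is divided by its own norm before iterating, a uniform rescale of the gradient (which is what global norm clipping applies) leaves the orthogonalized output essentially unchanged.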

Instead, I think MuonClip (described in the Kimi K2 technical report: https://arxiv.org/abs/2507.20534) is actually interesting for this case. It rescales the query and key matrices within attention layers after each update to further improve training stability. I just don't feel that naive pre-clipping is needed here for Muon.

Also, in this PR the global norm is computed via scaled_global_norm() over all averaged_gradients, which is not suitable for a mixed setup (e.g., Muon for hidden layers and AdamW for embedding layers). In that case the global norm would include the orthogonalized Muon gradients, which is not the right basis for clipping the AdamW parameter groups. Thoughts?
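
A toy calculation shows why a single global norm is a problem in a mixed setup. The group names, gradient values, and helpers below are hypothetical; the only point is that a large contribution from the Muon group changes the clip coefficient that the AdamW group would see.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def clip_coefficient(total_norm, clip_grad, eps=1e-6):
    """Factor that norm clipping multiplies gradients by (capped at 1)."""
    return min(1.0, clip_grad / (total_norm + eps))

# Hypothetical flattened gradients, grouped by the optimizer that owns them.
groups = {
    "adamw_embeddings": {"use_muon": False, "grads": [0.3, -0.4]},       # norm 0.5
    "muon_hidden":      {"use_muon": True,  "grads": [6.0, -8.0, 0.0]},  # norm 10.0
}

all_grads = [g for grp in groups.values() for g in grp["grads"]]
adamw_grads = [g for grp in groups.values() if not grp["use_muon"] for g in grp["grads"]]

# One norm over everything versus a norm restricted to the AdamW groups.
mixed_coef = clip_coefficient(l2_norm(all_grads), clip_grad=1.0)
adamw_only_coef = clip_coefficient(l2_norm(adamw_grads), clip_grad=1.0)

print(f"coefficient from the mixed global norm: {mixed_coef:.4f}")
print(f"coefficient from the AdamW-only norm:   {adamw_only_coef:.4f}")
```

Here the AdamW gradients are already below the threshold on their own, yet the mixed norm would shrink them by roughly a factor of ten.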

PKUWZP · 2026-01-30T14:43:57Z

deepspeed/runtime/zero/stage_1_and_2.py

+            grad_norm_for_muon = None
+            if self.is_gradient_accumulation_boundary:
+                # Check if any parameter group uses Muon
+                uses_muon = False


@fy817 As @sfc-gh-truwase mentioned, use_muon should be an object field. It should be computed once and stored as a class field. Computing it repeatedly is inefficient and can cause errors.

PKUWZP · 2026-01-30T14:48:32Z

deepspeed/runtime/zero/stage_1_and_2.py

+                # Check if any parameter group uses Muon
+                uses_muon = False
+                for i, _ in enumerate(self.bit16_groups):
+                    if len(self.params_in_partition[i]) > 0 and getattr(self.params_in_partition[i][0], 'use_muon', False):


@fy817 getattr(tensor_list[0], 'use_muon', False) relies on use_muon being set externally by the user, but there is no documentation or validation to ensure that this attribute exists or is set properly.

PKUWZP · 2026-01-30T14:52:35Z

deepspeed/runtime/zero/stage_1_and_2.py

+                uses_muon = False
+                for i, _ in enumerate(self.bit16_groups):
+                    if len(self.params_in_partition[i]) > 0 and getattr(self.params_in_partition[i][0], 'use_muon', False):
+                        uses_muon = True


@fy817 One suggestion is to add validation in __init__ of the DeepSpeedZeroOptimizer class that checks for use_muon attributes when a Muon optimizer is detected, and to add documentation about this requirement.
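
One possible shape for that validation is sketched below. It is an illustration only: `validate_use_muon_flags` and `ToyZeroOptimizer` are hypothetical names, parameters are modeled as simple namespace objects, and the sketch assumes a Muon optimizer has already been detected, so every parameter is expected to carry an explicit `use_muon` flag.

```python
from types import SimpleNamespace

def validate_use_muon_flags(param_groups):
    """Return one bool per group; raise if a group omits or mixes `use_muon`."""
    flags = []
    for idx, group in enumerate(param_groups):
        if not group:
            flags.append(False)  # an empty partition contributes nothing
            continue
        missing = [p for p in group if not hasattr(p, "use_muon")]
        if missing:
            raise ValueError(
                f"param group {idx}: {len(missing)} parameter(s) lack `use_muon`")
        values = {bool(p.use_muon) for p in group}
        if len(values) > 1:
            raise ValueError(f"param group {idx}: mixed `use_muon` values")
        flags.append(values.pop())
    return flags

class ToyZeroOptimizer:
    """Hypothetical stand-in that computes the flags once in __init__."""
    def __init__(self, param_groups):
        self.group_uses_muon = validate_use_muon_flags(param_groups)
        self.uses_muon = any(self.group_uses_muon)

hidden = [SimpleNamespace(use_muon=True), SimpleNamespace(use_muon=True)]
embed = [SimpleNamespace(use_muon=False)]
opt = ToyZeroOptimizer([hidden, embed])
print(opt.group_uses_muon, opt.uses_muon)
```

Storing the per-group flags as fields lets later code read `self.group_uses_muon[i]` instead of probing the first parameter of each partition with `getattr` on every step.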

PKUWZP · 2026-01-30T15:03:41Z

deepspeed/runtime/zero/stage_1_and_2.py


        if self.cpu_offload is False:
+            # Pre-compute gradient norm for Muon clipping if needed
+            grad_norm_for_muon = None


Here you introduced a new instance variable, grad_norm_for_muon, that is set in independent_gradient_partition_epilogue and consumed in get_flat_partition. This could cause a data race because of the resulting implicit dependency on call order. One solution is to define grad_norm_for_muon explicitly outside the function, so that we can ensure it is always valid before it is passed into other functions.

fy817 · 2026-01-31T04:32:34Z

@sfc-gh-truwase, @PKUWZP, thank you for your responses. Below are some of my thoughts:

Indeed, the Muon optimizer may not require gradient clipping. My experiments also confirm this—setting the gradient clipping value to 0 resolves the aforementioned conflict between the Muon optimizer and gradient clipping, and normal training performance can still be achieved.
However, I would like to point out that the conflict between the Muon optimizer and gradient clipping still exists in the current implementation. First, users might overlook the gradient clipping setting. Since DeepSpeed defaults to a gradient clipping value of 1.0, this can lead to incorrect gradient modification due to clipping, ultimately causing training failures—a problem that still needs to be addressed. Second, even if we resolve the conflict by setting gradient clipping to 0, while Muon itself may not need clipping, users might still require gradient clipping for the Adam update portion of the optimizer.
Thank you very much for pointing out the shortcomings in the current PR. I look forward to further discussion to refine this solution.
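
For reference, the workaround mentioned above is controlled by the `gradient_clipping` key of the DeepSpeed config. A minimal fragment, written here as a Python dict, might look as follows; the batch size and ZeRO stage are illustrative assumptions rather than values taken from this PR.

```python
# Minimal DeepSpeed config fragment (illustrative values).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # illustrative value
    "zero_optimization": {"stage": 2},     # ZeRO stage 2, as targeted by this PR
    "gradient_clipping": 0.0,              # 0 disables global norm clipping
}
```

As noted in the comment above, this also turns clipping off for any AdamW-managed parameters, which is why it is a workaround rather than a fix.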

fy817 requested review from tjruwase and tohtana as code owners January 12, 2026 11:44

Fix Muon optimizer conflict with gradient clipping in ZeRO 1/2

1a693b5

Signed-off-by: fy817 <277645218@qq.com>

fy817 force-pushed the fix-muon-grad-clip-conflict branch from c63f84f to 1a693b5 Compare January 12, 2026 11:55

loadams added 2 commits January 12, 2026 19:44

Merge branch 'master' into fix-muon-grad-clip-conflict

aa13d03

Merge branch 'master' into fix-muon-grad-clip-conflict

1e8b95c

sfc-gh-truwase requested a review from PKUWZP January 13, 2026 16:12

Merge branch 'master' into fix-muon-grad-clip-conflict

2ce3503

sfc-gh-truwase mentioned this pull request Jan 25, 2026

fix: Ensure full gradient reduction for Muon with reduce_scatter #7808

Closed

fy817 and others added 2 commits January 28, 2026 22:27

Merge branch 'master' into fix-muon-grad-clip-conflict

b9d09ad

Merge branch 'master' into fix-muon-grad-clip-conflict

fad0bed

sfc-gh-truwase reviewed Jan 29, 2026

PKUWZP reviewed Jan 31, 2026
