
draft: support NemotronH model in HF path #943

Draft: Fridah-nv wants to merge 1 commit into main from fridah/super-ptq

Conversation

Contributor

@Fridah-nv Fridah-nv commented Feb 27, 2026

What does this PR do?

Type of change: ?

Overview: ?

Usage

We can use `NemotronHForCausalLM` with `MAMBA_MOE_NVFP4_CONSERVATIVE_CFG` or `MAMBA_MOE_NVFP4_AGGRESSIVE_CFG`.

# Add a code snippet demonstrating how to use this
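A minimal usage sketch (not part of the PR): `mtq.quantize` is ModelOpt's standard PTQ entry point, but whether the two configs are exposed under `modelopt.torch.quantization` is an assumption here, and the model and calibration loop are placeholders.

```python
def quantize_nemotron_h(model, calib_loop, aggressive: bool = False):
    """Hypothetical sketch: quantize a NemotronHForCausalLM with one of the
    two NVFP4 MoE configs from this PR. Imports are deferred so the sketch
    stays self-contained; the config attribute location is an assumption."""
    import modelopt.torch.quantization as mtq

    cfg = (
        mtq.MAMBA_MOE_NVFP4_AGGRESSIVE_CFG
        if aggressive
        else mtq.MAMBA_MOE_NVFP4_CONSERVATIVE_CFG
    )
    # forward_loop runs calibration forward passes over the model
    return mtq.quantize(model, cfg, forward_loop=calib_loop)
```

Pass `aggressive=True` to select the aggressive config instead of the conservative one.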

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

copy-pr-bot bot commented Feb 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.



coderabbitai bot commented Feb 27, 2026

Review skipped (draft detected).


def _setup(self):
    pass

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
Contributor Author

I'm just moving what we have in super_ptq.py here; we can discuss further if we want to align this with the other MoE quantization behavior in ModelOpt.


codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.04%. Comparing base (4eacb0d) to head (63dca16).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #943      +/-   ##
==========================================
- Coverage   72.18%   72.04%   -0.14%     
==========================================
  Files         207      207              
  Lines       22656    22718      +62     
==========================================
+ Hits        16355    16368      +13     
- Misses       6301     6350      +49     


def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return super().forward(hidden_states)

def layer_sync_moe_local_experts_amax(self):
Collaborator

@realAsma is this still necessary?

Contributor

This is helpful for advanced algorithms. This method only syncs the input quantizer amax; a correctly synced input quantizer amax is required for the MSE/GPTQ algorithms.

},
"algorithm": "max",
}

Collaborator

nit: remove

wildcards.add(name[:i] + "*")
for i in range(len(name)):
    if name[i] == ".":
        wildcards.add(name[:i] + "*")
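For context (not part of the diff), a self-contained sketch of what this loop computes: every dot-delimited prefix of a module name becomes a wildcard pattern, so any ancestor scope of the module can be matched later.

```python
def prefix_wildcards(name: str) -> set:
    """Collect a wildcard for every dot-delimited prefix of a module name.

    Mirrors the loop above; `name` is assumed to be a dotted module path.
    """
    wildcards = set()
    for i in range(len(name)):
        if name[i] == ".":
            wildcards.add(name[:i] + "*")
    return wildcards

# Example with a typical (hypothetical) MoE expert path:
print(sorted(prefix_wildcards("model.layers.0.mixer.experts.3.up_proj")))
```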
Collaborator

Do you know why we need to add this?

_prefix_wildcard_summarize_exclude_modules(
    exclude_modules, per_layer_config["quantized_layers"].keys()
)
summarized = _prefix_wildcard_summarize_exclude_modules(
Collaborator

Is this a no-op change?

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return super().forward(hidden_states)

def layer_sync_moe_local_experts_amax(self):
Contributor

@Fridah-nv could you please check the implementation of the supported MoE layers,

class _QuantSparseMoe(QuantModule):

and see whether we can move this function into the `_QuantSparseMoe` base class?

    return output


class _QuantNemotronHMOE(QuantModule):
Contributor

Do we still need this? Does the following work instead?

def register_sparse_moe_on_the_fly(model):

Contributor

In my understanding, we don't need explicit registration for attention (it is caught by the attention AST patching). Is that correct?

amax_dict[name] = (
    amax_tensor if stored is None else torch.maximum(stored, amax_tensor)
)
for expert in self.experts:
Contributor

Can you iterate through the key/value pairs in `amax_dict` instead of through the entire expert modules?
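As an illustration of the suggested pattern (names here are hypothetical, and plain floats stand in for tensors), merging per-expert amax values keyed by quantizer name lets later code walk the dict directly instead of re-walking every expert:

```python
def merge_amax(amax_dict: dict, updates: dict) -> dict:
    """Merge amax values by quantizer name, keeping the running maximum per
    key. In the real code the values are tensors and torch.maximum replaces
    max()."""
    for name, amax in updates.items():
        stored = amax_dict.get(name)
        amax_dict[name] = amax if stored is None else max(stored, amax)
    return amax_dict

# Iterate the merged dict directly, rather than looping over self.experts:
merged = merge_amax({"up_proj": 1.0}, {"up_proj": 2.5, "down_proj": 0.5})
for name, amax in merged.items():
    pass  # e.g. write amax back to the quantizer matching `name`
```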

