Conversation
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
```python
def _setup(self):
    pass

def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
```
I'm just moving what we have in super_ptq.py to here; we can discuss more if we want to align this with other MoE quantization behavior in ModelOPT.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #943      +/-   ##
==========================================
- Coverage   72.18%   72.04%   -0.14%
==========================================
  Files         207      207
  Lines       22656    22718      +62
==========================================
+ Hits        16355    16368      +13
- Misses       6301     6350      +49
```
```python
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return super().forward(hidden_states)

def layer_sync_moe_local_experts_amax(self):
```
This is helpful for advanced algorithms. This method only syncs the input-quantizer amax; a correctly synced input-quantizer amax is required for the MSE/GPTQ algorithms.
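A minimal sketch of the syncing idea described above (pure-Python stand-in with scalar amax values; the actual implementation reduces per-expert amax tensors, e.g. with `torch.maximum`, and the function name here is hypothetical):

```python
def sync_local_experts_input_amax(expert_amaxes: list[float]) -> list[float]:
    # Reduce: take the max of the calibrated input-quantizer amax over all
    # local experts, then broadcast the shared value back so every expert
    # quantizes its inputs with the same scale.
    shared = max(expert_amaxes)
    return [shared for _ in expert_amaxes]

print(sync_local_experts_input_amax([0.9, 1.2, 1.1]))  # → [1.2, 1.2, 1.2]
```

MSE/GPTQ-style algorithms assume all experts see one shared input scale, which is why the sync has to happen before they run.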
```python
    },
    "algorithm": "max",
}
```

```python
for i in range(len(name)):
    if name[i] == ".":
        wildcards.add(name[:i] + "*")
```
Do you know why we need to add this?
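For context, a self-contained sketch of what that loop produces: one wildcard pattern per dotted prefix of a module name (the module name here is made up for illustration):

```python
def prefix_wildcards(name: str) -> set[str]:
    # Collect a wildcard for every dotted prefix, so a module like
    # "model.layers.0.mlp" can later be matched by "model.layers.0*" etc.
    wildcards = set()
    for i in range(len(name)):
        if name[i] == ".":
            wildcards.add(name[:i] + "*")
    return wildcards

print(sorted(prefix_wildcards("model.layers.0.mlp")))
# → ['model*', 'model.layers*', 'model.layers.0*']
```

A name with no dots contributes no wildcards, so only nested module paths get prefix patterns.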
```python
summarized = _prefix_wildcard_summarize_exclude_modules(
    exclude_modules, per_layer_config["quantized_layers"].keys()
)
```
```python
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
    return super().forward(hidden_states)

def layer_sync_moe_local_experts_amax(self):
```
@Fridah-nv could you please check the implementation of the supported MoE layers and see whether we can implement this function in the `_QuantSparseMoe` base class?
```python
    return output

class _QuantNemotronHMOE(QuantModule):
```
Do we still need this? Does this work without it?
In my understanding, we don't need explicit registration for attention (it is caught by the attention AST patching). Is that correct?
```python
amax_dict[name] = (
    amax_tensor if stored is None else torch.maximum(stored, amax_tensor)
)
for expert in self.experts:
```
Can you iterate through the key/value pairs in `amax_dict` instead of through all the expert modules?
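A rough sketch of the suggested shape of that refactor, driving the write-back from `amax_dict` itself (plain-Python stand-ins for the quantizer objects; all names here are hypothetical, not the PR's actual code):

```python
class _Quantizer:
    # Stand-in for a TensorQuantizer-like object holding an amax attribute.
    def __init__(self):
        self.amax = None

def apply_amax(quantizers: dict, amax_dict: dict) -> None:
    # Iterate the collected key/value pairs directly: only quantizers that
    # actually appear in amax_dict are touched, instead of re-walking every
    # expert module.
    for name, amax in amax_dict.items():
        quantizers[name].amax = amax

qs = {
    "experts.0.w1.input_quantizer": _Quantizer(),
    "experts.1.w1.input_quantizer": _Quantizer(),
}
shared = 3.5  # the synced max amax across experts
apply_amax(qs, {name: shared for name in qs})
print(qs["experts.0.w1.input_quantizer"].amax)  # → 3.5
```

This keeps the loop proportional to the number of collected amax entries rather than the number of expert submodules.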
## What does this PR do?

**Type of change:** ?

**Overview:** ?

## Usage

We can use `NemotronHForCausalLM` with `MAMBA_MOE_NVFP4_CONSERVATIVE_CFG` or `MAMBA_MOE_NVFP4_AGGRESSIVE_CFG`.

```python
# Add a code snippet demonstrating how to use this
```

## Testing

## Before your PR is "Ready for review"

## Additional Information