Releases: NVIDIA/Model-Optimizer

ModelOpt 0.42.0 Release

09 Mar 20:31
e2a4a8b

Bug Fixes

  • Fixed calibration data generation with multiple samples in the ONNX workflow.

New Features

  • Added a standalone type inference option (--use_standalone_type_inference) to ONNX AutoCast as an experimental alternative to ONNX's infer_shapes. This option performs type-only inference without shape inference, which can help when shape inference fails or when you want to avoid extra shape inference overhead.
  • Added quantization support for the Kimi K2 Thinking model from the original int4 checkpoint.
  • Introduced support for params constraint-based automatic neural architecture search in Minitron pruning (mcore_minitron) as an alternative to manual pruning with export_config. See examples/pruning/README.md for more details.
  • Added an example of Minitron pruning with the Megatron-Bridge framework, including advanced params-constraint-based pruning and a new distillation example. See examples/megatron_bridge/README.md.
  • Supported calibration data with multiple samples in .npz format in the ONNX AutoCast workflow.
  • Added the --opset option to the ONNX quantization CLI to specify the target opset version for the quantized model.
  • Enabled support for context parallelism in Eagle speculative decoding for both HuggingFace and Megatron Core models.
  • Added unified Hugging Face export support for diffusers pipelines/components.
  • Added support for LTX-2 and Wan2.2 (T2V) in the diffusers quantization workflow.
  • Provided PTQ support for GLM-4.7, including loading MTP layer weights from a separate mtp.safetensors file and exporting them as-is.
  • Added support for image-text data calibration in PTQ for Nemotron VL models.
  • Enabled advanced weight scale search for NVFP4 quantization and its export pathway.
  • Provided PTQ support for Nemotron Parse.
  • Added distillation support for LTX-2. See examples/diffusers/distillation/README.md for more details.
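
The multi-sample .npz calibration format above can be produced with plain NumPy. This is a minimal sketch, not ModelOpt's documented API: the input name "input" and the (3, 224, 224) sample shape are placeholders for your ONNX model's actual input names and shapes.

```python
import os
import tempfile

import numpy as np

# Stack several calibration samples along the batch dimension and save them
# in one .npz file. The key "input" and the sample shape are illustrative
# placeholders; use your ONNX model's real input names and shapes.
samples = [np.random.randn(1, 3, 224, 224).astype(np.float32) for _ in range(8)]
calib = {"input": np.concatenate(samples, axis=0)}  # shape (8, 3, 224, 224)

path = os.path.join(tempfile.gettempdir(), "calib_data.npz")
np.savez(path, **calib)

# Reload to verify the round trip before feeding it to the workflow.
loaded = np.load(path)
```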

0.42.0rc2

28 Feb 18:32
eaf5d7e

Pre-release

Install the 0.42.0rc2 pre-release version using:

pip install nvidia-modelopt[all]==0.42.0rc2 --extra-index-url https://pypi.nvidia.com

0.42.0rc1

21 Feb 14:50
f08a65f

Pre-release

Install the 0.42.0rc1 pre-release version using:

pip install nvidia-modelopt==0.42.0rc1 --extra-index-url https://pypi.nvidia.com

0.42.0rc0

04 Feb 05:34
87237e7

Pre-release

Install the 0.42.0rc0 pre-release version using:

pip install nvidia-modelopt==0.42.0rc0 --extra-index-url https://pypi.nvidia.com

ModelOpt 0.41.0 Release

20 Jan 17:10
d39cf45

Bug Fixes

  • Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).

New Features

  • Add support for Transformer Engine quantization for Megatron Core models.
  • Add support for Qwen3-Next model quantization.
  • Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
  • Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
  • Add support for subgraphs in ONNX autocast.
  • Add support for parallel draft heads in Eagle speculative decoding.
  • Add support for registering a custom emulated quantization backend. See register_quant_backend for more details, and tests/unit/torch/quantization/test_custom_backend.py for an example.
  • Add examples/llm_qad for QAD training with Megatron-LM.
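
As a rough illustration of what an emulated quantization backend computes, here is a self-contained symmetric int8 fake-quant (quantize-dequantize) in NumPy. This sketches only the underlying math; the actual register_quant_backend interface and its expected signature are documented in the ModelOpt API reference, not shown here.

```python
import numpy as np

def emulated_int8_fakequant(x: np.ndarray, amax: float) -> np.ndarray:
    """Symmetric int8 quantize-dequantize ("fake quant") in plain NumPy."""
    scale = amax / 127.0                          # map [-amax, amax] to [-127, 127]
    q = np.clip(np.round(x / scale), -128, 127)   # emulated integer tensor
    return (q * scale).astype(x.dtype)            # dequantize back to float

x = np.array([0.5, -1.2, 3.0, -3.5], dtype=np.float32)
y = emulated_int8_fakequant(x, amax=float(np.abs(x).max()))
```

With amax set to the tensor's max absolute value, no element is clipped and the round-trip error stays within half a quantization step.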

Deprecations

  • Deprecate the num_query_groups parameter in Minitron pruning (mcore_minitron). Use ModelOpt 0.40.0 or earlier if you still need to prune this dimension.

Backward Breaking Changes

  • Remove torchprofile as a default dependency from ModelOpt as it's used only for flops-based FastNAS pruning (computer vision models). It can be installed separately if needed.

0.41.0rc3

20 Jan 05:12
d39cf45

Pre-release

0.41.0rc2

14 Jan 04:59
41aaec5

Pre-release

0.41.0rc1

05 Jan 13:54
8426c36

Pre-release

ModelOpt 0.40.0 Release

12 Dec 10:27
411912e

Bug Fixes

  • Fix a bug in FastNAS pruning (computer vision models) where model parameters were sorted twice, corrupting the ordering.
  • Fix Q/DQ/Cast node placements in 'FP32 required' tensors in custom ops in the ONNX quantization workflow.

New Features

  • Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for num_moe_experts, moe_ffn_hidden_size, and moe_shared_expert_intermediate_size parameters in Minitron pruning (mcore_minitron).
  • Add specdec_bench example to benchmark speculative decoding performance. See examples/specdec_bench/README.md for more details.
  • Add FP8/NVFP4 KV cache quantization support for Megatron Core models.
  • Add KL Divergence loss-based auto_quantize method. See auto_quantize API docs for more details.
  • Add support for saving and resuming auto_quantize search state. This speeds up the auto_quantize process by skipping the score estimation step if the search state is provided.
  • Add the flag trt_plugins_precision in ONNX autocast to indicate custom-op precision, mirroring the flag of the same name in the quantization workflow.
  • Add support for PyTorch Geometric quantization.
  • Add per-tensor and per-channel MSE calibrator support.
  • Add support for PTQ/QAT checkpoint export and loading to run fake-quant evaluation in vLLM. See examples/vllm_serve/README.md for more details.
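
The idea behind an MSE calibrator can be sketched in a few lines of NumPy: sweep candidate clipping ranges (amax) and keep the one with the lowest quantize-dequantize error. This is an illustrative per-tensor search under assumed symmetric int8 quantization, not ModelOpt's implementation; a per-channel variant would run the same search independently along the channel axis.

```python
import numpy as np

def quant_mse(x: np.ndarray, amax: float, qmax: int = 127) -> float:
    """MSE of symmetric quantize-dequantize with clipping range amax."""
    scale = amax / qmax
    xq = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
    return float(np.mean((x - xq) ** 2))

def mse_amax_search(x: np.ndarray, num_candidates: int = 50) -> float:
    """Sweep candidate clipping ranges; return the amax with lowest MSE."""
    max_abs = float(np.abs(x).max())
    candidates = np.linspace(0.5, 1.0, num_candidates) * max_abs
    return min(candidates, key=lambda a: quant_mse(x, a))

# Gaussian weights plus one large outlier: clipping at or slightly below the
# raw max trades a little clipping error for a much finer quantization grid.
rng = np.random.default_rng(0)
w = rng.standard_normal(65536).astype(np.float32)
w[0] = 20.0
best_amax = mse_amax_search(w)
```

Because the raw max is itself one of the candidates, the searched range is never worse (in MSE) than simple max calibration.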

Misc

  • NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer. GitHub automatically redirects the old repository path (NVIDIA/TensorRT-Model-Optimizer) to the new one (NVIDIA/Model-Optimizer). The documentation URL has also changed to nvidia.github.io/Model-Optimizer.
  • Bump TensorRT-LLM test docker to 1.2.0rc4.
  • Bump minimum recommended transformers version to 4.53.
  • Replace the ONNX simplification package onnxsim with onnxslim.

ModelOpt 0.39.0 Release

13 Nov 07:25
f329b19

Deprecations

  • Deprecated modelopt.torch._deploy.utils.get_onnx_bytes API. Please use modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata instead to access the ONNX model bytes with external data. See examples/onnx_ptq/download_example_onnx.py for example usage.

New Features

  • Added flag op_types_to_exclude_fp16 in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating 'fp32' precision in trt_plugins_precision.
  • Added LoRA mode support for MCore in a new peft submodule: modelopt.torch.peft.update_model(model, LORA_CFG).
  • Supported PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See examples/vllm_serve for more details.
  • Added support for nemotron-post-training-dataset-v2 and nemotron-post-training-dataset-v1 in examples/llm_ptq. Defaults to a mix of cnn_dailymail and nemotron-post-training-dataset-v2 (gated dataset accessed using the HF_TOKEN environment variable) if no dataset is specified.
  • Allowed specifying calib_seq in examples/llm_ptq to set the maximum sequence length for calibration.
  • Added support for MCore MoE PTQ/QAT/QAD.
  • Added support for multi-node PTQ and export with FSDP2 in examples/llm_ptq/multinode_ptq.py. See examples/llm_ptq/README.md for more details.
  • Added support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
  • Added flags nodes_to_include and op_types_to_include in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
  • Added support for torch.compile and benchmarking in examples/diffusers/quantization/diffusion_trt.py.
  • Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See SGLang quantization documentation for more details.
  • Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
  • Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
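
For readers new to LoRA (used by the new peft submodule above), the adaptation boils down to augmenting a frozen weight W with a scaled low-rank update (alpha / r) * B @ A. The following is a generic NumPy sketch of that math, not the modelopt.torch.peft API; all dimensions and the alpha value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 8, 4, 8.0

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

def lora_forward(x: np.ndarray) -> np.ndarray:
    # Base layer output plus the scaled low-rank update (alpha / r) * B @ A.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d_in))
# With B zero-initialized, the adapted layer matches the base layer exactly,
# so training starts from the base model's behavior.
```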

Additional Announcements

  • ModelOpt will switch from odd-only minor versions to consecutive versions starting with the next release, which will therefore be named 0.40.0 instead of 0.41.0.