130 changes: 31 additions & 99 deletions docs/source/guides/5_speculative_decoding.rst
Speculative Decoding
====================

Introduction
============

ModelOpt's Speculative Decoding module (:mod:`modelopt.torch.speculative <modelopt.torch.speculative>`)
enables your model to generate multiple tokens in each generate step, reducing inference latency.

ModelOpt implements the **EAGLE3** algorithm, which attaches a lightweight autoregressive draft
module to a frozen base model. The draft module operates at the feature level (predicting future
hidden states rather than tokens directly) to achieve high acceptance rates at low compute cost.

.. toctree::
    :maxdepth: 1
    :caption: Module Guide

    ./_speculative_module_guide.rst

.. toctree::
    :maxdepth: 1
    :caption: EAGLE

    ./_eagle_workflow.rst
    ./_eagle_config_reference.rst
    ./_eagle_best_practices.rst

.. _speculative-concepts:

Speculative Decoding Concepts
=============================

Speculative decoding
--------------------

The standard way of generating text from a language model is autoregressive decoding: one token
is generated per step and appended to the input context for the next step. This means that
generating *K* tokens takes *K* serial runs of the model. Inference from large autoregressive
models like Transformers can therefore be slow and expensive, so various *speculative decoding*
algorithms have been proposed to accelerate text generation, especially in latency-critical
applications.

Typically, a short draft of length *K* is generated using a faster model, called the *draft model*.
Then, a larger and more powerful model, called the *target model*, verifies the draft in a single
forward pass. A sampling scheme decides which draft tokens to accept, recovering the output
distribution of the target model in the process.
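
The draft-then-verify loop can be sketched with toy stand-in models. This is an illustrative
Python sketch of the general scheme, not ModelOpt's implementation; ``draft_model`` and
``target_model`` are hypothetical greedy next-token callables over integer "tokens".

```python
def speculative_step(context, draft_model, target_model, k=4):
    """One speculative step: draft k tokens cheaply, then verify with the target."""
    # 1) Draft k tokens autoregressively with the cheap draft model.
    draft_tokens = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        draft_tokens.append(tok)
        ctx.append(tok)

    # 2) Verify. A real implementation scores all k positions in ONE target
    #    forward pass; here we loop over prefixes for clarity.
    accepted = []
    ctx = list(context)
    for tok in draft_tokens:
        target_tok = target_model(ctx)
        if target_tok != tok:
            accepted.append(target_tok)  # first mismatch: take the target's token
            break
        accepted.append(tok)             # match: accept the draft token for free
        ctx.append(tok)
    return list(context) + accepted

# Toy greedy "models" over integer tokens: the draft occasionally disagrees.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) % 5 else 0

print(speculative_step([1, 2], draft, target, k=4))  # → [1, 2, 3, 4, 5, 6]
```

Here three draft tokens were accepted and the target supplied a fourth, so four tokens were
produced for the cost of one target verification pass plus *k* cheap draft calls.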

EAGLE3 algorithm
----------------

EAGLE3 attaches a lightweight autoregressive decoder (the draft module) to a frozen base model.
Unlike token-level autoregression, the draft module operates at the *feature level*: it predicts
future hidden states, which are then projected to token logits. Autoregression over hidden states
is an easier task than over tokens, so the draft module achieves high prediction accuracy with low
compute cost.

Compared to earlier EAGLE versions, EAGLE3 uses auxiliary hidden states from **multiple intermediate
layers** of the base model as additional input to the draft decoder, not just the final layer output.
This richer signal enables the draft module to more accurately predict the base model's next-layer
representations, resulting in higher token acceptance rates and greater inference speedup.
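
Concretely, the draft decoder's input can be thought of as a concatenation of base-model
features. The sketch below uses toy Python lists and a hypothetical ``build_draft_input``
helper to show the idea of pulling in auxiliary layers; the actual tensor shapes and
projections inside ModelOpt differ.

```python
def build_draft_input(layer_hiddens, aux_layer_ids):
    """Concatenate the final hidden state with auxiliary intermediate ones."""
    features = list(layer_hiddens[-1])      # final-layer hidden state first
    for i in aux_layer_ids:
        features.extend(layer_hiddens[i])   # append each auxiliary layer's features
    return features

# A 4-layer base model with hidden size 2 (toy numbers).
hiddens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
print(build_draft_input(hiddens, aux_layer_ids=[1, 2]))
# → [0.7, 0.8, 0.3, 0.4, 0.5, 0.6]
```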
107 changes: 107 additions & 0 deletions docs/source/guides/_eagle_best_practices.rst
.. _eagle-best-practices:

Best Practices
====================

This page collects practical recommendations for achieving the best results when training EAGLE
speculative decoding models.


.. _eagle-best-practices-data-synthesis:

Data Synthesis
--------------

Training on conversations **generated by the base model** rather than human-authored datasets
significantly improves token acceptance rates. The draft module learns to predict the target
model's actual output distribution, not just surface-level text patterns.

To prepare synthetic training data, launch an inference server with the base model:

.. code-block:: bash

pip install vllm
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--api-key token-abc123 \
--port 8000 \
--tensor-parallel-size 1

.. note::

For quantized models, add ``--quantization=modelopt`` to the ``vllm serve`` command.

Then generate conversations using prompts from your training dataset:

.. code-block:: bash

python scripts/server_generate.py \
--data_path input_conversations/daring-anteater.jsonl \
--output_path synthetic/train.jsonl

Use ``--system_prompt <text>`` to inject a system prompt into every conversation.

For large-scale generation across multiple GPUs, see the
`SLURM data preparation guide <https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/speculative_decoding/SLURM_prepare_data.md>`_.


.. _eagle-best-practices-configure:

Configuring the Draft Model
----------------------------

ModelOpt ships with sensible default architectures for EAGLE1 and EAGLE3. See
:ref:`eagle-config-reference` for the full list of configurable fields.

When launching training via ``launch_train.sh``, pass a JSON override file with
``--eagle_config <file>``. Only the fields you want to change need to be specified; omitted
fields fall back to the built-in defaults. For example, to use a 2-layer draft decoder with a
larger MLP:

.. code-block:: json

{
"num_hidden_layers": 2,
"intermediate_size": 8192
}



.. _eagle-best-practices-vocab:

Draft Vocabulary Compression
-----------------------------

By default the draft model shares the full vocabulary of the base model. For large vocabularies
(e.g., 128,256 tokens in Llama-3) you can compress the draft vocabulary to a smaller working set,
reducing embedding table size and speeding up both training and inference.

**Step 1 — Calibrate a vocabulary mapping**

Find the most frequently used tokens in your training set and save a ``d2t.pt`` mapping file:

.. code-block:: bash

python scripts/calibrate_draft_vocab.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--data input_conversations/daring-anteater.jsonl \
--draft_vocab_size 32000 \
--save_dir draft_vocab_cache

The ``d2t.pt`` file maps each compressed draft token index to its offset in the target vocabulary.
During inference the target token is recovered as:

.. code-block:: text

target_token = draft_token_index + d2t[draft_token_index]
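
The offset arithmetic above is easy to illustrate with a toy mapping (hypothetical offsets, not
a real calibrated ``d2t.pt``):

```python
# Toy draft-to-target mapping: a 4-entry draft vocab whose entries map to
# scattered positions in a larger target vocab. A real mapping is stored in
# d2t.pt, one offset per draft-vocabulary entry.
d2t = [0, 2, 5, 9]

def to_target(draft_token_index):
    # target_token = draft_token_index + d2t[draft_token_index]
    return draft_token_index + d2t[draft_token_index]

print([to_target(i) for i in range(4)])  # → [0, 3, 7, 12]
```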

**Step 2 — Enable compressed vocabulary in training**

Add the following to your ``eagle_config.json``:

.. code-block:: json

{"draft_vocab_size": 32000}

Then pass ``--draft_vocab_cache <path_to_d2t.pt>`` when running ``./launch_train.sh``. The draft
model will use the compressed vocabulary table during both training and export.
102 changes: 102 additions & 0 deletions docs/source/guides/_eagle_config_reference.rst
.. _eagle-config-reference:

Configuration Reference
===============================

EAGLE3 is configured through a dict passed to :meth:`mtsp.convert()
<modelopt.torch.speculative.speculative_decoding.convert>`. The top-level keys correspond to
fields of :class:`EagleConfig <modelopt.torch.speculative.config.EagleConfig>`, with
``eagle_architecture_config`` containing a nested dict of draft module architecture settings.

.. code-block:: python

config = {
# --- EagleConfig top-level fields ---
"eagle_decoder_type": "llama",
"eagle_freeze_base_model": True,
"eagle_self_logit_distillation": True,
"eagle_offline": False,
"eagle_loss_decay_factor": 0.9,

# --- Draft module architecture ---
"eagle_architecture_config": {
"num_hidden_layers": 1,
"intermediate_size": 8192,
...
},
}
mtsp.convert(model, [("eagle", config)])


EagleConfig fields
------------------

``eagle_decoder_type`` (*str*, default: ``"llama"``)
Draft decoder architecture. Use ``"llama"`` for most models; ``"kimik2"`` for Kimi-K2 models.

``eagle_freeze_base_model`` (*bool*, default: ``True``)
Keep the base model weights frozen during training. Disabling this allows joint fine-tuning
but significantly increases memory usage.

``eagle_self_logit_distillation`` (*bool*, default: ``True``)
Apply logit-level distillation loss in addition to hidden-state regression. Improves token
acceptance rates without extra inference cost.

``eagle_offline`` (*bool*, default: ``False``)
Use pre-computed hidden states from disk instead of running the base model forward pass at
each training step. Required for large models (70B+) that cannot be co-located with the
draft module in GPU memory. See :ref:`Offline Training <speculative_decoding_workflow:Offline Training>`.

``eagle_loss_decay_factor`` (*float*, default: ``0.9``)
Exponential decay applied to losses at successive draft steps, weighting earlier steps more
heavily during training.
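
As a sketch of this weighting, assuming a total loss of the form ``sum(decay**i * loss_i)`` over
draft steps (the exact loss composition is ModelOpt-internal):

```python
def step_weights(num_steps, decay=0.9):
    # Earlier draft steps get weight close to 1; later steps decay geometrically.
    return [decay ** i for i in range(num_steps)]

per_step_losses = [0.50, 0.80, 1.10, 1.30]  # hypothetical per-step training losses
weights = step_weights(len(per_step_losses))
total = sum(w * l for w, l in zip(weights, per_step_losses))
```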

``eagle_architecture_config`` (*dict*, default: ``{}``)
Overrides for the draft module architecture. See `eagle_architecture_config fields`_ below.
``hidden_size``, ``vocab_size``, and ``max_position_embeddings`` are inferred from the base
model and should not be set here.


eagle_architecture_config fields
---------------------------------

These keys override the default draft module architecture. Only set the fields you need to
change; unspecified fields fall back to the defaults listed below (for ``eagle_decoder_type="llama"``).

``num_hidden_layers`` (*int*, default: ``1``)
Number of transformer layers in the draft decoder. Increasing this improves acceptance rates
at the cost of higher draft latency.

``intermediate_size`` (*int*, default: inferred from base model)
Feed-forward intermediate dimension of the draft decoder MLP.

``num_attention_heads`` (*int*, default: ``32``)
Number of attention heads in the draft decoder.

``num_key_value_heads`` (*int*, default: ``8``)
Number of key/value heads (grouped-query attention). Set equal to ``num_attention_heads``
to disable GQA.

``hidden_act`` (*str*, default: ``"silu"``)
Activation function used in the MLP layers.

``use_aux_hidden_state`` (*bool*, default: ``False``)
Feed auxiliary hidden states from intermediate base model layers into the draft decoder—the
key EAGLE3 feature. Set to ``True`` for EAGLE3; ``False`` gives EAGLE1 behaviour.

``eagle_aux_hidden_state_layer_ids`` (*list[int]*, default: ``[]``)
Indices of base model layers whose hidden states are used as auxiliary inputs. Populated
automatically when ``use_aux_hidden_state=True``; override only for custom layer selection.

``use_last_layernorm`` (*bool*, default: ``False``)
Apply a layer-norm after the last draft decoder layer. Required when
``use_aux_hidden_state=True`` (i.e., EAGLE3 mode).

``parallel_draft_step`` (*int*, default: ``1``)
Number of tokens drafted in parallel per step. Values greater than 1 enable parallel
speculative decoding and can further reduce latency on suitable hardware.

.. note::

The complete set of architecture fields and their defaults can be found in
:mod:`modelopt.torch.speculative.eagle.default_config`.
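
Putting the pieces together, an EAGLE3-style conversion config might look like the following.
The field values are illustrative, assembled from the defaults documented above as a sketch
rather than a tested recipe.

```python
# Illustrative EAGLE3 configuration assembled from the fields documented above.
eagle3_config = {
    "eagle_decoder_type": "llama",
    "eagle_freeze_base_model": True,       # train only the draft module
    "eagle_self_logit_distillation": True,
    "eagle_loss_decay_factor": 0.9,
    "eagle_architecture_config": {
        "num_hidden_layers": 1,
        "use_aux_hidden_state": True,      # the defining EAGLE3 feature
        "use_last_layernorm": True,        # required with aux hidden states
    },
}
# mtsp.convert(model, [("eagle", eagle3_config)])
```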