130 changes: 31 additions & 99 deletions docs/source/guides/5_speculative_decoding.rst
Speculative Decoding
====================

Introduction
============

ModelOpt's Speculative Decoding module (:mod:`modelopt.torch.speculative <modelopt.torch.speculative>`)
enables your model to generate multiple tokens in each generate step, reducing inference latency.

ModelOpt implements the **EAGLE3** algorithm, which attaches a lightweight autoregressive draft
module to a frozen base model. The draft module operates at the feature level (predicting future
hidden states rather than tokens directly) to achieve high acceptance rates at low compute cost.

.. toctree::
    :maxdepth: 1
    :caption: Module Guide

    ./_speculative_module_guide.rst

.. toctree::
    :maxdepth: 1
    :caption: EAGLE

    ./_eagle_workflow.rst
    ./_eagle_config_reference.rst
    ./_eagle_best_practices.rst

.. _speculative-concepts:

Speculative Decoding Concepts
=============================

Speculative decoding
--------------------

The standard way of generating text from a language model is autoregressive decoding: one token
is generated per step and appended to the input context for the next step. This means that
generating *K* tokens takes *K* serial runs of the model. Inference from large autoregressive
models like Transformers can therefore be slow and expensive, so various *speculative decoding*
algorithms have been proposed to accelerate text generation, especially in latency-critical
applications.

Typically, a short draft of length *K* is generated using a faster model, called the *draft model*.
Then, a larger and more powerful model, called the *target model*, verifies the draft in a single
forward pass. A sampling scheme decides which draft tokens to accept, recovering the output
distribution of the target model in the process.
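
The draft-then-verify loop can be sketched with toy stand-in models. This is an illustrative
Python sketch of the general scheme, not ModelOpt's implementation; ``draft_model`` and
``target_model`` are hypothetical greedy next-token callables over integer "tokens".

```python
def speculative_step(context, draft_model, target_model, k=4):
    """One speculative step: draft k tokens cheaply, then verify with the target."""
    # 1) Draft k tokens autoregressively with the cheap draft model.
    draft_tokens = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_model(ctx)
        draft_tokens.append(tok)
        ctx.append(tok)

    # 2) Verify. A real implementation scores all k positions in ONE target
    #    forward pass; here we loop over prefixes for clarity.
    accepted = []
    ctx = list(context)
    for tok in draft_tokens:
        target_tok = target_model(ctx)
        if target_tok != tok:
            accepted.append(target_tok)  # first mismatch: take the target's token
            break
        accepted.append(tok)             # match: accept the draft token for free
        ctx.append(tok)
    return list(context) + accepted

# Toy greedy "models" over integer tokens: the draft occasionally disagrees.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) % 5 else 0

print(speculative_step([1, 2], draft, target, k=4))  # → [1, 2, 3, 4, 5, 6]
```

Here three draft tokens were accepted and the target supplied a fourth, so four tokens were
produced for the cost of one target verification pass plus *k* cheap draft calls.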

EAGLE3 algorithm
----------------

EAGLE3 attaches a lightweight autoregressive decoder (the draft module) to a frozen base model.
Unlike token-level autoregression, the draft module operates at the *feature level*: it predicts
future hidden states, which are then projected to token logits. Autoregression over hidden states
is an easier task than over tokens, so the draft module achieves high prediction accuracy with low
compute cost.

Compared to earlier EAGLE versions, EAGLE3 uses auxiliary hidden states from **multiple intermediate
layers** of the base model as additional input to the draft decoder, not just the final layer output.
This richer signal enables the draft module to more accurately predict the base model's next-layer
representations, resulting in higher token acceptance rates and greater inference speedup.
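
Concretely, the draft decoder's input can be thought of as a concatenation of base-model
features. The sketch below uses toy Python lists and a hypothetical ``build_draft_input``
helper to show the idea of pulling in auxiliary layers; the actual tensor shapes and
projections inside ModelOpt differ.

```python
def build_draft_input(layer_hiddens, aux_layer_ids):
    """Concatenate the final hidden state with auxiliary intermediate ones."""
    features = list(layer_hiddens[-1])      # final-layer hidden state first
    for i in aux_layer_ids:
        features.extend(layer_hiddens[i])   # append each auxiliary layer's features
    return features

# A 4-layer base model with hidden size 2 (toy numbers).
hiddens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
print(build_draft_input(hiddens, aux_layer_ids=[1, 2]))
# → [0.7, 0.8, 0.3, 0.4, 0.5, 0.6]
```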
107 changes: 107 additions & 0 deletions docs/source/guides/_eagle_best_practices.rst
.. _eagle-best-practices:

Best Practices
====================

This page collects practical recommendations for achieving the best results when training EAGLE
speculative decoding models.


.. _eagle-best-practices-data-synthesis:

Data Synthesis
--------------

Training on conversations **generated by the base model** rather than human-authored datasets
significantly improves token acceptance rates. The draft module learns to predict the target
model's actual output distribution, not just surface-level text patterns.

To prepare synthetic training data, launch an inference server with the base model:

.. code-block:: bash

pip install vllm
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--api-key token-abc123 \
--port 8000 \
--tensor-parallel-size 1

.. note::

For quantized models, add ``--quantization=modelopt`` to the ``vllm serve`` command.

Then generate conversations using prompts from your training dataset:

.. code-block:: bash

python scripts/server_generate.py \
--data_path input_conversations/daring-anteater.jsonl \
--output_path synthetic/train.jsonl

Use ``--system_prompt <text>`` to inject a system prompt into every conversation.

For large-scale generation across multiple GPUs, see the
`SLURM data preparation guide <https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/speculative_decoding/SLURM_prepare_data.md>`_.


.. _eagle-best-practices-configure:

Configuring the Draft Model
----------------------------

ModelOpt ships with sensible default architectures for EAGLE1 and EAGLE3. See
:ref:`eagle-config-reference` for the full list of configurable fields.

When launching training via ``launch_train.sh``, pass a JSON override file with
``--eagle_config <file>``. Only the fields you want to change need to be specified; omitted
fields fall back to the built-in defaults. For example, to use a 2-layer draft decoder with a
larger MLP:

.. code-block:: json

{
"num_hidden_layers": 2,
"intermediate_size": 8192
}



.. _eagle-best-practices-vocab:

Draft Vocabulary Compression
-----------------------------

By default the draft model shares the full vocabulary of the base model. For large vocabularies
(e.g., 128,256 tokens in Llama-3) you can compress the draft vocabulary to a smaller working set,
reducing embedding table size and speeding up both training and inference.

**Step 1 — Calibrate a vocabulary mapping**

Find the most frequently used tokens in your training set and save a ``d2t.pt`` mapping file:

.. code-block:: bash

python scripts/calibrate_draft_vocab.py \
--model meta-llama/Llama-3.2-1B-Instruct \
--data input_conversations/daring-anteater.jsonl \
--draft_vocab_size 32000 \
--save_dir draft_vocab_cache

The ``d2t.pt`` file maps each compressed draft token index to its offset in the target vocabulary.
During inference the target token is recovered as:

.. code-block:: text

target_token = draft_token_index + d2t[draft_token_index]
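
The offset arithmetic above is easy to illustrate with a toy mapping (hypothetical offsets, not
a real calibrated ``d2t.pt``):

```python
# Toy draft-to-target mapping: a 4-entry draft vocab whose entries map to
# scattered positions in a larger target vocab. A real mapping is stored in
# d2t.pt, one offset per draft-vocabulary entry.
d2t = [0, 2, 5, 9]

def to_target(draft_token_index):
    # target_token = draft_token_index + d2t[draft_token_index]
    return draft_token_index + d2t[draft_token_index]

print([to_target(i) for i in range(4)])  # → [0, 3, 7, 12]
```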

**Step 2 — Enable compressed vocabulary in training**

Add the following to your ``eagle_config.json``:

.. code-block:: json

{"draft_vocab_size": 32000}

Then pass ``--draft_vocab_cache <path_to_d2t.pt>`` when running ``./launch_train.sh``. The draft
model will use the compressed vocabulary table during both training and export.
102 changes: 102 additions & 0 deletions docs/source/guides/_eagle_config_reference.rst
.. _eagle-config-reference:

Configuration Reference
===============================

EAGLE3 is configured through a dict passed to :meth:`mtsp.convert()
<modelopt.torch.speculative.speculative_decoding.convert>`. The top-level keys correspond to
fields of :class:`EagleConfig <modelopt.torch.speculative.config.EagleConfig>`, with
``eagle_architecture_config`` containing a nested dict of draft module architecture settings.

.. code-block:: python

config = {
# --- EagleConfig top-level fields ---
"eagle_decoder_type": "llama",
"eagle_freeze_base_model": True,
"eagle_self_logit_distillation": True,
"eagle_offline": False,
"eagle_loss_decay_factor": 0.9,

# --- Draft module architecture ---
"eagle_architecture_config": {
"num_hidden_layers": 1,
"intermediate_size": 8192,
...
},
}
mtsp.convert(model, [("eagle", config)])


EagleConfig fields
------------------

``eagle_decoder_type`` (*str*, default: ``"llama"``)
Draft decoder architecture. Use ``"llama"`` for most models; ``"kimik2"`` for Kimi-K2 models.

``eagle_freeze_base_model`` (*bool*, default: ``True``)
Keep the base model weights frozen during training. Disabling this allows joint fine-tuning
but significantly increases memory usage.

``eagle_self_logit_distillation`` (*bool*, default: ``True``)
Apply logit-level distillation loss in addition to hidden-state regression. Improves token
acceptance rates without extra inference cost.

``eagle_offline`` (*bool*, default: ``False``)
Use pre-computed hidden states from disk instead of running the base model forward pass at
each training step. Required for large models (70B+) that cannot be co-located with the
draft module in GPU memory. See :ref:`Offline Training <speculative_decoding_workflow:Offline Training>`.

``eagle_loss_decay_factor`` (*float*, default: ``0.9``)
Exponential decay applied to losses at successive draft steps, weighting earlier steps more
heavily during training.
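
As a sketch of this weighting, assuming a total loss of the form ``sum(decay**i * loss_i)`` over
draft steps (the exact loss composition is ModelOpt-internal):

```python
def step_weights(num_steps, decay=0.9):
    # Earlier draft steps get weight close to 1; later steps decay geometrically.
    return [decay ** i for i in range(num_steps)]

per_step_losses = [0.50, 0.80, 1.10, 1.30]  # hypothetical per-step training losses
weights = step_weights(len(per_step_losses))
total = sum(w * l for w, l in zip(weights, per_step_losses))
```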

``eagle_architecture_config`` (*dict*, default: ``{}``)
Overrides for the draft module architecture. See `eagle_architecture_config fields`_ below.
``hidden_size``, ``vocab_size``, and ``max_position_embeddings`` are inferred from the base
model and should not be set here.


eagle_architecture_config fields
---------------------------------

These keys override the default draft module architecture. Only set the fields you need to
change; unspecified fields fall back to the defaults listed below (for ``eagle_decoder_type="llama"``).

``num_hidden_layers`` (*int*, default: ``1``)
Number of transformer layers in the draft decoder. Increasing this improves acceptance rates
at the cost of higher draft latency.

``intermediate_size`` (*int*, default: inferred from base model)
Feed-forward intermediate dimension of the draft decoder MLP.

``num_attention_heads`` (*int*, default: ``32``)
Number of attention heads in the draft decoder.

``num_key_value_heads`` (*int*, default: ``8``)
Number of key/value heads (grouped-query attention). Set equal to ``num_attention_heads``
to disable GQA.

``hidden_act`` (*str*, default: ``"silu"``)
Activation function used in the MLP layers.

``use_aux_hidden_state`` (*bool*, default: ``False``)
Feed auxiliary hidden states from intermediate base model layers into the draft decoder—the
key EAGLE3 feature. Set to ``True`` for EAGLE3; ``False`` gives EAGLE1 behaviour.

``eagle_aux_hidden_state_layer_ids`` (*list[int]*, default: ``[]``)
Indices of base model layers whose hidden states are used as auxiliary inputs. Populated
automatically when ``use_aux_hidden_state=True``; override only for custom layer selection.

``use_last_layernorm`` (*bool*, default: ``False``)
Apply a layer-norm after the last draft decoder layer. Required when
``use_aux_hidden_state=True`` (i.e., EAGLE3 mode).

``parallel_draft_step`` (*int*, default: ``1``)
Number of tokens drafted in parallel per step. Values greater than 1 enable parallel
speculative decoding and can further reduce latency on suitable hardware.

.. note::

The complete set of architecture fields and their defaults can be found in
:mod:`modelopt.torch.speculative.eagle.default_config`.
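
Putting the pieces together, an EAGLE3-style conversion config might look like the following.
The field values are illustrative, assembled from the defaults documented above as a sketch
rather than a tested recipe.

```python
# Illustrative EAGLE3 configuration assembled from the fields documented above.
eagle3_config = {
    "eagle_decoder_type": "llama",
    "eagle_freeze_base_model": True,       # train only the draft module
    "eagle_self_logit_distillation": True,
    "eagle_loss_decay_factor": 0.9,
    "eagle_architecture_config": {
        "num_hidden_layers": 1,
        "use_aux_hidden_state": True,      # the defining EAGLE3 feature
        "use_last_layernorm": True,        # required with aux hidden states
    },
}
# mtsp.convert(model, [("eagle", eagle3_config)])
```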