96 changes: 31 additions & 65 deletions examples/speculative_decoding/README.md
@@ -73,14 +73,16 @@ This one-line command runs a minimal example workflow of training and exporting
For small base models that fit in GPU memory, we can collocate them with draft models and train with the following command:

```bash
-./launch_train.sh --model $BASE_MODEL \
-  --output_dir $OUTPUT_DIR \
-  --data input_conversations/train.jsonl \
-  --num_epochs $NUM_EPOCH \
-  --eagle_config eagle_config.json
+./launch_train.sh \
+  --config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml \
+  model.model_name_or_path=meta-llama/Llama-3.2-1B \
+  data.data_path=input_conversations/train.jsonl \
+  training.output_dir=ckpts/llama-3.2-1b-online
```

-FSDP2 is used by default. To enable context parallelism for long-context training, specify `--cp_size n`.
+All default training settings live in `eagle3.yaml`; override any field via OmegaConf dotlist arguments on the command line.
+
+To enable context parallelism for long-context training, add `training.cp_size=<N>` to the overrides.
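The override mechanics can be sketched in plain Python. This is a simplified stand-in for what OmegaConf's dotlist merging does (the key names mirror the commands above, but the default values are made up, not the recipe's actual schema):

```python
# Simplified stand-in for OmegaConf dotlist merging: each "a.b.c=value"
# argument overrides one nested key in the loaded YAML config.
def apply_dotlist(config: dict, overrides: list[str]) -> dict:
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value  # OmegaConf additionally infers int/float/bool types
    return config

# Defaults as they might appear in eagle3.yaml (illustrative values only)
cfg = {
    "model": {"model_name_or_path": "meta-llama/Llama-3.1-8B"},
    "training": {"output_dir": "ckpts/default", "cp_size": 1},
}
apply_dotlist(cfg, [
    "model.model_name_or_path=meta-llama/Llama-3.2-1B",
    "training.output_dir=ckpts/llama-3.2-1b-online",
])
print(cfg["model"]["model_name_or_path"])  # meta-llama/Llama-3.2-1B
print(cfg["training"]["cp_size"])          # 1 (untouched keys keep their defaults)
```

Keys not named on the command line keep their YAML defaults, so a run only needs to spell out what differs from the recipe.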
The saved modelopt checkpoint is similar in architecture to HF models. It can be further optimized through **ModelOpt**, e.g., PTQ and QAT.

## Training Draft Model with Offline Base Model
@@ -113,15 +115,14 @@ python collect_hidden_states/compute_hidden_states_hf.py \

### Train Draft Model with Dumped Hidden States

-Once we finish dumping hidden states, launch offline training with an extra `--offline-data` argument:
+Once we finish dumping hidden states, launch offline training pointing to the hidden states directory:

```bash
-./launch_train.sh --model $BASE_MODEL \
-  --output_dir $OUTPUT_DIR \
-  --data $DATA \
-  --num_epochs $NUM_EPOCH \
-  --eagle_config eagle_config.json \
-  --offline-data $HIDDEN_STATES_DIR
+./launch_train.sh \
+  --config ../../modelopt_recipes/general/speculative_decoding/eagle3.yaml \
+  model.model_name_or_path=meta-llama/Llama-3.2-1B \
+  data.offline_data_path=$HIDDEN_STATES_DIR \
+  training.output_dir=ckpts/llama-3.2-1b-offline
```

## Model Validation
@@ -244,13 +245,13 @@ For large scale data generation, please see [SLURM prepare data](SLURM_prepare_d

### Configuring Draft Model

-For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings by providing an additional JSON dict. E.g. To use 2-layer eagle with 8192 intermediate size for MLP, set `eagle_config.json` to:
+For EAGLE‑1 and EAGLE‑3 we provide a [default model architecture config](https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/speculative/config.py#L37) in ModelOpt. You can override default settings via `eagle.eagle_architecture_config` in the YAML. E.g. to use a 2-layer EAGLE head with 8192 intermediate size:

-```json
-{
-"num_hidden_layers": 2,
-"intermediate_size":8192
-}
-```
+```yaml
+eagle:
+  eagle_architecture_config:
+    num_hidden_layers: 2
+    intermediate_size: 8192
+```

### Draft Vocabulary Compression
@@ -263,61 +264,26 @@ python scripts/calibrate_draft_vocab.py --model meta-llama/Llama-3.2-1B-Instruct

This will produce a `d2t.pt` file in `save_dir`, which is the mapping from draft token to target token. During inference, draft tokens can be mapped back to target tokens by `target_token = draft_token + d2t[draft_token]`.
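The mapping above can be illustrated with a toy table. A plain list stands in here for the tensor loaded from `d2t.pt`, and the offset values are made up:

```python
# Toy d2t table: entry i holds (target_id - draft_id) for draft token i.
# In practice this is the tensor produced by calibrate_draft_vocab.py.
d2t = [0, 4, 4, 9]

def draft_to_target(draft_token: int) -> int:
    # Mapping from the text: target_token = draft_token + d2t[draft_token]
    return draft_token + d2t[draft_token]

print(draft_to_target(2))  # 6: draft token 2 corresponds to target token 6
```

Because the draft vocabulary keeps only the most frequent target tokens, the table is small (`draft_vocab_size` entries) while still addressing the full target vocabulary.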

-Then, simply set `{"draft_vocab_size":32000}` in `eagle_config.json` and include `--draft_vocab_cache <path_to_d2t.pt>` when running `./launch_train.sh`. The draft model will use this provided vocab table during training and export.
+Then, set `eagle_architecture_config.draft_vocab_size: 32000` and `data.draft_vocab_cache: <path_to_d2t.pt>` in your YAML. The draft model will use this provided vocab table during training and export.

⚠️ Potential issue | 🟡 Minor

Use the full nested YAML path for draft_vocab_size.

The runtime schema nests this under eagle, so eagle_architecture_config.draft_vocab_size reads like a top-level key. Use eagle.eagle_architecture_config.draft_vocab_size to match the actual config structure.

Suggested fix
-Then, set `eagle_architecture_config.draft_vocab_size: 32000` and `data.draft_vocab_cache: <path_to_d2t.pt>` in your YAML. The draft model will use this provided vocab table during training and export.
+Then, set `eagle.eagle_architecture_config.draft_vocab_size: 32000` and `data.draft_vocab_cache: <path_to_d2t.pt>` in your YAML. The draft model will use this provided vocab table during training and export.


### Interact with `modelopt.torch.speculative`

-`main.py` provides an example for converting a HF base model for speculative decoding and training it. It consists of a few simple steps:
-First, load the base model and tokenizer from Hugging Face:
-
-```python
-model = transformers.AutoModelForCausalLM.from_pretrained(
-    "<path to your pretrained model>"
-)
-```
-
-Then, load default eagle config and make necessary overwrites:
-
-```python
-# Load default config
-config = {
-    "eagle1": EAGLE1_DEFAULT_CFG,
-    "eagle3": EAGLE3_DEFAULT_CFG,
-}[training_args.mode]["config"]
-
-# overwrite config with custom config
-config["eagle_architecture_config"].update({"<overwrite_keys>": "<overwrite_values>"})
-
-# Mandatory: hidden size, vocab size and max position embeddings must match base model
-config["eagle_architecture_config"].update(
-    {
-        "hidden_size": model.config.hidden_size,
-        "vocab_size": model.config.vocab_size,
-        "max_position_embeddings": model.config.max_position_embeddings,
-    }
-)
-```
-
-Then, we convert model to a speculative decoding model:
-
-```python
-mtsp.convert(model, [("eagle", config)])
-```
-
-This will modify the model in-place with eagle training forward, making it compatible with HF trainer:
-
-```python
-# Create a trainer
-trainer = transformers.Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
-trainer._move_model_to_device(model, trainer.args.device)
-
-# Enable HF checkpointing so that the saved model will contain the speculative decoding module
-mto.enable_huggingface_checkpointing()
-
-trainer.train(resume_from_checkpoint=checkpoint)
-trainer.save_state()
-trainer.save_model("<path to the output directory>")
-```
+`main.py` provides a complete example for converting a HF base model for speculative decoding and training it. The core steps are loading the base model, converting it with an eagle config dict, and training with HF Trainer:
+
+```python
+import modelopt.torch.speculative as mtsp
+
+# Convert base model in-place to an EAGLE speculative decoding model
+eagle_cfg = {"eagle_decoder_type": "llama", ...}  # fields from EagleConfig
+mtsp.convert(model, [("eagle", eagle_cfg)])
+
+# Train with HF Trainer as usual
+trainer = transformers.Trainer(model=model, ...)
+trainer.train()
+trainer.save_model("<output_dir>")
+```
+
+See `main.py` for the full example including tokenizer setup, dataset loading, and checkpoint handling.

## Support Matrix

2 changes: 0 additions & 2 deletions examples/speculative_decoding/eagle_config.json

This file was deleted.

1 change: 0 additions & 1 deletion examples/speculative_decoding/fsdp_config.json

This file was deleted.
