Training Draft for Large models (e.g. 70B) #593

@Ofir408

Hi,
I'm training an EAGLE-3 draft model for speculative decoding with a large base model such as Llama-70B:
```bash
bash train_eagle3_and_export.sh \
    --base_model /opt/ml/model/base_models/Llama3.3-70B-instruct \
    --data /opt/ml/model/data/train.jsonl \
    --num_gpu 8
```
From what I understand, this requires using the offline training mode. However, when I tried training without first saving hidden states, I ran into out-of-memory (OOM) errors. It looks like ~97% of the Llama-70B parameters remain trainable even after converting to an EAGLE-3 draft model.

Is this expected? Does the final conversion into an EAGLE-3 draft model happen only after training, via the export_hf_checkpoint.py script? Does this mean the base model remains mostly unfrozen during training?
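
For context, this is the kind of check I ran on the training-time model to get the ~97% figure, plus what I expected the setup to do instead. It is a minimal sketch against plain PyTorch, not the repo's code, and the helper names are mine:

```python
from torch import nn


def report_trainable(model: nn.Module) -> None:
    """Print how many parameters are trainable vs. total."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable / 1e9:.2f}B / {total / 1e9:.2f}B "
          f"({100 * trainable / total:.1f}%)")


def freeze_base(base_model: nn.Module) -> None:
    """What I expected the training setup to do: freeze the base model
    entirely so that only the (much smaller) draft head is optimized."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    base_model.eval()
```

Calling the first helper on the model right before the optimizer is built is where I saw the ~97% figure.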

Additionally, I tried dumping hidden states to disk with the run_hf_compute_hiddens_dp script you provided, but the process is extremely slow: roughly 4 days for only 120K examples on 8×A100 80GB GPUs.
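
For reference, the offline precomputation I have in mind looks roughly like the sketch below. This is a simplified stand-in for the run_hf_compute_hiddens_dp script, not its actual code; the output directory, layer selection, and lack of batching / data-parallel sharding are all placeholders:

```python
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified sketch of offline hidden-state dumping for draft-model training.
# Not the actual run_hf_compute_hiddens_dp script; real usage would batch
# inputs and shard the dataset across GPUs.
model_path = "/opt/ml/model/base_models/Llama3.3-70B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

texts = ["example prompt"]  # stand-in for rows loaded from train.jsonl
os.makedirs("hidden_states", exist_ok=True)

with torch.inference_mode():
    for i, text in enumerate(texts):
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # EAGLE-3-style training consumes a few intermediate hidden states in
        # addition to the last one; which layers exactly is repo-specific.
        hiddens = torch.stack(
            [h.squeeze(0).to(torch.bfloat16).cpu()
             for h in out.hidden_states[-4:]]
        )
        torch.save(hiddens, f"hidden_states/{i:08d}.pt")
```

Even in this simplified form, every example requires a full forward pass through the 70B base model, which matches the slowness I'm seeing.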

Is there a more efficient workflow or recommended approach for training a draft model for Llama-70B?

Thanks!
