Speculative Decoding #1066
Description
How would you like to use ModelOpt
I looked at the documentation for speculative decoding, and there are a few things that I couldn't find in the documentation.
- The support matrix does not list any model from the Kimi family, yet I see a drafter for K2 trained by NVIDIA. Are Kimi models like K2.5 supported or not? If they are supported, can the documentation and usage be updated accordingly?
- The advanced usage section shows that a model served with a vllm server can be used to generate the dataset. But for hidden-state extraction, only TRT-LLM is supported, right?
- The training section for the draft model focuses on HuggingFace, and I don't think that works for Kimi. In most cases we would either generate the dataset or extract the hidden states offline and then train, unless we have more resources. Can you please clarify the usage with an example?
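To make the second question concrete: for the vLLM path, my understanding (possibly wrong) is that the teacher only needs to expose the OpenAI-compatible endpoint. This is the request/record plumbing I have in mind; the base URL, model name, and ShareGPT-style output format are my own assumptions, not something I found in the docs:

```python
import json
import urllib.request


def build_chat_request(prompt, model="teacher", max_tokens=512):
    """Payload for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def to_conversation_record(prompt, completion):
    """Wrap one prompt/completion pair as a ShareGPT-style record
    (field names are a guess at what the training step expects)."""
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": completion},
        ]
    }


def generate_record(prompt, base_url="http://localhost:8000/v1"):
    """Query a running vLLM server and return one dataset record.
    Requires a live server at base_url (assumed, not started here)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.load(resp)
    return to_conversation_record(prompt, out["choices"][0]["message"]["content"])
```

Is this roughly the intended flow, or does the dataset-generation script expect a different record schema?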
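And for the third question, this is roughly the offline flow I imagine: dump conversations and hidden-state shards to disk, then pair them by sample id when building training batches. The file layout, the one-shard-per-sample naming, and the `id` field are all guesses on my part, not ModelOpt's actual format:

```python
import json
from pathlib import Path


def index_shards(shard_dir):
    """Map sample id -> hidden-state shard path, assuming one file per
    sample named <sample_id>.npz (a guessed convention)."""
    return {p.stem: p for p in Path(shard_dir).glob("*.npz")}


def paired_batches(conv_jsonl, shard_dir, batch_size=4):
    """Yield batches of (record, shard_path) pairs where every conversation
    has a matching hidden-state shard; unmatched records are skipped."""
    shards = index_shards(shard_dir)
    batch = []
    with open(conv_jsonl) as f:
        for line in f:
            rec = json.loads(line)
            if rec["id"] in shards:
                batch.append((rec, shards[rec["id"]]))
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch
```

If the actual draft-model trainer consumes something like this, an end-to-end example in the docs (generate → extract → train, without HuggingFace in the loop) would clear things up.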
Who can help?
- ?
System information
- Container used (if applicable): nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc8
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04
- CPU architecture (x86_64, aarch64): x86_64
- GPU name (e.g. H100, A100, L40S): H200
- GPU memory size: 140G
- Number of GPUs: 8
- Library versions (if applicable):
- Python: 3.12
- ModelOpt version or commit hash: 0.37
- CUDA: 13.1
- PyTorch: 2.9.1
- Transformers: 4.57.3
- TensorRT-LLM: 1.3.0rc8
- TensorRT: 10.14.1