Speculative Decoding #1066

@AakashKumarNain

Description

How would you like to use ModelOpt

I looked at the documentation for speculative decoding, and there are a few things that I couldn't find in the documentation.

  1. The support matrix does not list any model from the Kimi family, yet I see a drafter for K2 trained by NVIDIA. Are Kimi models like K2.5 supported or not? If they are supported, can the documentation and usage be updated accordingly?

  2. The advanced usage section shows that we can use a model served with a vLLM server to generate the dataset. But for hidden-state extraction, only TRT-LLM is supported, right?

  3. The training section for the draft model focuses on Hugging Face, and I don't think that works for Kimi. In most cases we would either generate the dataset or extract the hidden states offline for it and then train, unless we have more resources. Can you please clarify the usage with an example?
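For context on question 2, this is roughly how I am generating the dataset from the vLLM-served model. This is only a sketch of my assumed workflow, not ModelOpt's documented interface: the endpoint URL, model name, and JSONL record schema below are my assumptions, while the chat-completions payload itself follows vLLM's OpenAI-compatible API.

```python
# Sketch: generate draft-training data from a model behind a vLLM
# OpenAI-compatible endpoint, then store it as JSONL conversation records.
# ASSUMPTIONS: default vLLM port, Kimi model name, and the record schema.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed default vLLM endpoint


def build_request(prompt: str, model: str = "moonshotai/Kimi-K2-Instruct") -> dict:
    """Build a chat-completions payload for one prompt (model name is an assumption)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,   # deterministic outputs for distillation-style data
        "max_tokens": 512,
    }


def generate(prompt: str) -> str:
    """POST one request to the vLLM server and return the completion text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def to_record(prompt: str, completion: str) -> dict:
    """One JSONL record in a conversations-style format (schema is an assumption)."""
    return {
        "conversations": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
    }
```

The open question is whether records produced this way can feed the draft-training step directly, or whether a separate TRT-LLM pass is still required for hidden-state extraction.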

Who can help?

  • ?

System information

  • Container used (if applicable): nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc8
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): H200
  • GPU memory size: 140G
  • Number of GPUs: 8
  • Library versions (if applicable):
    • Python: 3.12
    • ModelOpt version or commit hash: 0.37
    • CUDA: 13.1
    • PyTorch: 2.9.1
    • Transformers: 4.57.3
    • TensorRT-LLM: 1.3.0rc8
    • TensorRT: 10.14.1
