ONNX models generated by llm_export.py are missing some input and output nodes #1147
Description
Describe the bug
I am using the Model-Optimizer/examples/torch_onnx/llm_export.py script to convert a .safetensors LLM to the ONNX format and quantize it. The ONNX model is then meant to be converted into a TensorRT engine for use with TensorRT. The produced ONNX model has "input_ids", "logits", and "present_key_values*" nodes, but is missing the "position_ids", "attention_mask", and "past_kv*" nodes.
Steps/Code to reproduce bug
Install packages
python -m pip install nvidia-modelopt[all]
python -m pip install onnx==1.18.0
python -m pip install onnxruntime[gpu]==1.23.0
and install any other packages requested while running llm_export.py. Set up paths:
export LD_LIBRARY_PATH=<path/to/cuda/libs>:<path/to/cudnn/lib>
export PATH=<path/to/cuda/bin>:$PATH
Clone the Model-Optimizer repo to use the example scripts:
git clone https://github.com/NVIDIA/Model-Optimizer.git
Navigate to the torch_onnx example:
cd Model-Optimizer/examples/torch_onnx
and launch conversion of the HF model to ONNX with INT4 quantization:
python llm_export.py --hf_model_path=meta-llama/Llama-3.1-8B-Instruct --dtype=int4_awq --calib_size=512 --output_dir=models/Llama-3.1-8B-Instruct-ONNX-INT4
Result: the produced ONNX model is missing the "position_ids", "attention_mask", and "past_kv*" nodes.
Expected behavior
A typical LLM ONNX graph should expose "input_ids", "attention_mask", "position_ids", "logits", and both past and present kv-cache nodes. Here several of them are missing.
System information
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 20.04
- CPU architecture (x86_64, aarch64): x86_64
- GPU memory size: enough
- Library versions (if applicable):
- Python: 3.12
- ModelOpt version or commit hash: >=0.39
- CUDA: 12.3
- PyTorch: 2.7.1+cu118
- Transformers: 4.57.3
- onnxruntime-gpu: 1.23.0