[Bug]: RuntimeError: Int8 not supported for this architecture #2048

@fenghuohuo2001

⚙️ Your current environment

I quantized Qwen3-VL-4B to int8 with the GPTQ method from the llmcompressor library and ran inference with vLLM 0.11.0 on an RTX 5090. Startup fails with: RuntimeError: Int8 not supported for this architecture.
The same setup works normally on an RTX 4090.
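
Since the error message refers to the GPU architecture, it may help to record the CUDA compute capability that PyTorch reports on each machine. To my understanding the RTX 4090 reports 8.9 (sm_89) and the RTX 5090 reports 12.0 (sm_120). The snippet below is only a diagnostic sketch: the helper names `format_capability` and `query_capability` are made up for illustration and are not part of vLLM or llmcompressor, and it does not check which architectures the vLLM int8 kernels actually support.

```python
from typing import Optional, Tuple


def format_capability(major: int, minor: int) -> str:
    """Render a CUDA compute capability as the usual 'sm_XY' tag."""
    return f"sm_{major}{minor}"


def query_capability(device_index: int = 0) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for a CUDA device, or None if none is usable.

    Hypothetical helper: it only wraps torch.cuda.get_device_capability
    and returns None when PyTorch or a CUDA device is not available.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(device_index))


if __name__ == "__main__":
    cap = query_capability()
    if cap is None:
        print("No CUDA device visible to PyTorch")
    else:
        print(f"Compute capability: {cap[0]}.{cap[1]} ({format_capability(*cap)})")
```

Running this on both the 4090 and the 5090 host and attaching the output would make the architecture difference between the working and failing setups explicit.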

🐛 Describe the bug

python3 -m vllm.entrypoints.openai.api_server --model /workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8 --served-model-name base_model --port 9001 --tensor-parallel-size 1 --dtype auto --enable-prefix-caching --enable-chunked-prefill --max-model-len 8000 --max-num-batched-tokens 15360 --limit-mm-per-prompt '{"image":30}' --compilation-config '{"level": 3,"compilation_mode":"default","compiler":"vllm-compiler","configs":{"model":{"vision_language":{"enable":true,"vision_encoder_compilation_mode":"disable","llm_compilation_mode":"enable"}}}}' --gpu-memory-utilization 0.7
INFO 11-17 19:59:36 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=203326) INFO 11-17 19:59:38 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=203326) INFO 11-17 19:59:38 [utils.py:233] non-default args: {'port': 9001, 'model': '/workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8', 'max_model_len': 8000, 'served_model_name': ['base_model'], 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 30}, 'max_num_batched_tokens': 15360, 'enable_chunked_prefill': True, 'compilation_config': {"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":null,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":null,"local_cache_dir":null}}
(APIServer pid=203326) INFO 11-17 19:59:38 [model.py:547] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=203326) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=203326) INFO 11-17 19:59:38 [model.py:1510] Using max model len 8000
(APIServer pid=203326) INFO 11-17 19:59:39 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=15360.
INFO 11-17 19:59:42 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:44 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:44 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8', speculative_config=None, tokenizer='/workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=base_model, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, 
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:45 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:46 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [gpu_model_runner.py:2602] Starting to load model /workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8...
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [compressed_tensors_w8a8_int8.py:52] Using CutlassScaledMMLinearKernel for CompressedTensorsW8A8Int8
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:01,  1.91it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:00<00:00,  2.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  2.15it/s]
(EngineCore_DP0 pid=203469) 
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:49 [default_loader.py:267] Loading weights took 1.48 seconds
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:49 [gpu_model_runner.py:2653] Model loading took 9.5183 GiB and 1.635616 seconds
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:49 [gpu_model_runner.py:3344] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:59 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2b9294305/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:59 [backends.py:559] Dynamo bytecode transform time: 5.03 s
(EngineCore_DP0 pid=203469) INFO 11-17 20:00:01 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=203469) INFO 11-17 20:00:18 [backends.py:218] Compiling a graph for dynamic shape takes 19.18 s
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     self.model_runner.profile_run()
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3392, in profile_run
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3152, in _dummy_run
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     outputs = self.model(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]               ^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1450, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 310, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 341, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     def forward(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     raise e
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "<eval_with_key>.58", line 317, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     submod_0 = self.submod_0(l_inputs_embeds_, s59, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s7);  l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 90, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return compiled_fn(full_args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 384, in runtime_wrapper
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     out = normalize_as_list(f(args))
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]                             ^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 750, in inner_fn
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     outs = compiled_fn(args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 556, in wrapper
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 584, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self.current_callable(inputs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2716, in run
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     out = model(new_inputs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]           ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/root/.cache/vllm/torch_compile_cache/b2b9294305/rank_0_0/inductor_cache/vu/cvut3jsiuhyscibpddl2xday6ys7ukt3gdq6tpn3lz2o37xwys7n.py", line 540, in call
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     torch.ops._C.cutlass_scaled_mm.default(buf7, buf0, arg4_1, buf2, arg5_1, arg6_1)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 829, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] RuntimeError: Int8 not supported for this architecture
(EngineCore_DP0 pid=203469) Process EngineCore_DP0:
(EngineCore_DP0 pid=203469) Traceback (most recent call last):
(EngineCore_DP0 pid=203469)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=203469)     self.run()
(EngineCore_DP0 pid=203469)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=203469)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=203469)     raise e
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=203469)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=203469)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=203469)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=203469)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=203469)     self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=203469)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=203469)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=203469)     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=203469)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=203469)     return func(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469)     return func(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=203469)     self.model_runner.profile_run()
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3392, in profile_run
(EngineCore_DP0 pid=203469)     = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_DP0 pid=203469)       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469)     return func(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3152, in _dummy_run
(EngineCore_DP0 pid=203469)     outputs = self.model(
(EngineCore_DP0 pid=203469)               ^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1450, in forward
(EngineCore_DP0 pid=203469)     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=203469)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 310, in __call__
(EngineCore_DP0 pid=203469)     output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=203469)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
(EngineCore_DP0 pid=203469)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 341, in forward
(EngineCore_DP0 pid=203469)     def forward(
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
(EngineCore_DP0 pid=203469)     return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(EngineCore_DP0 pid=203469)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(EngineCore_DP0 pid=203469)     raise e
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(EngineCore_DP0 pid=203469)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "<eval_with_key>.58", line 317, in forward
(EngineCore_DP0 pid=203469)     submod_0 = self.submod_0(l_inputs_embeds_, s59, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s7);  l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
(EngineCore_DP0 pid=203469)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 90, in __call__
(EngineCore_DP0 pid=203469)     return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
(EngineCore_DP0 pid=203469)     return compiled_fn(full_args)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 384, in runtime_wrapper
(EngineCore_DP0 pid=203469)     all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=203469)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=203469)     out = normalize_as_list(f(args))
(EngineCore_DP0 pid=203469)                             ^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 750, in inner_fn
(EngineCore_DP0 pid=203469)     outs = compiled_fn(args)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 556, in wrapper
(EngineCore_DP0 pid=203469)     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 584, in __call__
(EngineCore_DP0 pid=203469)     return self.current_callable(inputs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2716, in run
(EngineCore_DP0 pid=203469)     out = model(new_inputs)
(EngineCore_DP0 pid=203469)           ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469)   File "/root/.cache/vllm/torch_compile_cache/b2b9294305/rank_0_0/inductor_cache/vu/cvut3jsiuhyscibpddl2xday6ys7ukt3gdq6tpn3lz2o37xwys7n.py", line 540, in call
(EngineCore_DP0 pid=203469)     torch.ops._C.cutlass_scaled_mm.default(buf7, buf0, arg4_1, buf2, arg5_1, arg6_1)
(EngineCore_DP0 pid=203469)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 829, in __call__
(EngineCore_DP0 pid=203469)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=203469)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) RuntimeError: Int8 not supported for this architecture
[rank0]:[W1117 20:00:20.593756440 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=203326) Traceback (most recent call last):
(APIServer pid=203326)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=203326)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1953, in <module>
(APIServer pid=203326)     uvloop.run(run_server(args))
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=203326)     return __asyncio.run(
(APIServer pid=203326)            ^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=203326)     return runner.run(main)
(APIServer pid=203326)            ^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=203326)     return self._loop.run_until_complete(task)
(APIServer pid=203326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=203326)     return await main
(APIServer pid=203326)            ^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=203326)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=203326)     async with build_async_engine_client(
(APIServer pid=203326)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=203326)     return await anext(self.gen)
(APIServer pid=203326)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=203326)     async with build_async_engine_client_from_engine_args(
(APIServer pid=203326)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=203326)     return await anext(self.gen)
(APIServer pid=203326)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=203326)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=203326)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=203326)     return fn(*args, **kwargs)
(APIServer pid=203326)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=203326)     return cls(
(APIServer pid=203326)            ^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=203326)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=203326)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=203326)     return AsyncMPClient(*client_args)
(APIServer pid=203326)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=203326)     super().__init__(
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=203326)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=203326)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=203326)     next(self.gen)
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=203326)     wait_for_engine_startup(
(APIServer pid=203326)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=203326)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=203326) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

🛠️ Steps to reproduce

I have carefully read the content of vllm-project/vllm#27337, but it does not mention a solution. Could you please provide a way to resolve this issue?
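
One detail that may help with triage is the CUDA compute capability each card reports, since the failing call is the `cutlass_scaled_mm` op and the error message refers to the architecture. The following is a minimal diagnostic sketch, not a fix. It relies only on `torch.cuda.get_device_capability`; the helper names `arch_tag` and `detect_arch` are hypothetical and introduced here for illustration. An RTX 4090 reports capability 8.9 and an RTX 5090 reports 12.0.

```python
def arch_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an 'sm_XY' tag."""
    return f"sm_{major}{minor}"


def detect_arch():
    """Return the arch tag of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper also loads on machines without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(0)
    return arch_tag(major, minor)


if __name__ == "__main__":
    # Run this on both machines and include the output in the report.
    print("GPU 0 architecture:", detect_arch())
```

Running this on both the RTX 4090 and RTX 5090 hosts and attaching the output would show which architecture the installed vLLM build is being asked to serve INT8 kernels for.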
