Open
Labels: bug (Something isn't working)
Description
⚙️ Your current environment
I used the GPTQ method from the llmcompressor library to quantize Qwen3-VL-4B to int8. Running inference on the quantized model with vLLM 0.11.0 on an RTX 5090 fails with the following error: RuntimeError: Int8 not supported for this architecture
However, it works normally on an RTX 4090 graphics card.
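One difference between the two cards is their CUDA compute capability: the RTX 4090 (Ada) reports 8.9 and the RTX 5090 (Blackwell) reports 12.0. That this is what makes the int8 kernel dispatch fail is my assumption, not something the log confirms. Below is a minimal sketch for checking which architecture a machine reports. `sm_name` and `local_gpu_arch` are hypothetical helper names, and the only PyTorch calls used are `torch.cuda.is_available` and `torch.cuda.get_device_capability`.

```python
def sm_name(major: int, minor: int) -> str:
    """Format a CUDA compute capability pair as an sm_XY architecture name."""
    return f"sm_{major}{minor}"


def local_gpu_arch():
    """Return the sm_XY name of GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # get_device_capability returns a (major, minor) tuple for the given device
    return sm_name(*torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    print("RTX 4090 (Ada):", sm_name(8, 9))
    print("RTX 5090 (Blackwell):", sm_name(12, 0))
    print("this machine:", local_gpu_arch())
```

Running this on both hosts and attaching the output would show whether the failing machine reports the newer architecture.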
🐛 Describe the bug
python3 -m vllm.entrypoints.openai.api_server --model /workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8 --served-model-name base_model --port 9001 --tensor-parallel-size 1 --dtype auto --enable-prefix-caching --enable-chunked-prefill --max-model-len 8000 --max-num-batched-tokens 15360 --limit-mm-per-prompt '{"image":30}' --compilation-config '{"level": 3,"compilation_mode":"default","compiler":"vllm-compiler","configs":{"model":{"vision_language":{"enable":true,"vision_encoder_compilation_mode":"disable","llm_compilation_mode":"enable"}}}}' --gpu-memory-utilization 0.7
INFO 11-17 19:59:36 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=203326) INFO 11-17 19:59:38 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=203326) INFO 11-17 19:59:38 [utils.py:233] non-default args: {'port': 9001, 'model': '/workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8', 'max_model_len': 8000, 'served_model_name': ['base_model'], 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True, 'limit_mm_per_prompt': {'image': 30}, 'max_num_batched_tokens': 15360, 'enable_chunked_prefill': True, 'compilation_config': {"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":null,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":null,"local_cache_dir":null}}
(APIServer pid=203326) INFO 11-17 19:59:38 [model.py:547] Resolved architecture: Qwen2_5_VLForConditionalGeneration
(APIServer pid=203326) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=203326) INFO 11-17 19:59:38 [model.py:1510] Using max model len 8000
(APIServer pid=203326) INFO 11-17 19:59:39 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=15360.
INFO 11-17 19:59:42 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:44 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:44 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8', speculative_config=None, tokenizer='/workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=base_model, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, 
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:45 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:46 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [gpu_model_runner.py:2602] Starting to load model /workspace/Qwen2.5-VL-7B-Instruct-20250829-quantized.w8a8...
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [compressed_tensors_w8a8_int8.py:52] Using CutlassScaledMMLinearKernel for CompressedTensorsW8A8Int8
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:47 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.91it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:00<00:00, 2.91it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 2.15it/s]
(EngineCore_DP0 pid=203469)
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:49 [default_loader.py:267] Loading weights took 1.48 seconds
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:49 [gpu_model_runner.py:2653] Model loading took 9.5183 GiB and 1.635616 seconds
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:49 [gpu_model_runner.py:3344] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:59 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/b2b9294305/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=203469) INFO 11-17 19:59:59 [backends.py:559] Dynamo bytecode transform time: 5.03 s
(EngineCore_DP0 pid=203469) INFO 11-17 20:00:01 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=203469) INFO 11-17 20:00:18 [backends.py:218] Compiling a graph for dynamic shape takes 19.18 s
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] self.model_runner.profile_run()
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3392, in profile_run
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3152, in _dummy_run
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] outputs = self.model(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1450, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] hidden_states = self.language_model.model(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 310, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 341, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] def forward(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] raise e
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "<eval_with_key>.58", line 317, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] submod_0 = self.submod_0(l_inputs_embeds_, s59, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s7); l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 90, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return compiled_fn(full_args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 384, in runtime_wrapper
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] out = normalize_as_list(f(args))
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 750, in inner_fn
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] outs = compiled_fn(args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 556, in wrapper
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return compiled_fn(runtime_args)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 584, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self.current_callable(inputs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2716, in run
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] out = model(new_inputs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/root/.cache/vllm/torch_compile_cache/b2b9294305/rank_0_0/inductor_cache/vu/cvut3jsiuhyscibpddl2xday6ys7ukt3gdq6tpn3lz2o37xwys7n.py", line 540, in call
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] torch.ops._C.cutlass_scaled_mm.default(buf7, buf0, arg4_1, buf2, arg5_1, arg6_1)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 829, in __call__
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) ERROR 11-17 20:00:19 [core.py:708] RuntimeError: Int8 not supported for this architecture
(EngineCore_DP0 pid=203469) Process EngineCore_DP0:
(EngineCore_DP0 pid=203469) Traceback (most recent call last):
(EngineCore_DP0 pid=203469) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=203469) self.run()
(EngineCore_DP0 pid=203469) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=203469) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=203469) raise e
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=203469) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=203469) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 92, in __init__
(EngineCore_DP0 pid=203469) self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 190, in _initialize_kv_caches
(EngineCore_DP0 pid=203469) self.model_executor.determine_available_memory())
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 85, in determine_available_memory
(EngineCore_DP0 pid=203469) return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=203469) return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=203469) return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469) return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 263, in determine_available_memory
(EngineCore_DP0 pid=203469) self.model_runner.profile_run()
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3392, in profile_run
(EngineCore_DP0 pid=203469) = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=203469) return func(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3152, in _dummy_run
(EngineCore_DP0 pid=203469) outputs = self.model(
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1450, in forward
(EngineCore_DP0 pid=203469) hidden_states = self.language_model.model(
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 310, in __call__
(EngineCore_DP0 pid=203469) output = self.compiled_callable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
(EngineCore_DP0 pid=203469) return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 341, in forward
(EngineCore_DP0 pid=203469) def forward(
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
(EngineCore_DP0 pid=203469) return super().__call__(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469) return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
(EngineCore_DP0 pid=203469) return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
(EngineCore_DP0 pid=203469) raise e
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
(EngineCore_DP0 pid=203469) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
(EngineCore_DP0 pid=203469) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
(EngineCore_DP0 pid=203469) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "<eval_with_key>.58", line 317, in forward
(EngineCore_DP0 pid=203469) submod_0 = self.submod_0(l_inputs_embeds_, s59, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, s18, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s7); l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = l_self_modules_layers_modules_0_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 121, in __call__
(EngineCore_DP0 pid=203469) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 90, in __call__
(EngineCore_DP0 pid=203469) return self.compiled_graph_for_general_shape(*args)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
(EngineCore_DP0 pid=203469) return fn(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1241, in forward
(EngineCore_DP0 pid=203469) return compiled_fn(full_args)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 384, in runtime_wrapper
(EngineCore_DP0 pid=203469) all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=203469) out = normalize_as_list(f(args))
(EngineCore_DP0 pid=203469) ^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 750, in inner_fn
(EngineCore_DP0 pid=203469) outs = compiled_fn(args)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 556, in wrapper
(EngineCore_DP0 pid=203469) return compiled_fn(runtime_args)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 584, in __call__
(EngineCore_DP0 pid=203469) return self.current_callable(inputs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2716, in run
(EngineCore_DP0 pid=203469) out = model(new_inputs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) File "/root/.cache/vllm/torch_compile_cache/b2b9294305/rank_0_0/inductor_cache/vu/cvut3jsiuhyscibpddl2xday6ys7ukt3gdq6tpn3lz2o37xwys7n.py", line 540, in call
(EngineCore_DP0 pid=203469) torch.ops._C.cutlass_scaled_mm.default(buf7, buf0, arg4_1, buf2, arg5_1, arg6_1)
(EngineCore_DP0 pid=203469) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 829, in __call__
(EngineCore_DP0 pid=203469) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=203469) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=203469) RuntimeError: Int8 not supported for this architecture
[rank0]:[W1117 20:00:20.593756440 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=203326) Traceback (most recent call last):
(APIServer pid=203326) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=203326) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1953, in <module>
(APIServer pid=203326) uvloop.run(run_server(args))
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=203326) return __asyncio.run(
(APIServer pid=203326) ^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=203326) return runner.run(main)
(APIServer pid=203326) ^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=203326) return self._loop.run_until_complete(task)
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=203326) return await main
(APIServer pid=203326) ^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=203326) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=203326) async with build_async_engine_client(
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=203326) return await anext(self.gen)
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=203326) async with build_async_engine_client_from_engine_args(
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=203326) return await anext(self.gen)
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=203326) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=203326) return fn(*args, **kwargs)
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=203326) return cls(
(APIServer pid=203326) ^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=203326) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=203326) return AsyncMPClient(*client_args)
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=203326) super().__init__(
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=203326) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=203326) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=203326) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=203326) next(self.gen)
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=203326) wait_for_engine_startup(
(APIServer pid=203326) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=203326) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=203326) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
🛠️ Steps to reproduce
I have read vllm-project/vllm#27337 carefully, but it does not offer a solution. Is there a way to run this int8 (W8A8) quantized model on the RTX 5090?
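For triage, it may help to attach the compute capability that PyTorch reports on each machine, since the failure is raised from `torch.ops._C.cutlass_scaled_mm` and only shows up on the RTX 5090. Below is a minimal sketch for collecting that. `sm_tag` and `report_gpu_arch` are hypothetical helper names, not part of vLLM. The script only prints what `torch.cuda` reports and makes no claim about which architectures the int8 kernel supports.

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability pair as an sm_XY tag."""
    return f"sm_{major}{minor}"


def report_gpu_arch() -> list[str]:
    """Return one 'name: sm_XY' line per visible CUDA device.

    Returns an empty list when torch is not installed or no CUDA
    device is visible, so the script is safe to run anywhere.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    lines = []
    for idx in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(idx)
        major, minor = torch.cuda.get_device_capability(idx)
        lines.append(f"{name}: {sm_tag(major, minor)}")
    return lines


if __name__ == "__main__":
    for line in report_gpu_arch():
        print(line)
```

Running this on both the RTX 4090 and the RTX 5090 hosts and pasting the output next to the traceback would show which architecture tag each environment reports.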