Description
Your current environment
I can't get vLLM to start with the configuration below. It appears to fail while loading the model's .safetensors files. Any ideas what could be causing this?
vllm version: 0.11.1
CPU: Intel Xeon w7-2595X
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
Model: https://huggingface.co/openai/gpt-oss-120b/tree/main
Command:
docker run --rm --name vllm --gpus=all --runtime=nvidia -p 8000:8000 -e HF_HUB_OFFLINE=1 --ipc=host -v /opt/models/cache/:/root/.cache/huggingface/hub vllm/vllm-openai:latest --model openai/gpt-oss-120b
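Before launching the container, it may be worth verifying that the cached shards are intact. Below is a minimal sketch (the host-side cache location is an assumption based on the -v mapping above) that computes a SHA-256 per shard; the digests can be compared against the checksums shown for each .safetensors file on the model's Hugging Face "Files and versions" page.

```python
# Hedged sketch: hash every cached .safetensors shard so the digests can
# be compared against the SHA256 values published on the Hub.
# The cache path below is an assumption taken from the -v mapping.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

cache = Path("/opt/models/cache")  # assumed host-side HF cache location
for shard in sorted(cache.rglob("*.safetensors")):
    print(f"{shard.name}  {sha256_of(shard)}")
```

A shard whose digest does not match the Hub's listing (or whose size is obviously short) points at a broken download rather than a vLLM problem.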
Output:
INFO 11-12 06:23:18 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 11-12 06:23:21 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 11-12 06:23:21 [utils.py:233] non-default args: {'model': 'openai/gpt-oss-120b'}
(APIServer pid=1) INFO 11-12 06:23:21 [arg_utils.py:504] HF_HUB_OFFLINE is True, replace model_id [openai/gpt-oss-120b] to model_path [/root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a]
(APIServer pid=1) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=1) INFO 11-12 06:23:26 [model.py:547] Resolved architecture: GptOssForCausalLM
(APIServer pid=1) ERROR 11-12 06:23:26 [config.py:278] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a'. Use repo_type argument if needed., retrying 1 of 2
(APIServer pid=1) ERROR 11-12 06:23:28 [config.py:276] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a'. Use repo_type argument if needed.
(APIServer pid=1) INFO 11-12 06:23:28 [model.py:1730] Downcasting torch.float32 to torch.bfloat16.
(APIServer pid=1) INFO 11-12 06:23:28 [model.py:1510] Using max model len 131072
(APIServer pid=1) INFO 11-12 06:23:29 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 11-12 06:23:29 [config.py:271] Overriding max cuda graph capture size to 992 for performance.
INFO 11-12 06:23:31 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=308) INFO 11-12 06:23:33 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=308) INFO 11-12 06:23:33 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, 
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":992,"local_cache_dir":null}
(EngineCore_DP0 pid=308) W1112 06:23:33.831000 308 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_DP0 pid=308) W1112 06:23:33.831000 308 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W1112 06:23:34.766974120 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W1112 06:23:34.782231401 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=308) INFO 11-12 06:23:34 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=308) INFO 11-12 06:23:34 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=308) INFO 11-12 06:23:34 [gpu_model_runner.py:2602] Starting to load model /root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots/b5c939de8f754692c1647ca79fbf85e8c1e70f8a...
(EngineCore_DP0 pid=308) INFO 11-12 06:23:35 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=308) INFO 11-12 06:23:35 [cuda.py:361] Using Triton backend on V1 engine.
(EngineCore_DP0 pid=308) INFO 11-12 06:23:35 [mxfp4.py:98] Using Marlin backend
(EngineCore_DP0 pid=308)
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
(EngineCore_DP0 pid=308)
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
(EngineCore_DP0 pid=308)
(EngineCore_DP0 pid=308) Process EngineCore_DP0:
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     self.collective_rpc("load_model")
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     return func(*args, **kwargs)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 264, in load_weights
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     loaded_weights = model.load_weights(
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 712, in load_weights
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 294, in load_weights
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 243, in _load_module
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     for child_prefix, child_weights in self._groupby_prefix(weights):
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 132, in _groupby_prefix
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     for prefix, group in itertools.groupby(weights_by_parts,
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 130, in
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     for weight_name, weight_data in weights)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     ^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     weights = ((name, weight) for name, weight in weights
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     ^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 67, in
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     return ((out_name, data) for name, data in weights
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     ^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 246, in get_all_weights
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     yield from self._get_weights_iterator(primary_weights)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 230, in
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     for (name, tensor) in weights_iterator)
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 584, in safetensors_weights_iterator
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]     with safe_open(st_file, framework="pt") as f:
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) ERROR 11-12 06:23:35 [core.py:708] safetensors_rust.SafetensorError: Error while deserializing header: invalid JSON in header: EOF while parsing a value at line 1 column 0
(EngineCore_DP0 pid=308) Traceback (most recent call last):
(EngineCore_DP0 pid=308)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=308)     self.run()
(EngineCore_DP0 pid=308)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=308)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=308)     raise e
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=308)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=308)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=308)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=308)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=308)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=308)     self._init_executor()
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 55, in _init_executor
(EngineCore_DP0 pid=308)     self.collective_rpc("load_model")
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP0 pid=308)     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP0 pid=308)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3122, in run_method
(EngineCore_DP0 pid=308)     return func(*args, **kwargs)
(EngineCore_DP0 pid=308)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(EngineCore_DP0 pid=308)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2635, in load_model
(EngineCore_DP0 pid=308)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=308)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=308)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 264, in load_weights
(EngineCore_DP0 pid=308)     loaded_weights = model.load_weights(
(EngineCore_DP0 pid=308)                      ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gpt_oss.py", line 712, in load_weights
(EngineCore_DP0 pid=308)     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=308)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 294, in load_weights
(EngineCore_DP0 pid=308)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=308)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 243, in _load_module
(EngineCore_DP0 pid=308)     for child_prefix, child_weights in self._groupby_prefix(weights):
(EngineCore_DP0 pid=308)                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 132, in _groupby_prefix
(EngineCore_DP0 pid=308)     for prefix, group in itertools.groupby(weights_by_parts,
(EngineCore_DP0 pid=308)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 130, in
(EngineCore_DP0 pid=308)     for weight_name, weight_data in weights)
(EngineCore_DP0 pid=308)     ^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 291, in
(EngineCore_DP0 pid=308)     weights = ((name, weight) for name, weight in weights
(EngineCore_DP0 pid=308)     ^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 67, in
(EngineCore_DP0 pid=308)     return ((out_name, data) for name, data in weights
(EngineCore_DP0 pid=308)     ^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 246, in get_all_weights
(EngineCore_DP0 pid=308)     yield from self._get_weights_iterator(primary_weights)
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 230, in
(EngineCore_DP0 pid=308)     for (name, tensor) in weights_iterator)
(EngineCore_DP0 pid=308)     ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/weight_utils.py", line 584, in safetensors_weights_iterator
(EngineCore_DP0 pid=308)     with safe_open(st_file, framework="pt") as f:
(EngineCore_DP0 pid=308)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=308) safetensors_rust.SafetensorError: Error while deserializing header: invalid JSON in header: EOF while parsing a value at line 1 column 0
[rank0]:[W1112 06:23:35.068758912 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "", line 198, in _run_module_as_main
(APIServer pid=1)   File "", line 88, in _run_code
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1953, in
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=1)     return fn(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
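The root-cause error ("invalid JSON in header: EOF while parsing a value at line 1 column 0") usually indicates a truncated or zero-byte shard rather than a vLLM bug: a .safetensors file begins with an 8-byte little-endian header length followed by that many bytes of JSON, and an empty file yields exactly this EOF-at-column-0 parse error. The stdlib-only sketch below checks each shard's header; the snapshot path is taken from the log and is an assumption about where the cache is visible.

```python
# Hedged diagnostic sketch: validate the header of every .safetensors
# shard in the snapshot directory without any third-party dependencies.
import json
import struct
from pathlib import Path

def check_safetensors_header(path: Path) -> str:
    """Check the 8-byte length prefix and JSON header of one shard."""
    with path.open("rb") as f:
        raw_len = f.read(8)
        if len(raw_len) < 8:
            return "truncated: missing 8-byte header length"
        (header_len,) = struct.unpack("<Q", raw_len)
        header = f.read(header_len)
        if len(header) < header_len:
            return f"truncated: header claims {header_len} bytes"
        try:
            json.loads(header)
        except json.JSONDecodeError as exc:
            return f"corrupt header JSON: {exc}"
    return "header OK"

# Path copied from the log above; adjust to wherever the cache is mounted.
snapshot = Path("/root/.cache/huggingface/hub"
                "/models--openai--gpt-oss-120b/snapshots"
                "/b5c939de8f754692c1647ca79fbf85e8c1e70f8a")
for shard in sorted(snapshot.glob("*.safetensors")):
    print(f"{shard.name}: {check_safetensors_header(shard)}")
```

Any shard flagged as truncated can be deleted from the cache and re-downloaded (with HF_HUB_OFFLINE unset) before retrying the docker command.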
How would you like to use vllm
I want to run inference of [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b). I don't know how to integrate it with vLLM.