
PD-disaggregated deployment of the DeepSeek-R1-FP8 model: starting the prefill service with tp=16 GPUs fails with an error #1074

@wenruihua

Description

[Gloo] Rank 0 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 0
[Gloo] Rank 1 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 1
[Gloo] Rank 2 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 2
[Gloo] Rank 3 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 3
[Gloo] Rank 4 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 4
[Gloo] Rank 5 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 5
[Gloo] Rank 6 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 6
[Gloo] Rank 7 is connected to 15 peer ranks. Expected number of connected peer ranks is : 15
INFO 09-30 03:02:12 [prefill_impl.py:33] lock_nccl_group ranks 7
INFO 09-30 03:02:12 [manager.py:193] use req queue ChunkedPrefillQueue
INFO 09-30 03:02:14 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
All deep_gemm operations loaded successfully!
INFO 09-30 03:02:15 [__init__.py:216] Automatically detected platform cuda.
WARNING 09-30 03:02:15 [light_utils.py:13] lightllm_kernel is not installed, you can't use the api of it.
WARNING 09-30 03:02:16 [nixl_kv_transporter.py:19] nixl is not installed, which is required for pd disagreggation!!!
Process Process-2:9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/lightllm/lightllm/server/router/model_infer/mode_backend/continues_batch/pd_mode/prefill_node_impl/prefill_kv_move_manager.py", line 233, in _init_env
    manager = PrefillKVMoveManager(args, info_queue, mem_queues)
  File "/lightllm/lightllm/server/router/model_infer/mode_backend/continues_batch/pd_mode/prefill_node_impl/prefill_kv_move_manager.py", line 40, in __init__
    assert self.dp_world_size <= self.node_world_size
AssertionError
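
For context on the failing check: the assertion in PrefillKVMoveManager.__init__ appears to require that one data-parallel group fits on a single node. Below is a minimal sketch, not lightllm's actual code; check_prefill_kv_move_config, tp_size, dp, and gpus_per_node are hypothetical names, and it assumes dp_world_size is derived as tp_size // dp while node_world_size is the per-node GPU count. Under those assumptions, tp=16 with dp=1 on 8-GPU nodes (ranks 0-7 visible in the Gloo/lock_nccl_group log above) would trip the same assertion.

```python
# Illustrative sketch only -- not lightllm's implementation.
# Assumption: dp_world_size = tp_size // dp (ranks spanned by one dp group),
#             node_world_size = gpus_per_node (ranks available on one node).
def check_prefill_kv_move_config(tp_size: int, dp: int, gpus_per_node: int) -> None:
    dp_world_size = tp_size // dp    # hypothetical derivation
    node_world_size = gpus_per_node  # hypothetical derivation
    # Mirrors the failing assertion: a dp group must fit on a single node.
    assert dp_world_size <= node_world_size, (
        f"dp_world_size={dp_world_size} > node_world_size={node_world_size}"
    )

# tp=16, dp=1 on 8-GPU nodes reproduces the AssertionError seen in the traceback.
try:
    check_prefill_kv_move_config(tp_size=16, dp=1, gpus_per_node=8)
except AssertionError as exc:
    print("AssertionError:", exc)
```

If that reading is right, the question is whether a multi-node tp=16 prefill service is expected to work here, or whether the configuration needs dp raised so that each dp group stays within one 8-GPU node.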
