Description
Commands and key results when running with prefill/decode (PD) disaggregation
1. etcd startup command
./ect/etcd --listen-peer-urls 'http://localhost:2390' --listen-client-urls 'http://localhost:2389' --advertise-client-urls 'http://localhost:2391'
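The etcd command above listens for clients on port 2389 but advertises port 2391. Whether that matters depends on how the clients discover endpoints, which this report does not show, so the following is only a minimal sketch for making the difference visible. The helper name `url_flag_ports` is hypothetical and not part of etcd or xllm.

```python
import shlex
from urllib.parse import urlsplit


def url_flag_ports(command: str) -> dict:
    """Return {flag: port} for every '--*-urls' flag in a command line."""
    tokens = shlex.split(command)  # also strips the single quotes
    ports = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag.startswith("--") and flag.endswith("-urls"):
            ports[flag] = urlsplit(value).port
    return ports


# The etcd command exactly as given in this report.
cmd = (
    "./ect/etcd --listen-peer-urls 'http://localhost:2390' "
    "--listen-client-urls 'http://localhost:2389' "
    "--advertise-client-urls 'http://localhost:2391'"
)
ports = url_flag_ports(cmd)
ports_differ = ports["--listen-client-urls"] != ports["--advertise-client-urls"]
print(ports, "listen/advertise client ports differ:", ports_differ)
```

The engines and xllm-server in this report connect directly to 127.0.0.1:2389, which is the listening port.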
2. xllm-server startup
ENABLE_DECODE_RESPONSE_TO_SERVICE=true ./build/xllm_service/xllm_master_serving --etcd_addr="127.0.0.1:2389" --http_server_port 28888 --rpc_server_port 28889 --tokenizer_path=/data2/ml/Qwen3-8B
3. xllm prefill engine startup
./build/xllm/core/server/xllm --model=/data2/ml/Qwen3-8B \
  --port=8010 \
  --devices="npu:0" \
  --master_node_addr="127.0.0.1:18888" \
  --enable_prefix_cache=false \
  --enable_chunked_prefill=false \
  --enable_disagg_pd=true \
  --instance_role=PREFILL \
  --etcd_addr=127.0.0.1:2389 \
  --transfer_listen_port=26000 \
  --disagg_pd_port=7777 \
  --node_rank=0 \
  --nnodes=1
Key logs:
Disagg PD server started on address 127.0.1.1:7777, idle_timeout_sec: -1, num_threads: 32
I20251118 21:12:34.040596 10905 disagg_pd_scheduler.cpp:106] Instance info: instance name = 127.0.1.1:8010, instance rpc_address = 127.0.1.1:7777, instance type = PREFILL
I20251118 21:12:34.042326 10905 xservice_client.cpp:186] Success register instance to etcd.
I20251118 21:12:34.071521 10905 jinja_chat_template.cpp:30] Jinja chat template init succeed.
I20251118 21:12:34.434703 10905 server.cpp:1200] Server[xllm::APIService] is serving on port=8010.
I20251118 21:12:34.434741 10905 server.cpp:1203] Check out http://:8010 in web browser.
I20251118 21:12:34.434748 10905 xllm_server.cpp:59] Brpc Server started on port 8010, idle_timeout_s: -1, num_threads: 32
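The "Instance info" line above carries the name, RPC address, and role that the engine registers with. As a minimal sketch, assuming the log lines follow the standard glog prefix seen in this report, the fields can be extracted like this. `parse_instance_info` and `InstanceInfo` are hypothetical helper names, not part of xllm.

```python
import re
from typing import NamedTuple, Optional

# glog prefix: <level><yyyymmdd> <hh:mm:ss.uuuuuu> <thread> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)
# Message layout copied from the disagg_pd_scheduler.cpp line in this report.
INSTANCE_INFO = re.compile(
    r"instance name = (?P<name>[^,\s]+), "
    r"instance rpc_address = (?P<rpc>[^,\s]+), "
    r"instance type = (?P<role>\w+)"
)


class InstanceInfo(NamedTuple):
    name: str
    rpc_address: str
    role: str


def parse_instance_info(line: str) -> Optional[InstanceInfo]:
    """Return the instance fields from one glog line, or None if absent."""
    record = GLOG_LINE.match(line.strip())
    if record is None:
        return None
    info = INSTANCE_INFO.search(record["message"])
    if info is None:
        return None
    return InstanceInfo(info["name"], info["rpc"], info["role"])


line = (
    "I20251118 21:12:34.040596 10905 disagg_pd_scheduler.cpp:106] "
    "Instance info: instance name = 127.0.1.1:8010, "
    "instance rpc_address = 127.0.1.1:7777, instance type = PREFILL"
)
prefill_info = parse_instance_info(line)
print(prefill_info)
```

Note that the engine reports itself as 127.0.1.1, while every address passed on the command line is 127.0.0.1.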
4. xllm decode engine startup
./build/xllm/core/server/xllm --model=/data2/ml/Qwen3-8B \
  --port=8020 \
  --devices="npu:1" \
  --master_node_addr="127.0.0.1:18898" \
  --enable_prefix_cache=false \
  --enable_chunked_prefill=false \
  --enable_disagg_pd=true \
  --instance_role=DECODE \
  --etcd_addr=127.0.0.1:2389 \
  --transfer_listen_port=26100 \
  --disagg_pd_port=7787 \
  --node_rank=0 \
  --nnodes=1
Key logs:
Disagg PD server started on address 127.0.1.1:7787, idle_timeout_sec: -1, num_threads: 32
I20251118 21:13:17.597756 11439 disagg_pd_scheduler.cpp:106] Instance info: instance name = 127.0.1.1:8020, instance rpc_address = 127.0.1.1:7787, instance type = DECODE
I20251118 21:13:17.599545 11439 xservice_client.cpp:186] Success register instance to etcd.
I20251118 21:13:17.627733 11439 jinja_chat_template.cpp:30] Jinja chat template init succeed.
I20251118 21:13:17.983762 11439 server.cpp:1200] Server[xllm::APIService] is serving on port=8020.
I20251118 21:13:17.983804 11439 server.cpp:1203] Check out http://:8020 in web browser.
I20251118 21:13:17.983811 11439 xllm_server.cpp:59] Brpc Server started on port 8020, idle_timeout_s: -1, num_threads: 32
5. xllm-server shows that the prefill and decode instances registered successfully
Xllm http server started on: 0.0.0.0:28888
I20251118 21:11:59.924466 10867 master.cpp:134] Xllm rpc server started on: 0.0.0.0:28889
W20251118 21:11:59.924471 10866 controller.cpp:1606] SIGINT was installed with 1
I20251118 21:12:34.042586 10827 instance_mgr.cpp:415] Register a new prefill instance, instance name : 127.0.1.1:8010
I20251118 21:13:17.599794 10827 instance_mgr.cpp:421] Register a new decode instance, instance name : 127.0.1.1:8020
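Before sending a request, it can help to confirm from the xllm-server log that both roles registered. The sketch below assumes the message wording is exactly as in the two `instance_mgr.cpp` lines above. The helper names `registered_instances` and `missing_roles` are hypothetical.

```python
import re

# Wording copied from the instance_mgr.cpp lines in this report.
REGISTER = re.compile(
    r"Register a new (?P<role>prefill|decode) instance, "
    r"instance name : (?P<name>\S+)"
)


def registered_instances(log_text: str) -> dict:
    """Map each role to the instance names the server logged as registered."""
    found = {"prefill": [], "decode": []}
    for match in REGISTER.finditer(log_text):
        found[match["role"]].append(match["name"])
    return found


def missing_roles(found: dict) -> list:
    """Return the roles that have no registered instance yet."""
    return [role for role, names in found.items() if not names]


server_log = """\
I20251118 21:12:34.042586 10827 instance_mgr.cpp:415] Register a new prefill instance, instance name : 127.0.1.1:8010
I20251118 21:13:17.599794 10827 instance_mgr.cpp:421] Register a new decode instance, instance name : 127.0.1.1:8020
"""
found = registered_instances(server_log)
print(found, "missing:", missing_roles(found))
```

For the log in this report both roles are present, so registration itself does not appear to be the failing step.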
6. Test
Image: quay.io/jd_xllm/xllm-ai:xllm-dev-hb-rc2-x86
One machine with two NPU cards; the four commands above all run in the same container.
curl http://localhost:28888/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-8B",
"messages": [
{"role": "system", "content": "你是一个数学家"},
{"role": "user", "content": "1+1=?"}
]
}'
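For scripted reproduction, the same request can be built with Python's standard library. This is a sketch only: the URL, model name, and messages are taken from the curl command above, while the assumption that the server replies with a JSON body is mine and is not confirmed by this report. `build_chat_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str,
                       system_prompt: str, user_prompt: str) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command in this report."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request; assumes (unverified) that the reply is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same values as the curl command; nothing is sent until send() is called.
request = build_chat_request(
    "http://localhost:28888", "Qwen3-8B", "你是一个数学家", "1+1=?"
)
print(request.get_method(), request.full_url)
```

Calling `send(request)` against the running xllm-server should trigger the same failure described below.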
The prefill engine reports the following error, and the service exits on its own:
*** Check failure stack trace: ***
@ 0x7cd04f google::LogMessage::SendToLog()
@ 0x7c99b2 google::LogMessage::Flush()
@ 0x7cd759 google::LogMessageFatal::~LogMessageFatal()
@ 0xc18cd7 xllm::DisaggPDScheduler::dispatch_requests()
@ 0x7f92716602cf execute_native_thread_routine
@ 0x7f9257664a6e (unknown)
@ 0x7f92576e3cbc (unknown)
@ (nil) (unknown)
Aborted (core dumped)
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
[ERROR] TBE Subprocess[task_distribute] raise error[], main process disappeared!
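The top three frames of the stack trace belong to glog's fatal-logging machinery, so the first frame outside `google::` is the likely site of the failed check. A minimal sketch for picking it out, assuming the `@ <address> <symbol>` layout shown above (`first_application_frame` is a hypothetical helper name):

```python
import re
from typing import Optional

# One frame per line: "@ <address> <symbol>", address may be "(nil)".
FRAME = re.compile(
    r"^\s*@\s+(?P<address>0x[0-9a-fA-F]+|\(nil\))\s+(?P<symbol>.+?)\s*$"
)


def first_application_frame(trace: str,
                            skip_prefixes: tuple = ("google::",)) -> Optional[str]:
    """Return the first resolved symbol that is not a logging-library frame."""
    for line in trace.splitlines():
        frame = FRAME.match(line)
        if frame is None:
            continue  # header or unrelated line
        symbol = frame["symbol"]
        if symbol == "(unknown)" or symbol.startswith(skip_prefixes):
            continue
        return symbol
    return None


trace = """\
*** Check failure stack trace: ***
@ 0x7cd04f google::LogMessage::SendToLog()
@ 0x7c99b2 google::LogMessage::Flush()
@ 0x7cd759 google::LogMessageFatal::~LogMessageFatal()
@ 0xc18cd7 xllm::DisaggPDScheduler::dispatch_requests()
@ 0x7f92716602cf execute_native_thread_routine
@ 0x7f9257664a6e (unknown)
@ (nil) (unknown)
"""
culprit = first_application_frame(trace)
print(culprit)
```

For this trace the result is `xllm::DisaggPDScheduler::dispatch_requests()`. The message of the failed check itself is not included in the output above.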
The decode engine reports the following error:
LinkCluster failed:
E20251118 21:24:28.535575 11474 llm_engine.cpp:632] Link cluster failed.
E20251118 21:24:28.535629 11474 disagg_pd_scheduler.cpp:1115] Link cluster failed!
E20251118 21:24:28.535640 11474 disagg_pd_service_impl.cpp:129] Link instance failed, instance name : 127.0.1.1:8010
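The last decode error names the peer it could not link to. Matching that name against the instances registered earlier shows which side of the link is involved. This is a sketch under the assumption that the message wording matches the `disagg_pd_service_impl.cpp` line above; `failed_link_peer` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Wording copied from the disagg_pd_service_impl.cpp line in this report.
LINK_FAILED = re.compile(r"Link instance failed, instance name : (?P<name>\S+)")


def failed_link_peer(decode_log: str, instances: dict) -> Optional[Tuple[str, str]]:
    """Return (instance name, role) of the peer the decode engine failed to link."""
    match = LINK_FAILED.search(decode_log)
    if match is None:
        return None
    name = match["name"]
    return name, instances.get(name, "unknown")


# Instance names and roles as registered in the xllm-server log.
instances = {"127.0.1.1:8010": "PREFILL", "127.0.1.1:8020": "DECODE"}
decode_log = """\
E20251118 21:24:28.535575 11474 llm_engine.cpp:632] Link cluster failed.
E20251118 21:24:28.535629 11474 disagg_pd_scheduler.cpp:1115] Link cluster failed!
E20251118 21:24:28.535640 11474 disagg_pd_service_impl.cpp:129] Link instance failed, instance name : 127.0.1.1:8010
"""
peer = failed_link_peer(decode_log, instances)
print(peer)
```

Here the peer is the prefill instance 127.0.1.1:8010, so the decode engine fails while linking to the prefill engine, and the prefill engine then aborts in `dispatch_requests()`.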