feat(inference): opt-in llm-d EPP+Envoy router (drop-in for vllm-router) by S1ro1 · Pull Request #2643 · PrimeIntellect-ai/prime-rl

S1ro1 · 2026-05-26T17:56:27Z

Summary

Adds deployment.router_backend = "llm-d" (default "vllm-router") on multi-node and disaggregated deployments. When opted in, the SLURM template launches the upstream llm-d EPP + Envoy sidecar (and pd-sidecar for P/D) instead of vllm-router. No Docker, no Kubernetes, no sudo.
Reuses upstream-vetted EPP scoring profiles verbatim: optimized-baseline for multi-node and pd-disaggregation (disagg-headers + prefix-based-pd-decider + disagg-profile-handler + prefill/decode filters + separate prefill/decode scheduling profiles) for P/D.
Wires the canonical llm-d pd-sidecar on each decode node: Envoy → sidecar on decode_sidecar_port → sidecar sends a max_tokens=1 prefill to a prefill worker, then forwards the actual decode (with do_remote_prefill=true + kv_transfer_params) to local vLLM. Decode then pulls KV from prefill via NIXL.
scripts/install_llmd.sh bootstraps a local Go toolchain, go installs the EPP and pd-sidecar from github.com/llm-d/llm-d-router, and pulls Envoy 1.36 from archive.tetratelabs.io into third_party/llmd/bin/.

Known limitation — renderer / routed-experts unsupported

llm-d's EPP openai-parser understands only OpenAI-format requests (/v1/chat/completions, /v1/completions). prime-rl's renderer/TITO client and routed-experts replay both POST to /inference/v1/generate with the raw-tokens schema (prompt_token_ids), which the EPP rejects with BadRequest - invalid completions request: must have prompt field. Until upstream llm-d adds raw-tokens support, an RLConfig validator blocks the combination at config-load time with a clear error pointing users to vllm-router. Tracked for follow-up with the llm-d maintainers.
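The config-load check described above can be pictured with a minimal sketch. It is not the actual `RLConfig` validator: the class `RouterCompatSketch` and its flattened field layout are hypothetical, and only the option names quoted in the comments come from this PR.

```python
from dataclasses import dataclass
from typing import Literal

RouterBackend = Literal["vllm-router", "llm-d"]


@dataclass
class RouterCompatSketch:
    """Hypothetical flattened view of the options the validator inspects."""

    router_backend: RouterBackend = "vllm-router"
    use_renderer: bool = False                  # orchestrator.use_renderer
    enable_return_routed_experts: bool = False  # inference.enable_return_routed_experts
    enable_router_replay: bool = False          # trainer.enable_router_replay

    def __post_init__(self) -> None:
        if self.router_backend != "llm-d":
            return  # the restriction only applies to the llm-d backend
        blocked = [
            name
            for name, enabled in (
                ("orchestrator.use_renderer", self.use_renderer),
                ("inference.enable_return_routed_experts", self.enable_return_routed_experts),
                ("trainer.enable_router_replay", self.enable_router_replay),
            )
            if enabled
        ]
        if blocked:
            raise ValueError(
                "router_backend = 'llm-d' does not support "
                + ", ".join(blocked)
                + " (these post raw tokens to /inference/v1/generate); "
                "use router_backend = 'vllm-router' instead."
            )
```

Failing at construction time mirrors the intent stated above: the unsupported combination is rejected before any SLURM job is submitted.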

Validated on SLURM (Qwen3-30B-A3B, dp=8, 1+1 multi-node and 1+1+1 P/D)

| Path | Result |
| --- | --- |
| `router_backend = "llm-d"` + multi-node + MITO | ✅ all 5 steps, 257 POSTs to `/v1/chat/completions`, loss=0.0217 |
| `router_backend = "llm-d"` + P/D + MITO (with pd-sidecar) | ✅ canonical PD lifecycle: Envoy → sidecar → prefill+decode split; POST counts balance across both vLLMs (prefill ~31, decode ~36 per ~30 inference requests = 1 prefill + 1 decode each). Step 0 → Step 1 completed cleanly. |
| `router_backend = "llm-d"` + renderer / routed-experts | ❌ blocked at config-validation (intentional) |
| `router_backend = "vllm-router"` (default) | unchanged |

Files

New: scripts/install_llmd.sh, scripts/write_llmd_configs.sh
Config: packages/prime-rl-configs/src/prime_rl/configs/{inference,rl}.py (new router_backend, decode_sidecar_port, validator)
Plumbing: src/prime_rl/entrypoints/{inference,rl}.py
Templates: src/prime_rl/templates/{inference,multi_node_rl}.sbatch.j2 — branch on router_backend; the llm-d branch invokes write_llmd_configs.sh and launches epp/envoy (router) + pd-sidecar (each decode node, PD only).
Docs: docs/{slurm,disaggregated-inference,logging}.md, skills/training/start-run/SKILL.md

🤖 Generated with Claude Code

Adds `router_backend: Literal["vllm-router", "llm-d"]` to both the RL `MultiNodeDeploymentConfig` and the per-inference `MultiNodeInferenceDeploymentConfig` / `DisaggregatedInferenceDeploymentConfig`. Default stays `vllm-router` so existing experiments are untouched; `llm-d` opts into the upstream llm-d EPP+Envoy stack rendered by `scripts/write_llmd_configs.sh`.

An `RLConfig` validator rejects `router_backend = "llm-d"` together with `orchestrator.use_renderer = true` or routed-experts replay (`inference.enable_return_routed_experts` / `trainer.enable_router_replay`). Both features post to `/inference/v1/generate` with prime-rl's raw-tokens schema (`prompt_token_ids`), which llm-d's EPP openai-parser rejects with `BadRequest - invalid completions request: must have prompt field`. Until upstream llm-d adds raw-tokens support, those paths must stay on `vllm-router`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d=llm-d

Both `inference.sbatch.j2` and `multi_node_rl.sbatch.j2` branch on the new `router_backend` template var. The `llm-d` branch invokes `scripts/write_llmd_configs.sh` to render `endpoints.yaml`, `epp.yaml`, and `envoy.yaml` under `$OUTPUT_DIR/logs/inference/llmd[_$REPLICA_IDX]/`, then launches `epp` and `envoy` in the background so SLURM's `wait -n` still catches a router crash.

EPP scoring profiles lifted verbatim from upstream llm-d:

- multi_node: optimized-baseline (queue / kv-cache-util / prefix-cache / max-score-picker; single-profile-handler auto-enabled).
- disaggregated: pd-disaggregation (disagg-headers + always-pd-decider + disagg-profile-handler + prefill/decode filters; separate `prefill` and `decode` schedulingProfiles).

Envoy config uses ORIGINAL_DST + ext_proc to let the EPP pick a backend per request via the `x-gateway-destination-endpoint` header. Routes `/v1/` and `/inference/v1/` are declared (the validator already blocks the renderer path), and `processing_mode` matches the upstream test config (FULL_DUPLEX_STREAMED body + SEND headers/trailers, message_timeout=1000s).

`scripts/install_llmd.sh` bootstraps a local Go toolchain, `go install`s the EPP from github.com/llm-d/llm-d-router, and fetches Envoy 1.36.0 from archive.tetratelabs.io into `third_party/llmd/bin/` (which is gitignored). No Docker, no sudo.

The cleanup heredoc in both templates also kills stale `epp` / `envoy` processes between runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- slurm.md: new "Router backend" section covering the `router_backend = "llm-d"` toggle, install steps, and where EPP/Envoy log files live. Calls out that the validator blocks renderer/routed-experts paths.
- disaggregated-inference.md: notes the pd-disaggregation EPP profile we use and the renderer/TITO unsupported caveat.
- logging.md: documents the new `llmd_<replica>/` directory containing the generated EPP/Envoy/endpoints YAMLs.
- skills/training/start-run/SKILL.md: brief pointer to the new opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without the pd-sidecar the previous P/D path was effectively broken: EPP picked a decode pod, set `x-prefiller-host-port`, and the request landed on plain vLLM, which ignores the header. Result: everything ran locally on the decode worker and the prefill pool sat idle.

Per the [llm-d disagg architecture](https://github.com/llm-d/llm-d-router/blob/main/docs/disaggregation.md), the decode worker must run vLLM **behind a pd-sidecar** that:

1. Receives the inference request (Envoy → sidecar on `DECODE_SIDECAR_PORT`).
2. Reads `x-prefiller-host-port` set by the EPP's `disagg-profile-handler`.
3. Sends a prefill request (`max_tokens=1`) to the prefill worker.
4. Forwards the actual decode (with `do_remote_prefill=true` + the prefill's `kv_transfer_params`) to local vLLM. Decode then pulls KV from prefill via NIXL.

Changes:

- `install_llmd.sh` also `go install`s `cmd/pd-sidecar` and downloads Envoy 1.36 directly (func-e's GH asset URL is broken; older Envoy also lacks `FULL_DUPLEX_STREAMED` ext_proc body mode).
- New `inference.deployment.decode_sidecar_port` config field (default 8300) plumbed through entrypoints to the SLURM templates.
- `write_llmd_configs.sh` accepts `--decode-sidecar-port N`; when set, decode endpoints in `endpoints.yaml` use the sidecar port (not vLLM) so EPP's decode-filter routes traffic through the sidecar.
- EPP decider switched from `always-disagg-pd-decider` (testing-only) to `prefix-based-pd-decider` with `nonCachedTokens: 8`, which is canonical per the upstream P/D guide.
- Both SLURM templates now launch `pd-sidecar` on each decode node with `--kv-connector nixlv2`, `--secure-proxy=false`, `--enable-ssrf-protection=false` (skips the K8s InferencePool allowlist; we don't need it in standalone). Cleanup pkill targets added so stale sidecars don't survive between runs.

Verified on SLURM (Qwen3-30B-A3B, dp=8, 1+2 nodes): Envoy's backend cluster shows traffic going to the sidecar port, and POST counts split between prefill (~31) and decode (~36) vLLMs, i.e. one prefill + one decode call per inference, the canonical PD lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
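The sidecar's request handling (steps 2 to 4 above) can be sketched as pure functions over request bodies. This is an illustration only, not the pd-sidecar's code: the function names are hypothetical, placing `do_remote_prefill` inside `kv_transfer_params` is an assumption, and the real sidecar likely sets additional fields that are not modelled here.

```python
import copy
from typing import Any, Dict, Optional


def prefill_target(headers: Dict[str, str]) -> Optional[str]:
    """Step 2: read the prefill worker chosen by the EPP, if any."""
    return headers.get("x-prefiller-host-port")


def build_prefill_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Step 3: same request, but capped at one generated token."""
    prefill = copy.deepcopy(request)
    prefill["max_tokens"] = 1
    return prefill


def build_decode_request(
    request: Dict[str, Any], prefill_response: Dict[str, Any]
) -> Dict[str, Any]:
    """Step 4: original request plus the prefill's KV transfer parameters."""
    decode = copy.deepcopy(request)
    # Assumption: the flag travels alongside the prefill's transfer params.
    params = dict(prefill_response.get("kv_transfer_params") or {})
    params["do_remote_prefill"] = True
    decode["kv_transfer_params"] = params
    return decode
```

Copying the request rather than mutating it keeps the caller's `max_tokens` intact for the decode call, which is the one that produces the user-visible completion.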

…otal

The sidecar opens one proxy port per local DP rank, so it needs the intra-node DP count (same value vllm-router gets as ``--intra-node-data-parallel-size``), not the cross-node total.

- inference.sbatch.j2: $EP is NODES_PER_DECODE_REPLICA * GPUS_PER_NODE (cross-node), so switch to the templated ``{{ dp_per_node }}``.
- multi_node_rl.sbatch.j2: $DP is NODES_PER_DECODE_REPLICA * INFERENCE_DP_LOCAL (cross-node), so switch to $INFERENCE_DP_LOCAL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
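A small worked example of the distinction, with made-up sizes (2 nodes per decode replica, 8 local DP ranks); the function name is hypothetical and the arithmetic simply restates the two template expressions above.

```python
from typing import Tuple


def dp_counts(nodes_per_decode_replica: int, dp_local: int) -> Tuple[int, int]:
    """Return (cross_node_total, intra_node_count) for one decode replica."""
    cross_node_total = nodes_per_decode_replica * dp_local  # what $DP / $EP held
    intra_node_count = dp_local                             # what the sidecar needs
    return cross_node_total, intra_node_count


total, local = dp_counts(nodes_per_decode_replica=2, dp_local=8)
# Passing `total` (16) would make each sidecar open proxy ports for ranks
# that do not live on its node; `local` (8) matches the ranks actually present.
```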

Previously endpoints.yaml had one entry per backend node, so EPP saw only the rank-0 port (8200 for vLLM / 8300 for pd-sidecar). The other ranks were unreachable from the router; all decode traffic pinned to rank 0 and the per-node DP fan-out the sidecar opened (8301-8307) sat idle.

Fix: `write_llmd_configs.sh` now takes `--dp-size N` and emits N endpoints per backend (ports `base, base+1, …, base+N-1`). vLLM with `--api-server-count=N` binds consecutive ports per rank; the pd-sidecar mirrors them on `primary_port + i`. The EPP's existing queue-scorer and kv-cache-utilization-scorer (plus prefix-cache) then load-balance across ranks naturally; no custom dp-profile-handler / X-Data-Parallel-Endpoint plumbing is needed.

Both SLURM templates pass `--dp-size` (`{{ dp_per_node }}` in inference.sbatch.j2, `$INFERENCE_DP_LOCAL` in multi_node_rl.sbatch.j2; this is the same value vllm-router gets as `--intra-node-data-parallel-size`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
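The `--dp-size` expansion amounts to the following sketch. It only models the host and port arithmetic; the function name is hypothetical and the actual `endpoints.yaml` schema written by `write_llmd_configs.sh` is not reproduced here.

```python
from typing import List


def expand_endpoints(hosts: List[str], base_port: int, dp_size: int) -> List[str]:
    """Emit dp_size consecutive ports (base, base+1, ...) for every backend host."""
    return [
        f"{host}:{base_port + rank}"
        for host in hosts
        for rank in range(dp_size)
    ]


# With dp_size=1 this degenerates to the old behaviour: one entry per node.
decode_endpoints = expand_endpoints(["decode-node-0"], base_port=8300, dp_size=8)
```

For a single decode node with base port 8300 and 8 ranks, the last entry is `decode-node-0:8307`, matching the 8301-8307 fan-out mentioned above.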

…dpoints" This reverts commit a2ec370.

Switches the llm-d SLURM branch to vLLM's external load balancing (https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/#external-load-balancing): each DP rank runs as its own vLLM process on a consecutive port, with its own GPU and NIXL side-channel port. The EPP can then list all per-rank endpoints in endpoints.yaml and distribute via its existing scorers; no custom dp-profile-handler is needed, and the ports actually exist (previous attempt hit SO_REUSEPORT on a single base port).

Concretely:

- multi_node_rl.sbatch.j2 (both PD + multi_node llm-d branches): loop over ``INFERENCE_DP_LOCAL`` ranks; each ``uv run inference`` gets ``CUDA_VISIBLE_DEVICES=$k``, ``VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600+k))``, ``--server.port=$((PORT+k))``, ``--data-parallel-rank=$GLOBAL_RANK``, ``--data-parallel-size-local 1``, ``--api-server-count 1``. The per-rank ``vllm-extra`` is built fresh (no ``data_parallel_hybrid_lb``, which is incompatible with external LB), including ``data_parallel_address``, ``kv_transfer_config``, and the role's all2all backend.
- inference.sbatch.j2 PD branch: same loop pattern.
- ``ADMIN_URLS`` fans out per rank (``base+0`` through ``base+N-1``) so ``init_broadcaster`` / ``/pause`` / ``/update_weights`` hit every rank (``gpus_per_server`` collapses to 1 in client.py's NCCL setup).
- write_llmd_configs.sh ``--dp-size N``: endpoints.yaml emits N entries per backend with consecutive ports, matching the per-rank vLLM ports (and the pd-sidecar's fan-out to ports 8300..8300+N-1).
- rl.py ``auto_setup_inference_client``: force ``dp_rank_count = 1`` when router_backend = 'llm-d'. With it >1 the orchestrator pre-pins each request to ``X-data-parallel-rank=N`` (vllm-router's contract); with llm-d Envoy load-balances freely across the per-rank endpoints, so a request can land on a vLLM whose local rank ≠ the header value and fail the rank-range check (``data_parallel_rank N is out of range [k, k+1)``).

EPP-side session affinity can be added later via the ``session-affinity-scorer`` if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
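The per-rank launch loop and the ``ADMIN_URLS`` fan-out can be sketched as below. This is an illustration of the port and device arithmetic only, not the template's shell code: the names are hypothetical, and computing the global rank as ``node_index * dp_local + k`` is an assumption about how ``$GLOBAL_RANK`` is derived.

```python
from dataclasses import dataclass
from typing import List

NIXL_SIDE_CHANNEL_BASE = 5600  # base used for VLLM_NIXL_SIDE_CHANNEL_PORT above


@dataclass(frozen=True)
class RankLaunch:
    cuda_visible_devices: str    # one GPU per vLLM process
    nixl_side_channel_port: int  # 5600 + k
    server_port: int             # PORT + k
    data_parallel_rank: int      # global rank across the replica's nodes


def plan_rank_launches(node_index: int, dp_local: int, base_port: int) -> List[RankLaunch]:
    """One independent vLLM process per local DP rank (external load balancing)."""
    return [
        RankLaunch(
            cuda_visible_devices=str(k),
            nixl_side_channel_port=NIXL_SIDE_CHANNEL_BASE + k,
            server_port=base_port + k,
            data_parallel_rank=node_index * dp_local + k,  # assumed formula
        )
        for k in range(dp_local)
    ]


def admin_urls(host: str, base_port: int, dp_local: int) -> List[str]:
    """Every rank is its own server, so admin calls must reach each port."""
    return [f"http://{host}:{base_port + k}" for k in range(dp_local)]
```

Because each process owns exactly one rank, a request carrying a pinned rank header would only be valid on one of these ports, which is why the orchestrator stops pinning ranks under llm-d.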

S1ro1 and others added 8 commits May 26, 2026 23:24

Revert "feat(llm-d): EPP-side intra-node DP awareness via per-rank en…

c47d30d


S1ro1 wants to merge 8 commits into main from llm-d
