Adds `router_backend: Literal["vllm-router", "llm-d"]` to both the RL `MultiNodeDeploymentConfig` and the per-inference `MultiNodeInferenceDeploymentConfig` / `DisaggregatedInferenceDeploymentConfig`. Default stays `vllm-router` so existing experiments are untouched; `llm-d` opts into the upstream llm-d EPP+Envoy stack rendered by `scripts/write_llmd_configs.sh`.

An `RLConfig` validator rejects `router_backend = "llm-d"` together with `orchestrator.use_renderer = true` or routed-experts replay (`inference.enable_return_routed_experts` / `trainer.enable_router_replay`). Both features post to `/inference/v1/generate` with prime-rl's raw-tokens schema (`prompt_token_ids`), which llm-d's EPP openai-parser rejects with `BadRequest - invalid completions request: must have prompt field`. Until upstream llm-d adds raw-tokens support, those paths must stay on `vllm-router`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
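The validator rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: the real check lives on `RLConfig`, while `RouterCheckInput` and `validate_router_backend` are hypothetical names that flatten the three flags named in the commit message into one object.

```python
from dataclasses import dataclass
from typing import List, Literal

RouterBackend = Literal["vllm-router", "llm-d"]


@dataclass
class RouterCheckInput:
    """Hypothetical flattened view of the fields the validator inspects."""

    router_backend: RouterBackend = "vllm-router"
    use_renderer: bool = False                  # orchestrator.use_renderer
    enable_return_routed_experts: bool = False  # inference.enable_return_routed_experts
    enable_router_replay: bool = False          # trainer.enable_router_replay


def validate_router_backend(cfg: RouterCheckInput) -> None:
    """Raise ValueError if llm-d is combined with a raw-tokens feature."""
    if cfg.router_backend != "llm-d":
        return  # vllm-router accepts every combination

    # Collect every offending flag so the error names all of them at once.
    blocked: List[str] = []
    if cfg.use_renderer:
        blocked.append("orchestrator.use_renderer")
    if cfg.enable_return_routed_experts:
        blocked.append("inference.enable_return_routed_experts")
    if cfg.enable_router_replay:
        blocked.append("trainer.enable_router_replay")

    if blocked:
        raise ValueError(
            'router_backend = "llm-d" does not support '
            + ", ".join(blocked)
            + '; use router_backend = "vllm-router" instead'
        )
```

Failing at config-load time keeps the error next to the setting that caused it, instead of surfacing later as a `BadRequest` from the EPP.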
…d=llm-d
Both inference.sbatch.j2 and multi_node_rl.sbatch.j2 branch on the new
`router_backend` template var. The `llm-d` branch invokes
`scripts/write_llmd_configs.sh` to render `endpoints.yaml`, `epp.yaml`, and
`envoy.yaml` under `$OUTPUT_DIR/logs/inference/llmd[_$REPLICA_IDX]/`, then
launches `epp` and `envoy` in the background so SLURM's `wait -n` still
catches a router crash.
EPP scoring profiles lifted verbatim from upstream llm-d:
- multi_node: optimized-baseline (queue / kv-cache-util / prefix-cache /
max-score-picker; single-profile-handler auto-enabled).
- disaggregated: pd-disaggregation (disagg-headers + always-pd-decider +
disagg-profile-handler + prefill/decode filters; separate `prefill` and
`decode` schedulingProfiles).
Envoy config uses ORIGINAL_DST + ext_proc to let the EPP pick a backend per
request via the `x-gateway-destination-endpoint` header. Routes `/v1/` and
`/inference/v1/` are declared (the validator already blocks the renderer
path), and `processing_mode` matches the upstream test config
(FULL_DUPLEX_STREAMED body + SEND headers/trailers, message_timeout=1000s).
`scripts/install_llmd.sh` bootstraps a local Go toolchain, `go install`s
the EPP from github.com/llm-d/llm-d-router, and fetches Envoy 1.36.0 from
archive.tetratelabs.io into `third_party/llmd/bin/` (which is gitignored).
No Docker, no sudo.
The cleanup heredoc in both templates also kills stale `epp` / `envoy`
processes between runs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- slurm.md: new "Router backend" section covering the `router_backend = "llm-d"` toggle, install steps, and where EPP/Envoy log files live. Calls out that the validator blocks renderer/routed-experts paths.
- disaggregated-inference.md: notes the pd-disaggregation EPP profile we use and the renderer/TITO unsupported caveat.
- logging.md: documents the new `llmd_<replica>/` directory containing the generated EPP/Envoy/endpoints YAMLs.
- skills/training/start-run/SKILL.md: brief pointer to the new opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without the pd-sidecar the previous P/D path was effectively broken: EPP picked a decode pod, set `x-prefiller-host-port`, and the request landed on plain vLLM, which ignores the header. Result: everything ran locally on the decode worker and the prefill pool sat idle.

Per the [llm-d disagg architecture](https://github.com/llm-d/llm-d-router/blob/main/docs/disaggregation.md), the decode worker must run vLLM **behind a pd-sidecar** that:

1. Receives the inference request (Envoy → sidecar on `DECODE_SIDECAR_PORT`).
2. Reads `x-prefiller-host-port` set by the EPP's `disagg-profile-handler`.
3. Sends a prefill request (`max_tokens=1`) to the prefill worker.
4. Forwards the actual decode (with `do_remote_prefill=true` + the prefill's `kv_transfer_params`) to local vLLM. Decode then pulls KV from prefill via NIXL.

Changes:

- `install_llmd.sh` also `go install`s `cmd/pd-sidecar` and downloads Envoy 1.36 directly (func-e's GH asset URL is broken; older Envoy also lacks `FULL_DUPLEX_STREAMED` ext_proc body mode).
- New `inference.deployment.decode_sidecar_port` config field (default 8300) plumbed through entrypoints to the SLURM templates.
- `write_llmd_configs.sh` accepts `--decode-sidecar-port N`; when set, decode endpoints in `endpoints.yaml` use the sidecar port (not vLLM) so EPP's decode-filter routes traffic through the sidecar.
- EPP decider switched from `always-disagg-pd-decider` (testing-only) to `prefix-based-pd-decider` with `nonCachedTokens: 8`, the canonical choice per the upstream P/D guide.
- Both SLURM templates now launch `pd-sidecar` on each decode node with `--kv-connector nixlv2`, `--secure-proxy=false`, `--enable-ssrf-protection=false` (skips the K8s InferencePool allowlist; we don't need it in standalone). Cleanup pkill targets added so stale sidecars don't survive between runs.

Verified on SLURM (Qwen3-30B-A3B, dp=8, 1+2 nodes): Envoy's backend cluster shows traffic going to the sidecar port, and POST counts split between prefill (~31) and decode (~36) vLLMs, i.e. one prefill + one decode call per inference, the canonical PD lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
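The four sidecar steps listed in this commit can be modeled roughly as follows. The real pd-sidecar is a Go binary from llm-d; this Python sketch only mirrors the request flow. `handle_decode_request` and the injected `post` callable are hypothetical, and the fallback when the header is absent (serve the request locally) is an assumption, not something the commit states. Streaming and error handling are omitted.

```python
from typing import Any, Callable, Dict, Mapping

Json = Dict[str, Any]
# (base_url, request_body) -> response JSON; injected so the sketch does no real I/O.
PostFn = Callable[[str, Json], Json]


def handle_decode_request(
    body: Json,
    headers: Mapping[str, str],
    local_vllm_url: str,
    post: PostFn,
) -> Json:
    """Model of the pd-sidecar flow on a decode node."""
    # Step 2: the EPP's disagg-profile-handler names the prefill worker.
    prefiller = headers.get("x-prefiller-host-port")
    if not prefiller:
        # Assumption: with no prefiller chosen, local vLLM does prefill + decode.
        return post(local_vllm_url, dict(body))

    # Step 3: one-token prefill on the remote prefill worker.
    prefill_resp = post(f"http://{prefiller}", {**body, "max_tokens": 1})

    # Step 4: forward the real decode, carrying the prefill's KV transfer params.
    kv_params = dict(prefill_resp.get("kv_transfer_params") or {})
    kv_params["do_remote_prefill"] = True
    return post(local_vllm_url, {**body, "kv_transfer_params": kv_params})
```

Two POSTs per inference (one to the prefill worker, one to local vLLM) is consistent with the roughly even prefill/decode POST counts reported in the verification note.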
…otal
The sidecar opens one proxy port per local DP rank, so it needs the
intra-node DP count (same value vllm-router gets as
``--intra-node-data-parallel-size``), not the cross-node total.
- inference.sbatch.j2: $EP is NODES_PER_DECODE_REPLICA * GPUS_PER_NODE
(cross-node), so switch to the templated ``{{ dp_per_node }}``.
- multi_node_rl.sbatch.j2: $DP is NODES_PER_DECODE_REPLICA *
INFERENCE_DP_LOCAL (cross-node), so switch to $INFERENCE_DP_LOCAL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
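A small worked example of the difference between the two counts, assuming the sidecar's primary port is the default `decode_sidecar_port` of 8300. `sidecar_proxy_ports` is a hypothetical helper used only to show which ports each value would imply.

```python
from typing import List


def sidecar_proxy_ports(primary_port: int, dp_count: int) -> List[int]:
    """One proxy port per DP rank, starting at the sidecar's primary port."""
    return [primary_port + i for i in range(dp_count)]


nodes_per_decode_replica = 2
dp_local = 8  # intra-node DP count, i.e. the value the sidecar needs

# Passing the cross-node total would ask for 16 ports on a node with 8 ranks.
wrong = sidecar_proxy_ports(8300, nodes_per_decode_replica * dp_local)
# Passing the intra-node count yields exactly one port per local rank.
right = sidecar_proxy_ports(8300, dp_local)

print(len(wrong), wrong[-1])  # 16 8315
print(len(right), right[-1])  # 8 8307
```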
Previously endpoints.yaml had one entry per backend node, so EPP saw only
the rank-0 port (8200 for vLLM / 8300 for pd-sidecar). The other ranks
were unreachable from the router; all decode traffic pinned to rank 0
and the per-node DP fan-out the sidecar opened (8301-8307) sat idle.
Fix: `write_llmd_configs.sh` now takes `--dp-size N` and emits N
endpoints per backend (ports `base, base+1, …, base+N-1`). vLLM with
`--api-server-count=N` binds consecutive ports per rank; the pd-sidecar
mirrors them on `primary_port + i`. The EPP's existing queue-scorer and
kv-cache-utilization-scorer (plus prefix-cache) then load-balance across
ranks naturally — no custom dp-profile-handler / X-Data-Parallel-Endpoint
plumbing needed.
Both SLURM templates pass `--dp-size` (`{{ dp_per_node }}` in
inference.sbatch.j2, `$INFERENCE_DP_LOCAL` in multi_node_rl.sbatch.j2 —
the same value vllm-router gets as `--intra-node-data-parallel-size`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
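The `--dp-size N` expansion amounts to a cross product of backend hosts and consecutive ports. A minimal sketch, assuming backends are identified by hostname; `expand_endpoints` is a hypothetical name, the real logic lives in the shell script `write_llmd_configs.sh`, and the actual `endpoints.yaml` schema is not reproduced here.

```python
from typing import List


def expand_endpoints(hosts: List[str], base_port: int, dp_size: int) -> List[str]:
    """Emit dp_size endpoints per backend on ports base, base+1, ..., base+N-1."""
    return [
        f"{host}:{base_port + rank}"
        for host in hosts
        for rank in range(dp_size)
    ]


# Two backend nodes with two local DP ranks each, vLLM base port 8200.
print(expand_endpoints(["node-a", "node-b"], 8200, 2))
# ['node-a:8200', 'node-a:8201', 'node-b:8200', 'node-b:8201']
```

With `dp_size=1` this reduces to the previous behaviour of one entry per backend node.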
…dpoints" This reverts commit a2ec370.
Switches the llm-d SLURM branch to vLLM's external load balancing (https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/#external-load-balancing): each DP rank runs as its own vLLM process on a consecutive port, with its own GPU and NIXL side-channel port. The EPP can then list all per-rank endpoints in endpoints.yaml and distribute via its existing scorers. No custom dp-profile-handler is needed, and the ports actually exist (the previous attempt hit SO_REUSEPORT on a single base port).

Concretely:

- multi_node_rl.sbatch.j2 (both PD + multi_node llm-d branches): loop over ``INFERENCE_DP_LOCAL`` ranks; each ``uv run inference`` gets ``CUDA_VISIBLE_DEVICES=$k``, ``VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600+k))``, ``--server.port=$((PORT+k))``, ``--data-parallel-rank=$GLOBAL_RANK``, ``--data-parallel-size-local 1``, ``--api-server-count 1``. The per-rank ``vllm-extra`` is built fresh (no ``data_parallel_hybrid_lb``, which is incompatible with external LB), including ``data_parallel_address``, ``kv_transfer_config``, and the role's all2all backend.
- inference.sbatch.j2 PD branch: same loop pattern.
- ``ADMIN_URLS`` fans out per rank (``base+0`` through ``base+N-1``) so ``init_broadcaster`` / ``/pause`` / ``/update_weights`` hit every rank (``gpus_per_server`` collapses to 1 in client.py's NCCL setup).
- write_llmd_configs.sh ``--dp-size N``: endpoints.yaml emits N entries per backend with consecutive ports, matching the per-rank vLLM ports (and the pd-sidecar's fan-out to ports 8300..8300+N-1).
- rl.py ``auto_setup_inference_client``: force ``dp_rank_count = 1`` when router_backend = 'llm-d'. With it >1 the orchestrator pre-pins each request to ``X-data-parallel-rank=N`` (vllm-router's contract); with llm-d Envoy load-balances freely across the per-rank endpoints, so a request can land on a vLLM whose local rank ≠ the header value and fail the rank-range check (``data_parallel_rank N is out of range [k, k+1)``).

EPP-side session affinity can be added later via the ``session-affinity-scorer`` if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
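The per-rank launch loop can be pictured as the plan below. It is a sketch of the parameters listed in this commit, not the template itself: `plan_rank_launches` and `RankLaunch` are hypothetical, and the formula `global_rank = node_index * dp_local + k` is an assumption, since the commit only says each process receives ``$GLOBAL_RANK``.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RankLaunch:
    env: Dict[str, str]   # extra environment for this rank's process
    args: List[str]       # extra CLI flags for `uv run inference`


def plan_rank_launches(
    node_index: int,
    dp_local: int,
    base_port: int,
    nixl_base_port: int = 5600,
) -> List[RankLaunch]:
    """One vLLM process per local DP rank, each with its own GPU and ports."""
    launches: List[RankLaunch] = []
    for k in range(dp_local):
        # Assumption: ranks are numbered node-major across the replica.
        global_rank = node_index * dp_local + k
        launches.append(
            RankLaunch(
                env={
                    "CUDA_VISIBLE_DEVICES": str(k),
                    "VLLM_NIXL_SIDE_CHANNEL_PORT": str(nixl_base_port + k),
                },
                args=[
                    f"--server.port={base_port + k}",
                    f"--data-parallel-rank={global_rank}",
                    "--data-parallel-size-local", "1",
                    "--api-server-count", "1",
                ],
            )
        )
    return launches


# Second node (index 1) of a replica with 8 local ranks, base port 8200.
plan = plan_rank_launches(node_index=1, dp_local=8, base_port=8200)
print(plan[3].env["VLLM_NIXL_SIDE_CHANNEL_PORT"], plan[3].args[:2])
# 5603 ['--server.port=8203', '--data-parallel-rank=11']
```

Because every rank is an independent server with ``--data-parallel-size-local 1``, each one accepts only its own rank, which is why the orchestrator must not pre-pin requests to a rank header under llm-d.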
Summary

- `deployment.router_backend = "llm-d"` (default `"vllm-router"`) on multi-node and disaggregated deployments. When opted in, the SLURM template launches the upstream llm-d EPP + Envoy sidecar (and `pd-sidecar` for P/D) instead of `vllm-router`. No Docker, no Kubernetes, no sudo.
- `optimized-baseline` for multi-node and `pd-disaggregation` (disagg-headers + `prefix-based-pd-decider` + disagg-profile-handler + prefill/decode filters + separate `prefill` / `decode` scheduling profiles) for P/D.
- `decode_sidecar_port` → sidecar sends a `max_tokens=1` prefill to a prefill worker, then forwards the actual decode (with `do_remote_prefill=true` + `kv_transfer_params`) to local vLLM. Decode then pulls KV from prefill via NIXL.
- `scripts/install_llmd.sh` bootstraps a local Go toolchain, `go install`s the EPP and pd-sidecar from `github.com/llm-d/llm-d-router`, and pulls Envoy 1.36 from `archive.tetratelabs.io` into `third_party/llmd/bin/`.

Known limitation: renderer / routed-experts unsupported

llm-d's EPP `openai-parser` understands only OpenAI-format requests (`/v1/chat/completions`, `/v1/completions`). prime-rl's renderer/TITO client and routed-experts replay both POST to `/inference/v1/generate` with the raw-tokens schema (`prompt_token_ids`), which the EPP rejects with `BadRequest - invalid completions request: must have prompt field`. Until upstream llm-d adds raw-tokens support, an `RLConfig` validator blocks the combination at config-load time with a clear error pointing users to `vllm-router`. Tracked for follow-up with the llm-d maintainers.

Validated on SLURM (Qwen3-30B-A3B, dp=8, 1+1 multi-node and 1+1+1 P/D)

- `router_backend = "llm-d"` + multi-node + MITO: `/v1/chat/completions`, loss=0.0217
- `router_backend = "llm-d"` + P/D + MITO (with pd-sidecar)
- `router_backend = "llm-d"` + renderer / routed-experts
- `router_backend = "vllm-router"` (default)

Files

- `scripts/install_llmd.sh`, `scripts/write_llmd_configs.sh`
- `packages/prime-rl-configs/src/prime_rl/configs/{inference,rl}.py` (new `router_backend`, `decode_sidecar_port`, validator)
- `src/prime_rl/entrypoints/{inference,rl}.py`
- `src/prime_rl/templates/{inference,multi_node_rl}.sbatch.j2`: branch on `router_backend`; the `llm-d` branch invokes `write_llmd_configs.sh` and launches `epp` / `envoy` (router) + `pd-sidecar` (each decode node, PD only).
- `docs/{slurm,disaggregated-inference,logging}.md`, `skills/training/start-run/SKILL.md`

🤖 Generated with Claude Code