feat(inference): opt-in llm-d EPP+Envoy router (drop-in for vllm-router)#2643

Draft
S1ro1 wants to merge 8 commits into main from llm-d

S1ro1 (Collaborator) commented May 26, 2026

Summary

  • Adds deployment.router_backend = "llm-d" (default "vllm-router") on multi-node and disaggregated deployments. When opted in, the SLURM template launches the upstream llm-d EPP + Envoy sidecar (and pd-sidecar for P/D) instead of vllm-router. No Docker, no Kubernetes, no sudo.
  • Reuses upstream-vetted EPP scoring profiles verbatim: optimized-baseline for multi-node, and pd-disaggregation for P/D (disagg-headers + prefix-based-pd-decider + disagg-profile-handler + prefill/decode filters + separate prefill/decode scheduling profiles).
  • Wires the canonical llm-d pd-sidecar on each decode node: Envoy → sidecar on decode_sidecar_port → sidecar sends a max_tokens=1 prefill to a prefill worker, then forwards the actual decode (with do_remote_prefill=true + kv_transfer_params) to local vLLM. Decode then pulls KV from prefill via NIXL.
  • scripts/install_llmd.sh bootstraps a local Go toolchain, runs go install for the EPP and pd-sidecar from github.com/llm-d/llm-d-router, and pulls Envoy 1.36 from archive.tetratelabs.io into third_party/llmd/bin/.

Known limitation — renderer / routed-experts unsupported

llm-d's EPP openai-parser understands only OpenAI-format requests (/v1/chat/completions, /v1/completions). prime-rl's renderer/TITO client and routed-experts replay both POST to /inference/v1/generate with the raw-tokens schema (prompt_token_ids), which the EPP rejects with BadRequest - invalid completions request: must have prompt field. Until upstream llm-d adds raw-tokens support, an RLConfig validator blocks the combination at config-load time with a clear error pointing users to vllm-router. Tracked for follow-up with the llm-d maintainers.
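
A minimal sketch of the kind of check such a validator might perform. It uses a plain dataclass rather than the project's real config classes, so `RouterCompatSettings` and `check_router_backend` are hypothetical names; only the field names and the blocked combination come from this PR's commit messages.

```python
from dataclasses import dataclass


@dataclass
class RouterCompatSettings:
    """Hypothetical flattened view of the settings the validator inspects."""

    router_backend: str = "vllm-router"
    use_renderer: bool = False                  # orchestrator.use_renderer
    enable_return_routed_experts: bool = False  # inference.enable_return_routed_experts
    enable_router_replay: bool = False          # trainer.enable_router_replay


def check_router_backend(cfg: RouterCompatSettings) -> None:
    """Raise ValueError if llm-d is combined with a raw-tokens feature."""
    if cfg.router_backend != "llm-d":
        return  # vllm-router accepts /inference/v1/generate, nothing to check

    blockers = []
    if cfg.use_renderer:
        blockers.append("orchestrator.use_renderer")
    if cfg.enable_return_routed_experts:
        blockers.append("inference.enable_return_routed_experts")
    if cfg.enable_router_replay:
        blockers.append("trainer.enable_router_replay")

    if blockers:
        raise ValueError(
            'router_backend = "llm-d" is incompatible with '
            + ", ".join(blockers)
            + ": these features POST raw tokens to /inference/v1/generate, "
            'which the llm-d EPP rejects. Use router_backend = "vllm-router".'
        )
```

Failing at config-load time, before any SLURM job is submitted, keeps the error next to the setting that caused it instead of surfacing later as a BadRequest from the EPP.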

Validated on SLURM (Qwen3-30B-A3B, dp=8, 1+1 multi-node and 1+1+1 P/D)

| Path | Result |
| --- | --- |
| router_backend = "llm-d" + multi-node + MITO | ✅ all 5 steps, 257 POSTs to /v1/chat/completions, loss=0.0217 |
| router_backend = "llm-d" + P/D + MITO (with pd-sidecar) | ✅ canonical PD lifecycle: Envoy → sidecar → prefill+decode split; POST counts balance across both vLLMs (prefill ~31, decode ~36 per ~30 inference requests = 1 prefill + 1 decode each). Step 0 → Step 1 completed cleanly. |
| router_backend = "llm-d" + renderer / routed-experts | ❌ blocked at config-validation (intentional) |
| router_backend = "vllm-router" (default) | unchanged |

Files

  • New: scripts/install_llmd.sh, scripts/write_llmd_configs.sh
  • Config: packages/prime-rl-configs/src/prime_rl/configs/{inference,rl}.py (new router_backend, decode_sidecar_port, validator)
  • Plumbing: src/prime_rl/entrypoints/{inference,rl}.py
  • Templates: src/prime_rl/templates/{inference,multi_node_rl}.sbatch.j2 — branch on router_backend; the llm-d branch invokes write_llmd_configs.sh and launches epp/envoy (router) + pd-sidecar (each decode node, PD only).
  • Docs: docs/{slurm,disaggregated-inference,logging}.md, skills/training/start-run/SKILL.md

🤖 Generated with Claude Code

S1ro1 and others added 8 commits May 26, 2026 23:24
Adds `router_backend: Literal["vllm-router", "llm-d"]` to both the RL
`MultiNodeDeploymentConfig` and the per-inference `MultiNodeInferenceDeploymentConfig`
/ `DisaggregatedInferenceDeploymentConfig`. Default stays `vllm-router` so
existing experiments are untouched; `llm-d` opts into the upstream llm-d
EPP+Envoy stack rendered by `scripts/write_llmd_configs.sh`.

An `RLConfig` validator rejects `router_backend = "llm-d"` together with
`orchestrator.use_renderer = true` or routed-experts replay
(`inference.enable_return_routed_experts` / `trainer.enable_router_replay`).
Both features post to `/inference/v1/generate` with prime-rl's raw-tokens
schema (`prompt_token_ids`), which llm-d's EPP openai-parser rejects with
`BadRequest - invalid completions request: must have prompt field`. Until
upstream llm-d adds raw-tokens support, those paths must stay on `vllm-router`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d=llm-d

Both inference.sbatch.j2 and multi_node_rl.sbatch.j2 branch on the new
`router_backend` template var. The `llm-d` branch invokes
`scripts/write_llmd_configs.sh` to render `endpoints.yaml`, `epp.yaml`, and
`envoy.yaml` under `$OUTPUT_DIR/logs/inference/llmd[_$REPLICA_IDX]/`, then
launches `epp` and `envoy` in the background so SLURM's `wait -n` still
catches a router crash.

EPP scoring profiles lifted verbatim from upstream llm-d:
  - multi_node: optimized-baseline (queue / kv-cache-util / prefix-cache /
    max-score-picker; single-profile-handler auto-enabled).
  - disaggregated: pd-disaggregation (disagg-headers + always-disagg-pd-decider +
    disagg-profile-handler + prefill/decode filters; separate `prefill` and
    `decode` schedulingProfiles).

Envoy config uses ORIGINAL_DST + ext_proc to let the EPP pick a backend per
request via the `x-gateway-destination-endpoint` header. Routes `/v1/` and
`/inference/v1/` are declared (the validator already blocks the renderer
path), and `processing_mode` matches the upstream test config
(FULL_DUPLEX_STREAMED body + SEND headers/trailers, message_timeout=1000s).

`scripts/install_llmd.sh` bootstraps a local Go toolchain, `go install`s
the EPP from github.com/llm-d/llm-d-router, and fetches Envoy 1.36.0 from
archive.tetratelabs.io into `third_party/llmd/bin/` (which is gitignored).
No Docker, no sudo.

The cleanup heredoc in both templates also kills stale `epp` / `envoy`
processes between runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- slurm.md: new "Router backend" section covering the `router_backend = "llm-d"`
  toggle, install steps, and where EPP/Envoy log files live. Calls out that
  the validator blocks renderer/routed-experts paths.
- disaggregated-inference.md: notes the pd-disaggregation EPP profile we use
  and the renderer/TITO unsupported caveat.
- logging.md: documents the new `llmd_<replica>/` directory containing the
  generated EPP/Envoy/endpoints YAMLs.
- skills/training/start-run/SKILL.md: brief pointer to the new opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without the pd-sidecar the previous P/D path was effectively broken —
EPP picked a decode pod, set `x-prefiller-host-port`, and the request
landed on plain vLLM which ignores the header. Result: everything ran
locally on the decode worker and the prefill pool sat idle.

Per the [llm-d disagg architecture](https://github.com/llm-d/llm-d-router/blob/main/docs/disaggregation.md),
the decode worker must run vLLM **behind a pd-sidecar** that:

1. Receives the inference request (Envoy → sidecar on `DECODE_SIDECAR_PORT`).
2. Reads `x-prefiller-host-port` set by the EPP's `disagg-profile-handler`.
3. Sends a prefill request (`max_tokens=1`) to the prefill worker.
4. Forwards the actual decode (with `do_remote_prefill=true` + the
   prefill's `kv_transfer_params`) to local vLLM. Decode then pulls KV
   from prefill via NIXL.
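
The request rewriting in steps 2 to 4 can be sketched as follows. This is an illustration under assumptions, not the pd-sidecar's actual code: the function names are hypothetical, and merging `do_remote_prefill` into `kv_transfer_params` is one plausible reading of step 4 rather than a confirmed wire format.

```python
import copy
from typing import Optional


def prefiller_from_headers(headers: dict) -> Optional[str]:
    """Step 2: read the prefill target chosen by the EPP (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("x-prefiller-host-port")


def build_prefill_request(request: dict) -> dict:
    """Step 3: same request, capped at one generated token."""
    prefill = copy.deepcopy(request)
    prefill["max_tokens"] = 1
    return prefill


def build_decode_request(request: dict, prefill_kv_params: dict) -> dict:
    """Step 4: original request plus the KV-transfer params the prefill returned.

    Assumption: do_remote_prefill is carried inside kv_transfer_params.
    """
    decode = copy.deepcopy(request)
    decode["kv_transfer_params"] = {**prefill_kv_params, "do_remote_prefill": True}
    return decode
```

Both builders copy the incoming request, so the client's original `max_tokens` survives into the decode call while the prefill call stays at a single token.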

Changes:
- `install_llmd.sh` also `go install`s `cmd/pd-sidecar` and downloads
  Envoy 1.36 directly (func-e's GH asset URL is broken; older Envoy
  also lacks `FULL_DUPLEX_STREAMED` ext_proc body mode).
- New `inference.deployment.decode_sidecar_port` config field
  (default 8300) plumbed through entrypoints to the SLURM templates.
- `write_llmd_configs.sh` accepts `--decode-sidecar-port N`; when set,
  decode endpoints in `endpoints.yaml` use the sidecar port (not vLLM)
  so EPP's decode-filter routes traffic through the sidecar.
- EPP decider switched from `always-disagg-pd-decider` (testing-only)
  to `prefix-based-pd-decider` with `nonCachedTokens: 8` — canonical
  per the upstream P/D guide.
- Both SLURM templates now launch `pd-sidecar` on each decode node
  with `--kv-connector nixlv2`, `--secure-proxy=false`,
  `--enable-ssrf-protection=false` (skips the K8s InferencePool
  allowlist; we don't need it in standalone). Cleanup pkill targets
  added so stale sidecars don't survive between runs.

Verified on SLURM (Qwen3-30B-A3B, dp=8, 1+2 nodes): Envoy's backend
cluster shows traffic going to the sidecar port, and POST counts split
between prefill (~31) and decode (~36) vLLMs — i.e. one prefill + one
decode call per inference, the canonical PD lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…otal

The sidecar opens one proxy port per local DP rank, so it needs the
intra-node DP count (same value vllm-router gets as
``--intra-node-data-parallel-size``), not the cross-node total.

- inference.sbatch.j2: $EP is NODES_PER_DECODE_REPLICA * GPUS_PER_NODE
  (cross-node), so switch to the templated ``{{ dp_per_node }}``.
- multi_node_rl.sbatch.j2: $DP is NODES_PER_DECODE_REPLICA *
  INFERENCE_DP_LOCAL (cross-node), so switch to $INFERENCE_DP_LOCAL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously endpoints.yaml had one entry per backend node, so EPP saw only
the rank-0 port (8200 for vLLM / 8300 for pd-sidecar). The other ranks
were unreachable from the router; all decode traffic pinned to rank 0
and the per-node DP fan-out the sidecar opened (8301-8307) sat idle.

Fix: `write_llmd_configs.sh` now takes `--dp-size N` and emits N
endpoints per backend (ports `base, base+1, …, base+N-1`). vLLM with
`--api-server-count=N` binds consecutive ports per rank; the pd-sidecar
mirrors them on `primary_port + i`. The EPP's existing queue-scorer and
kv-cache-utilization-scorer (plus prefix-cache) then load-balance across
ranks naturally — no custom dp-profile-handler / X-Data-Parallel-Endpoint
plumbing needed.
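
The port fan-out described above amounts to the expansion below. It is a sketch only: `expand_endpoints` is a hypothetical helper, the real logic lives in the shell script, and the actual `endpoints.yaml` schema is not reproduced here.

```python
from typing import List, Tuple


def expand_endpoints(hosts: List[str], base_port: int, dp_size: int) -> List[Tuple[str, int]]:
    """Return one (host, port) pair per DP rank on every backend host.

    Ports run base_port, base_port + 1, ..., base_port + dp_size - 1.
    """
    if dp_size < 1:
        raise ValueError("dp_size must be at least 1")
    return [(host, base_port + rank) for host in hosts for rank in range(dp_size)]
```

For example, `expand_endpoints(["decode-0"], 8300, 8)` yields ports 8300 through 8307 on `decode-0` (a placeholder hostname), which covers the 8301-8307 sidecar ports that previously sat idle.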

Both SLURM templates pass `--dp-size` (`{{ dp_per_node }}` in
inference.sbatch.j2, `$INFERENCE_DP_LOCAL` in multi_node_rl.sbatch.j2 —
the same value vllm-router gets as `--intra-node-data-parallel-size`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches the llm-d SLURM branch to vLLM's external load balancing
(https://docs.vllm.ai/en/latest/serving/data_parallel_deployment/#external-load-balancing):
each DP rank runs as its own vLLM process on a consecutive port, with
its own GPU and NIXL side-channel port. The EPP can then list all
per-rank endpoints in endpoints.yaml and distribute via its existing
scorers — no custom dp-profile-handler needed, and ports actually exist
(previous attempt hit SO_REUSEPORT on a single base port).

Concretely:
- multi_node_rl.sbatch.j2 (both PD + multi_node llm-d branches): loop over
  ``INFERENCE_DP_LOCAL`` ranks; each ``uv run inference`` gets
  ``CUDA_VISIBLE_DEVICES=$k``, ``VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600+k))``,
  ``--server.port=$((PORT+k))``, ``--data-parallel-rank=$GLOBAL_RANK``,
  ``--data-parallel-size-local 1``, ``--api-server-count 1``. The per-rank
  ``vllm-extra`` is built fresh (no ``data_parallel_hybrid_lb`` —
  incompatible with external LB), including ``data_parallel_address``,
  ``kv_transfer_config``, and the role's all2all backend.
- inference.sbatch.j2 PD branch: same loop pattern.
- ``ADMIN_URLS`` fans out per rank (``base+0`` through ``base+N-1``) so
  ``init_broadcaster`` / ``/pause`` / ``/update_weights`` hit every rank
  (``gpus_per_server`` collapses to 1 in client.py's NCCL setup).
- write_llmd_configs.sh ``--dp-size N``: endpoints.yaml emits N entries
  per backend with consecutive ports — matches the per-rank vLLM ports
  (and the pd-sidecar's fan-out to ports 8300..8300+N-1).
- rl.py ``auto_setup_inference_client``: force ``dp_rank_count = 1`` when
  router_backend = 'llm-d'. When it is >1, the orchestrator pre-pins each
  request to ``X-data-parallel-rank=N`` (vllm-router's contract). With
  llm-d, Envoy load-balances freely across the per-rank endpoints, so a
  request can land on a vLLM whose local rank ≠ the header value and
  fail the rank-range check (``data_parallel_rank N is out of range
  [k, k+1)``). EPP-side session affinity can be added later via the
  ``session-affinity-scorer`` if needed.
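
The per-rank launch parameters from the first bullet can be sketched as below. The names `RankLaunch` and `plan_rank_launches` are hypothetical, and deriving the global rank as `node_index * dp_local + k` is an assumption; the commit message only says each process receives ``$GLOBAL_RANK``.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RankLaunch:
    cuda_visible_devices: str    # CUDA_VISIBLE_DEVICES=$k
    nixl_side_channel_port: int  # VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600+k))
    server_port: int             # --server.port=$((PORT+k))
    data_parallel_rank: int      # --data-parallel-rank=$GLOBAL_RANK


def plan_rank_launches(node_index: int, dp_local: int, base_port: int,
                       nixl_base_port: int = 5600) -> List[RankLaunch]:
    """Plan one vLLM process per local DP rank on a single node."""
    launches = []
    for k in range(dp_local):
        launches.append(
            RankLaunch(
                cuda_visible_devices=str(k),
                nixl_side_channel_port=nixl_base_port + k,
                server_port=base_port + k,
                # Assumption: global ranks are numbered node by node.
                data_parallel_rank=node_index * dp_local + k,
            )
        )
    return launches
```

Each planned process serves exactly one rank, which is why a request pinned to a different rank via ``X-data-parallel-rank`` would fail the range check, and why ``dp_rank_count`` is forced to 1.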

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>