Skip to content

[DO NOT MERGE]: swap vLLM wheel to upstream v0.21.0 for LoRA + EP #2620

Open
JohannesHa wants to merge 2 commits into main from feat/vllm-upstream-lora-ep

@JohannesHa JohannesHa commented May 24, 2026

The PrimeIntellect custom wheel (0.21.0+cu129.r42434.pr39568.a106aa6) was branched off vLLM main before PR #40867 ("Initial EP support for LoRA", merged 2026-05-09) landed, so it still asserts against `base_layer.use_ep` in `FusedMoEWithLoRA` and rejects MoE LoRA + EP. Upstream v0.21.0 (cut 2026-05-15) carries the merge and serves EP + LoRA cleanly. Verified locally on 2 GPUs with Qwen3-30B-A3B + jeeejeee/qwen3-moe-text2sql-spider at tp=2, EP=true.
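For anyone wanting to repeat a check along these lines, here is a minimal sketch using vLLM's offline Python API. It is not the exact procedure used for this PR (which is not shown here). Assumptions: the `Qwen/` org prefix in the usage comment, the prompt, and the helper names `ep_lora_engine_kwargs` and `smoke_test` are all hypothetical, and depending on the adapter's rank you may also need to raise `max_lora_rank`.

```python
def ep_lora_engine_kwargs(tp_size: int = 2) -> dict:
    """Engine arguments mirroring the tp=2, EP=true setup described above."""
    return {
        "tensor_parallel_size": tp_size,
        "enable_expert_parallel": True,
        "enable_lora": True,
    }


def smoke_test(model: str, adapter_repo: str) -> str:
    """Load the base model with EP + LoRA and run one adapter-backed request."""
    # Imported lazily so the kwargs helper stays usable without vLLM or GPUs.
    from huggingface_hub import snapshot_download
    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    llm = LLM(model=model, **ep_lora_engine_kwargs())
    # LoRARequest takes a name, a positive integer id, and a local adapter path.
    adapter = LoRARequest("text2sql", 1, snapshot_download(adapter_repo))
    outputs = llm.generate(
        ["List the names of all singers."],  # illustrative prompt only
        SamplingParams(temperature=0.0, max_tokens=64),
        lora_request=adapter,
    )
    return outputs[0].outputs[0].text


# Usage (needs 2 GPUs; the model id prefix is an assumption):
# print(smoke_test("Qwen/Qwen3-30B-A3B", "jeeejeee/qwen3-moe-text2sql-spider"))
```

On the fork wheel, engine construction or adapter loading would be expected to trip the assertion described above; on upstream v0.21.0 the call should return generated text.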

Side effect: the routed-experts / NIXL P/D path that the fork wheel carried is no longer available, so prime-rl's `enable_return_routed_experts` reads are guarded with `getattr` to keep the patches/serving-tokens imports working on the upstream wheel.
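The guard pattern can be sketched in isolation as follows. This is an illustration only, not the prime-rl patch itself: `routed_experts_enabled` is a hypothetical helper, and the `SimpleNamespace` objects stand in for whatever config object the fork and upstream wheels actually expose.

```python
from types import SimpleNamespace


def routed_experts_enabled(config) -> bool:
    """Read the flag defensively; a missing attribute means 'disabled'."""
    # On the fork wheel the attribute exists; on upstream it may not,
    # so a plain attribute access would raise AttributeError.
    return bool(getattr(config, "enable_return_routed_experts", False))


# Stand-ins for config objects (hypothetical, for illustration only).
fork_like = SimpleNamespace(enable_return_routed_experts=True)
upstream_like = SimpleNamespace()

print(routed_experts_enabled(fork_like))      # True
print(routed_experts_enabled(upstream_like))  # False
```

Defaulting to `False` means the routed-experts code paths are silently skipped on the upstream wheel rather than failing at import or request time.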


Note

Medium Risk
Swapping the x86_64 vllm wheel from a custom fork to upstream may change inference/runtime behavior and performance, especially around P/D (NIXL) and LoRA/EP paths. The code changes are small but sit in inference patching/serving logic and could affect routed-experts response shaping.

Overview
Moves x86_64 installs off the PrimeIntellect custom vllm wheel to the upstream v0.21.0 CUDA 12.9 wheel (with corresponding uv.lock updates), primarily to pick up upstream LoRA + expert-parallel support.

Updates prime-rl's routed-experts/NIXL integration to be compatible with upstream vLLM by guarding `enable_return_routed_experts` reads with `getattr(..., False)` in both the vLLM config `__post_init__` patch and the `/inference/v1/generate` non-streaming response post-processing path.

Reviewed by Cursor Bugbot for commit 80137f3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JannikSt JannikSt left a comment

great job

@JohannesHa JohannesHa changed the title from "feat(inference): swap vLLM wheel to upstream v0.21.0 for LoRA + EP" to "[DO NOT MERGE]: swap vLLM wheel to upstream v0.21.0 for LoRA + EP" on May 24, 2026