[DO NOT MERGE]: swap vLLM wheel to upstream v0.21.0 for LoRA + EP #2620
Open
JohannesHa wants to merge 2 commits into
The PrimeIntellect custom wheel (0.21.0+cu129.r42434.pr39568.a106aa6) was
branched off vLLM main before PR #40867 ("Initial EP support for LoRA",
merged 2026-05-09) landed, so it still asserts against
`base_layer.use_ep` in `FusedMoEWithLoRA` and rejects MoE LoRA + EP.
Upstream v0.21.0 (cut 2026-05-15) carries the merge and serves
EP + LoRA cleanly. Verified locally on 2 GPUs with Qwen3-30B-A3B +
jeeejeee/qwen3-moe-text2sql-spider at tp=2, EP=true.
Side effect: the routed-experts / NIXL P/D path that the fork wheel
carried is no longer available, so prime-rl's
`enable_return_routed_experts` reads are guarded with `getattr` to keep
the patches/serving-tokens imports working on the upstream wheel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
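The `getattr` guard described above can be sketched as follows. This is a minimal illustration, not the actual prime-rl patch: `ForkModelConfig`, `UpstreamModelConfig`, `wants_routed_experts`, and `shape_response` are hypothetical stand-ins, and only the attribute name `enable_return_routed_experts` and the `False` default come from the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForkModelConfig:
    """Hypothetical stand-in for a config object that carries the field."""
    enable_return_routed_experts: bool = True


@dataclass
class UpstreamModelConfig:
    """Hypothetical stand-in for a config object that lacks the field."""
    max_model_len: int = 4096


def wants_routed_experts(config: object) -> bool:
    # A plain attribute read would raise AttributeError on the stand-in
    # without the field; getattr with a default treats "missing" as "disabled".
    return bool(getattr(config, "enable_return_routed_experts", False))


def shape_response(config: object, payload: dict,
                   routed_experts: Optional[list]) -> dict:
    # Attach routed-experts data only when the flag exists and is set
    # (assumed response shape, for illustration only).
    out = dict(payload)
    if wants_routed_experts(config) and routed_experts is not None:
        out["routed_experts"] = routed_experts
    return out
```

With this pattern the same code path imports and runs whether or not the config exposes the flag; on a config without it, the routed-experts branch is simply skipped.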
Note
Medium Risk
Swapping the x86_64 `vllm` wheel from a custom fork to upstream may change inference/runtime behavior and performance, especially around P/D (NIXL) and LoRA/EP paths. The code changes are small but sit in inference patching/serving logic and could affect routed-experts response shaping.

Overview
Moves x86_64 installs off the PrimeIntellect custom `vllm` wheel to the upstream `v0.21.0` CUDA 12.9 wheel (with corresponding `uv.lock` updates), primarily to pick up upstream LoRA + expert-parallel support.

Updates prime-rl's routed-experts/NIXL integration to be compatible with upstream vLLM by guarding `enable_return_routed_experts` reads with `getattr(..., False)` in both the vLLM config `__post_init__` patch and the `/inference/v1/generate` non-streaming response post-processing path.

Reviewed by Cursor Bugbot for commit 80137f3. Bugbot is set up for automated code reviews on this repo.