feat: introduce async rebalance mode for dynamic EPLB #3
Open
TheBasy wants to merge 1 commit into
Add support for asynchronous rebalancing in the Expert Parallel Load Balancer (EPLB) to avoid blocking the decoding loop during load analysis. This enables continuous token generation while rebalance computation runs in the background.

New CLI argument:
- --enable-eplb-rebalance-async: enables asynchronous rebalancing mode

Implementation details:
- Launch background thread to:
  - Broadcast logical_count
  - Compute ExpertLocationMetadata
  - Store result in _rebalance_result
- Use TP barrier via gloo cpu_group (send_single_signal / recv_single_signal) to ensure all ranks atomically enter the counter-swap phase
- Introduce yield-based generator to keep decoding loop non-blocking
- Model state transfer starts only after TP-wide agreement via _begin_transfer
- Sync mode remains unchanged: uses blocking single-thread rebalance

This change improves latency stability under dynamic load conditions in MoE models.
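The background-thread and yield-based generator pattern described above can be sketched roughly as follows. This is a minimal single-process illustration, not the code in this PR: `AsyncRebalancer`, `run_decode_loop`, and the `compute_metadata` callback are hypothetical names, and the cross-rank steps (broadcasting logical_count, the TP barrier over the gloo cpu_group, and `_begin_transfer`) are left out. Only `_rebalance_result` and `logical_count` are taken from the description.

```python
import threading
from typing import Callable, Dict, Generator, List, Optional


class AsyncRebalancer:
    """Hypothetical sketch of the async path; not the PR's actual class."""

    def __init__(self, compute_metadata: Callable[[List[int]], Dict]):
        # compute_metadata stands in for the ExpertLocationMetadata computation.
        self._compute_metadata = compute_metadata
        self._rebalance_result: Optional[Dict] = None
        self._error: Optional[BaseException] = None
        self._done = threading.Event()

    def _worker(self, logical_count: List[int]) -> None:
        # The PR's worker also broadcasts logical_count across ranks first;
        # that step is omitted in this single-process sketch.
        try:
            self._rebalance_result = self._compute_metadata(logical_count)
        except BaseException as exc:
            self._error = exc
        finally:
            # Always signal completion so the generator cannot spin forever.
            self._done.set()

    def rebalance(self, logical_count: List[int]) -> Generator[None, None, Dict]:
        """Start the computation, then yield until the result is ready."""
        self._done.clear()
        self._rebalance_result = None
        self._error = None
        thread = threading.Thread(
            target=self._worker, args=(list(logical_count),), daemon=True
        )
        thread.start()
        while not self._done.is_set():
            yield  # hand control back to the decoding loop
        thread.join()
        if self._error is not None:
            raise self._error
        return self._rebalance_result


def run_decode_loop(rebalancer, logical_count, decode_step):
    """Interleave decode steps with polling of the rebalance generator."""
    gen = rebalancer.rebalance(logical_count)
    steps = 0
    while True:
        try:
            next(gen)
        except StopIteration as stop:
            # The generator's return value carries the finished result.
            return stop.value, steps
        decode_step()
        steps += 1
```

In this sketch the decoding loop only pays the cost of one `Event` check per step. A real multi-rank implementation would still need every rank to agree that the result is ready before swapping state, which is the role the description assigns to the TP barrier.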
Motivation
In Mixture-of-Experts (MoE) models, the Expert Parallel Load Balancer (EPLB) periodically performs load analysis to ensure balanced expert utilization across devices. However, in the current synchronous implementation, this rebalancing process blocks the decoding loop, introducing unpredictable latency spikes—especially under dynamic workloads where frequent rebalance decisions are required. This blocking behavior degrades end-to-end inference performance and undermines the predictability of token generation, which is critical for real-time or interactive applications.
To address this issue, we propose an asynchronous rebalancing mechanism that decouples the computationally intensive load analysis from the token generation pipeline. By offloading rebalance computation to a background thread, the main decoding loop remains non-blocking, enabling continuous token production while maintaining accurate load balancing. This enhancement improves latency stability and system responsiveness under dynamic load conditions, without compromising the correctness or convergence of the rebalancing logic.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist