fix uneven PP layer indices #13282
Conversation
Co-authored-by: Xuchun Shang <[email protected]>
Summary of Changes

Hello @alpha-baby, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the implementation of pipeline parallelism (PP) by addressing two key areas. It introduces a more balanced strategy for distributing model layers across PP partitions, ensuring that extra layers are spread out rather than concentrated on a single partition. Additionally, it optimizes the initialization of the language model head so that this final output layer is only instantiated on the last stage of the pipeline, improving resource efficiency.
Code Review
This pull request introduces two key improvements for pipeline parallelism. First, it corrects the uneven distribution of model layers across pipeline stages by ensuring any remainder layers are distributed one by one to the initial stages, leading to a more balanced workload. Second, it optimizes memory usage by initializing the lm_head only on the last pipeline stage, using a placeholder for other stages. The changes are logical and well-implemented. I have one suggestion to refactor the layer distribution logic for improved clarity and maintainability.
base_layers = num_hidden_layers // pp_size
remainder = num_hidden_layers % pp_size
# Distribute the extra layers to the first 'remainder' partitions
if pp_rank < remainder:
    # This partition gets one extra layer
    start_layer = pp_rank * (base_layers + 1)
    end_layer = start_layer + (base_layers + 1)
else:
    # This partition gets only base layers
    start_layer = (
        remainder * (base_layers + 1) + (pp_rank - remainder) * base_layers
    )
    end_layer = start_layer + base_layers
The logic for calculating start_layer and end_layer is correct, but it can be simplified for better readability and maintainability. You can calculate start_layer with a single, more intuitive formula and then determine end_layer based on whether the current rank receives an extra layer.
Suggested change:

base_layers = num_hidden_layers // pp_size
remainder = num_hidden_layers % pp_size
# Each rank `i` is assigned `base_layers + (1 if i < remainder else 0)` layers.
# The start layer for `pp_rank` is the sum of layers for all previous ranks.
start_layer = pp_rank * base_layers + min(pp_rank, remainder)
if pp_rank < remainder:
    # This partition gets an extra layer.
    end_layer = start_layer + base_layers + 1
else:
    # This partition gets base layers.
    end_layer = start_layer + base_layers
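For reference, here is a small standalone check (purely illustrative, not part of the PR) showing that the suggested `min()`-based formula produces exactly the same partitioning as the original if/else version:

```python
# Sanity check: the original if/else partitioning and the suggested
# min()-based formula assign identical (start_layer, end_layer) ranges.
def partition_original(num_hidden_layers, pp_size, pp_rank):
    base_layers = num_hidden_layers // pp_size
    remainder = num_hidden_layers % pp_size
    if pp_rank < remainder:
        start_layer = pp_rank * (base_layers + 1)
        end_layer = start_layer + (base_layers + 1)
    else:
        start_layer = (
            remainder * (base_layers + 1) + (pp_rank - remainder) * base_layers
        )
        end_layer = start_layer + base_layers
    return start_layer, end_layer


def partition_suggested(num_hidden_layers, pp_size, pp_rank):
    base_layers = num_hidden_layers // pp_size
    remainder = num_hidden_layers % pp_size
    start_layer = pp_rank * base_layers + min(pp_rank, remainder)
    end_layer = start_layer + base_layers + (1 if pp_rank < remainder else 0)
    return start_layer, end_layer


for num_layers in (61, 60, 7):
    for pp_size in (1, 2, 8):
        for pp_rank in range(pp_size):
            assert partition_original(num_layers, pp_size, pp_rank) == \
                   partition_suggested(num_layers, pp_size, pp_rank)
```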
LGTM
ShangmingCai left a comment:
Logic LGTM. This helps address the unbalanced weight distribution when num_hidden_layers is not evenly divisible by the PP size (for example, DeepSeek's 61 layers with a PP size of 8: the original implementation yields 7, 7, 7, 7, 7, 7, 7, 12 layers per rank, while the new implementation yields 8, 8, 8, 8, 8, 7, 7, 7). Good enough as a bugfix when SGLANG_PP_LAYER_PARTITION is not set.
nit: I wonder if there would be a performance difference if the remainder layers were placed at the end instead.
According to theoretical analysis, if every layer is identical, the performance is the same whether the remainder layers are placed at the front or at the back.
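To make the 61-layer example above concrete, here is a small illustrative snippet; the reconstruction of the old behavior (all leftover layers assigned to the last rank) is based only on the description in this thread, not on the original source:

```python
# Illustrative only: reproduce the per-rank layer counts described above
# for DeepSeek's 61 hidden layers with pp_size = 8.
num_layers, pp_size = 61, 8
base, rem = divmod(num_layers, pp_size)

# Old behavior as described: every rank gets `base` layers and the last
# rank absorbs the entire remainder.
old = [base] * (pp_size - 1) + [base + rem]

# New behavior: the first `rem` ranks each get one extra layer.
new = [base + 1 if rank < rem else base for rank in range(pp_size)]

print(old)  # [7, 7, 7, 7, 7, 7, 7, 12]
print(new)  # [8, 8, 8, 8, 8, 7, 7, 7]
```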
The b200 and gb200 CI were flaky when this PR was created (fixed this morning); the other tests have passed: https://github.com/sgl-project/sglang/actions/runs/19390457057/job/55561045148 Since this PR also modifies some DeepSeek code, let me merge main to see if we can pass b200 and gb200 for safety. I will merge it once these two tests pass.

Why fix this?
Model: DeepSeek V3.1
I'm using a model parallelism configuration of TP1 and PP8 for my large model.
My machine's GPU memory is very large, but max_total_num_tokens is very small, which is not in line with expectations. I found the root cause of this problem through analysis.
I observed that the distribution of layers across PP ranks is very uneven, and the memory usage differs significantly between ranks.
This PR fixes two points:
Fix the uneven layer distribution when PP is enabled.
Only the last PP rank initializes ParallelLMHead (see the sketch below).
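A minimal sketch of the second point in plain PyTorch; the class and helper names here (PPMissingLayer, build_lm_head) are illustrative and may not match sglang's actual implementation:

```python
import torch.nn as nn


class PPMissingLayer(nn.Module):
    """Stand-in for modules that live on a different pipeline stage (illustrative)."""

    def forward(self, *args, **kwargs):
        raise RuntimeError("lm_head is only materialized on the last PP rank")


def build_lm_head(hidden_size: int, vocab_size: int,
                  pp_rank: int, pp_size: int) -> nn.Module:
    # Only the last pipeline stage computes logits, so only it needs to
    # allocate the (potentially very large) vocabulary projection weights.
    if pp_rank == pp_size - 1:
        return nn.Linear(hidden_size, vocab_size, bias=False)
    return PPMissingLayer()
```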
Thanks! Co-author: @XucSh
Thanks! Co-tester: @whybeyoung