Add Zenflow code for Stage 1 & 2 #7391
Conversation
Hi @Antlera, thank you for submitting a great PR!
I added some comments. Overall, I think we need to separate the ZenFlow code and minimize ZenFlow-related changes in the existing code.
Hi @tohtana, thank you for the thoughtful review and suggestions! I tried my best to avoid adding ZenFlow logic directly into the engine and the ZeRO optimizer. But for some shared functions like average_tensor, fully separating it would mean rewriting a large function with mostly duplicated code, which could make future maintenance harder when upstream changes. I'm happy to improve this further if that is considered better practice; I'm just not entirely sure full separation is the right trade-off here.
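To make the trade-off above concrete, here is a minimal sketch of one possible hook-based separation, where the shared reduction path stays untouched and ZenFlow-only behavior is injected from outside (all names such as `SharedReducer` and `zenflow_post_reduce` are hypothetical, not the code in this PR):

```python
# Hypothetical sketch: keep ZenFlow-specific behavior behind an optional hook
# so shared functions such as average_tensor() stay close to upstream.
from typing import Callable, Optional

import torch


class SharedReducer:
    """Stands in for the shared ZeRO reduction path."""

    def __init__(self, post_reduce_hook: Optional[Callable[[torch.Tensor], None]] = None):
        # When the hook is None, behavior is identical to the upstream code path.
        self.post_reduce_hook = post_reduce_hook

    def average_tensor(self, tensor: torch.Tensor, world_size: int) -> torch.Tensor:
        tensor = tensor / world_size  # shared logic, unchanged from upstream
        if self.post_reduce_hook is not None:
            self.post_reduce_hook(tensor)  # ZenFlow-only logic lives outside this function
        return tensor


def zenflow_post_reduce(tensor: torch.Tensor) -> None:
    # Placeholder for ZenFlow-specific handling (e.g., staging gradients for offload).
    pass


reducer = SharedReducer(post_reduce_hook=zenflow_post_reduce)
averaged = reducer.average_tensor(torch.ones(4), world_size=2)
```

With the hook defaulting to None, upstream changes to the shared function would not require touching the ZenFlow path.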
Hi @Antlera,
- Add ZenFlowCPUAdam and ZenFlowSelectiveAdamW for selective updates
- Implement ZenFlowZeroOptimizer and its parallel variant
- Support gradient offloading and communication overlap
- Implement (un)flatten ops for column-major layout

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
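As an illustration of the last bullet, here is a rough sketch of column-major (un)flatten for 2D tensors; the actual ops added in this commit may handle dtypes, padding, and 1D parameters differently:

```python
# Illustrative sketch of column-major (un)flatten for 2D tensors.
import torch


def flatten_column_major(tensors):
    """Flatten each 2D tensor so that columns are laid out contiguously."""
    return torch.cat([t.t().contiguous().view(-1) for t in tensors])


def unflatten_column_major(flat, shapes):
    """Reverse flatten_column_major given the original (rows, cols) shapes."""
    outputs, offset = [], 0
    for rows, cols in shapes:
        numel = rows * cols
        chunk = flat[offset:offset + numel]
        outputs.append(chunk.view(cols, rows).t().contiguous())
        offset += numel
    return outputs


params = [torch.randn(3, 5), torch.randn(2, 4)]
flat = flatten_column_major(params)
restored = unflatten_column_major(flat, [p.shape for p in params])
assert all(torch.equal(a, b) for a, b in zip(params, restored))
```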
- Define ZenFlowConfig with support for selective update parameters
- Add validation for ZenFlow-related config fields

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
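For illustration, a sketch of what a validated ZenFlow config might look like; the field names below (`topk_ratio`, `select_strategy`, `update_interval`, `full_warm_up_rounds`) and the plain-pydantic base class are assumptions, not the actual ZenFlowConfig definition:

```python
# Rough sketch of a validated ZenFlow config; field names are assumptions.
from pydantic import BaseModel, field_validator


class ZenFlowConfigSketch(BaseModel):
    topk_ratio: float = 0.1          # fraction of parameters selected for GPU updates
    select_strategy: str = "auto"    # "auto", "step", or "epoch"
    update_interval: int = 4         # accumulation steps between offloaded updates
    full_warm_up_rounds: int = 0     # rounds of full (non-selective) updates at start

    @field_validator("topk_ratio")
    @classmethod
    def check_ratio(cls, v: float) -> float:
        if not 0.0 < v <= 1.0:
            raise ValueError("topk_ratio must be in (0, 1]")
        return v


cfg = ZenFlowConfigSketch(**{"topk_ratio": 0.2, "select_strategy": "step"})
```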
- Implement ZenFlow configuration and optimizer support in DeepSpeedEngine
- Introduce methods for configuring ZenFlow parameters and handling selective updates
- Enhance optimizer selection logic to accommodate ZenFlow optimizers
- Update step function to manage ZenFlow-specific behaviors during training

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Introduce tests to validate the behavior of DeepSpeedZeroConfig with various ZenFlowConfig configurations, including stage enumeration and offload optimizer settings.
- Ensure proper coercion of dictionary inputs into ZenFlowConfig and validate error handling for incorrect types.
- Test combined usage of offload_optimizer and zenflow configurations under stage 2.

Signed-off-by: Tingfeng Lan <[email protected]>
- Fix initialization logic for ZenFlowCPUAdam
- Fix gradient update issues in ZenFlowSelectiveAdamW

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Introduce tests for ZenFlowSelectiveAdamW covering both offload and non-offload modes.
- Validate step and group_step behavior with selected index updates and temporary parameter storage.
- Ensure correct handling of 1D and 2D parameters, as well as proper gradient/state cleanup after updates.
- Verify state increment logic and compatibility with PyTorch's native AdamW for numerical correctness.

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
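A sketch of the numerical-correctness idea behind these tests: an update applied only to selected columns should match `torch.optim.AdamW` applied to those columns alone, while unselected columns stay untouched. The inline selective step below is a stand-in, not the ZenFlowSelectiveAdamW API:

```python
# Stand-in demonstration of the property the tests check.
import torch


def reference_update(columns: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Apply one plain AdamW step to a detached copy of the selected columns."""
    param = columns.clone().requires_grad_(True)
    opt = torch.optim.AdamW([param], lr=1e-3)
    param.grad = grad.clone()
    opt.step()
    return param.detach()


torch.manual_seed(0)
weight = torch.randn(4, 6)
grad = torch.randn(4, 6)
selected = torch.tensor([1, 3, 4])  # column indices chosen for the GPU-side update

# Stand-in selective step: update only the selected columns.
updated = weight.clone()
updated[:, selected] = reference_update(weight[:, selected], grad[:, selected])

# Unselected columns must be left untouched by a selective update.
untouched = [c for c in range(6) if c not in selected.tolist()]
assert torch.equal(updated[:, untouched], weight[:, untouched])
```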
- Introduce a new tutorial for ZenFlow, detailing its configuration and usage in DeepSpeed.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
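For reference, a hypothetical example of enabling ZenFlow in a DeepSpeed config; the exact key names are assumptions and may differ from the tutorial:

```python
# Hypothetical DeepSpeed config with a zenflow block; key names are assumed.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "zenflow": {
            "topk_ratio": 0.1,          # assumed key: fraction of params updated on GPU
            "select_strategy": "auto",  # assumed key: "auto", "step", or "epoch"
            "update_interval": 4,       # assumed key: steps between offloaded updates
            "full_warm_up_rounds": 0,   # assumed key: initial full-update rounds
        },
    },
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# Typical usage (model defined elsewhere):
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```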
Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Tingfeng Lan <[email protected]>
- Updated methods to accept communication_data_type as a parameter for better handling of IPG buckets.
- Removed debug print statements to clean up the code.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Move `_configure_zenflow` logic to a standalone `configure_zenflow()` function in `zenflow_utils.py`
- Refactor ZenFlow placement to decouple it from ZeRO internals

Signed-off-by: Tingfeng Lan <[email protected]>
- Simplify the `_configure_zenflow` method by assigning it a lambda function that calls `configure_zenflow(self)`.
- Update the optimizer's selective learning rate synchronization to directly reference `self.optimizer._sync_selective_optimizer_lr()`.

Signed-off-by: Tingfeng Lan <[email protected]>
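A minimal sketch of the delegation pattern this describes; the attribute names inside `configure_zenflow` are assumptions:

```python
# Minimal sketch: the engine method becomes a thin wrapper around a standalone
# function living in zenflow_utils.py.
def configure_zenflow(engine):
    """Standalone configuration logic; 'engine' stands in for the DeepSpeedEngine."""
    zf_config = getattr(engine, "zenflow_config", None)  # attribute name assumed
    if zf_config is None:
        return
    # ... apply ZenFlow settings to the engine/optimizer here ...


class EngineSketch:
    def __init__(self):
        self.zenflow_config = {"topk_ratio": 0.1}
        # The engine keeps a one-line binding instead of the full method body.
        self._configure_zenflow = lambda: configure_zenflow(self)


engine = EngineSketch()
engine._configure_zenflow()
```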
- Fixed the invocation of `reduce_gradients` in ZenFlow + ZeRO Stage 1
- Corrected the reduction logic in `extra_large_grad_reduc` to handle gradient aggregation properly
- Fixed a bug where ZenFlow could not initialize if the user did not provide a dataset

Signed-off-by: Yusen Wu <[email protected]>
- Implemented single-GPU and distributed tests for ZenFlow with ZeRO Stage 1 and 2
- Covered various configurations of selective optimizer offloading, selection strategies (auto/step/epoch), update intervals, and warm-up rounds
- Ensured ZenFlow can initialize and train under different parameter combinations

Signed-off-by: Yusen Wu <[email protected]>
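An illustrative sketch of how such a test matrix could be parametrized; the real tests presumably use DeepSpeed's distributed test harness and a full training loop rather than this skeleton:

```python
# Skeleton of a parametrized ZenFlow test matrix (illustration only).
import pytest


@pytest.mark.parametrize("zero_stage", [1, 2])
@pytest.mark.parametrize("select_strategy", ["auto", "step", "epoch"])
@pytest.mark.parametrize("offload_selective_optimizer", [True, False])
def test_zenflow_initializes(zero_stage, select_strategy, offload_selective_optimizer):
    # Build a ds_config with the given combination and run a few training steps.
    # (Engine initialization and training loop omitted in this sketch.)
    pass
```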
Signed-off-by: Tingfeng Lan <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]>
@sfc-gh-truwase All copyright issues have been fixed.
Signed-off-by: Tingfeng Lan <[email protected]> Co-authored-by: Guokai Ma <[email protected]>
@delock, do you have additional concerns, or can we merge this? Thanks!
Since the workflow installs the latest CPU build by default, it pulled torch 2.8.0+cpu, which caused the version check in tests/conftest.py to fail and exit.

```
=================================== FAILURES ===================================
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-3] ______
[gw3] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
___________________ TestNoSyncCtxt.test_zero_stage[0-dtype2] ___________________
[gw0] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-2] ______
[gw1] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
```
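For context, this is roughly the kind of gate in tests/conftest.py that produces the error above (the actual implementation may differ); pinning the workflow's torch install to 2.7, or updating the expected version, would avoid the mismatch:

```python
# Rough sketch of a torch version gate like the one in tests/conftest.py.
import pytest
import torch


def check_torch_version(expected: str = "2.7") -> None:
    installed = torch.__version__  # e.g. "2.8.0+cpu"
    if not installed.startswith(expected):
        pytest.exit(f"expected torch version {expected} did not match found torch version {installed}")
```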
@sfc-gh-truwase Merged to check the new unit tests.
@sfc-gh-truwase Not sure what is happening with the new errors in CI. It says:

```
Run modal run -m ci.torch_latest
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ Token missing. Could not authenticate client. If you have token credentials, │
│ see modal.com/docs/reference/modal.config for setup help. If you are a new │
│ user, register an account at modal.com, then run `modal token new`. │
╰──────────────────────────────────────────────────────────────────────────────╯
Error: Process completed with exit code 1.
```
This might be related to #7289. Likely cause: the CI failures on forked PRs are due to missing Modal authentication credentials.
Merged to check the new CI. Maybe re-running it will solve the problem. I assume this will bring the branch up to date.
This PR adds a blog post and images for ZenFlow, introducing its design, benefits, and usage. The blog explains how ZenFlow improves GPU utilization by overlapping computation and communication during offloaded training.

See also:
- deepspeedai#7391 – core ZenFlow implementation.
- [deepspeedai#982](deepspeedai/DeepSpeedExamples#982) – benchmarking and fine-tuning example.

---------
Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Signed-off-by: lym <[email protected]>
This PR adds ZenFlow, an importance-aware offloaded training framework for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between computation and communication during offloaded training, improving GPU utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will be introduced in a follow-up PR.

---------
Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Guokai Ma <[email protected]>
Signed-off-by: lym <[email protected]>
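A conceptual sketch of the importance-aware selection idea described above (not the actual ZenFlow implementation): rank gradient columns by norm and update only the top fraction on GPU, leaving the rest to the offloaded optimizer:

```python
# Conceptual sketch of importance-aware column selection.
import torch


def select_important_columns(grad: torch.Tensor, topk_ratio: float) -> torch.Tensor:
    """Return indices of the columns with the largest gradient norms."""
    col_norms = grad.norm(dim=0)
    k = max(1, int(topk_ratio * grad.shape[1]))
    return torch.topk(col_norms, k).indices


grad = torch.randn(128, 512)
selected = select_important_columns(grad, topk_ratio=0.1)
# Only `selected` columns would be updated immediately on GPU; the remaining
# columns accumulate and are updated later by the CPU (offloaded) optimizer.
```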