Conversation

Antlera commented Jun 27, 2025

This PR adds ZenFlow, an importance-aware offloaded training framework for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between computation and communication during offloaded training, improving GPU utilization and reducing stalls.

Highlights:

  • New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
  • ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
  • Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
  • Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will be introduced in a follow-up PR.
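
For readers new to the config surface, below is a minimal sketch of how ZenFlow might be enabled alongside ZeRO Stage 2 optimizer offloading. The `zenflow` field names (`select_strategy`, `update_interval`, `overlap_step`) are illustrative assumptions drawn from the parameters discussed in this thread, not a verified schema; see the ZenFlow tutorial added in this PR for the authoritative keys.

```python
# Hedged sketch: the "zenflow" keys below are assumptions, not the verified schema.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "zenflow": {
            "select_strategy": "auto",   # thread mentions auto/step/epoch strategies
            "update_interval": 4,        # how often selected params are refreshed (assumed)
            "overlap_step": True,        # overlap CPU optimizer step with GPU compute (assumed)
        },
    },
}

# Typical usage (model defined elsewhere):
# import deepspeed
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```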

tohtana left a comment


Hi @Antlera, thank you for submitting a great PR!

I added some comments. Overall, I think we need to separate the ZenFlow code and minimize ZenFlow-related changes in the existing code.

Antlera commented Jun 27, 2025

> Hi @Antlera, thank you for submitting a great PR!
>
> I added some comments. Overall, I think we need to separate the ZenFlow code and minimize ZenFlow-related changes in the existing code.

Hi @tohtana, thank you for the thoughtful review and suggestions!

I tried my best to avoid adding ZenFlow logic directly into the engine and the ZeRO optimizer. But for some shared functions like average_tensor, fully separating them would mean rewriting a large function with mostly duplicated code, which might make future maintenance harder when the upstream code changes.

I’m happy to improve this further if this is considered better practice — I’m just not entirely sure whether full separation is the right trade-off here.

tohtana commented Jun 27, 2025

Hi @Antlera,
I agree that it wouldn't be a good idea to attempt full separation. The engine and optimizer are not currently designed for flexible extensions.
Can you try to separate the parts I mentioned first? Then we can discuss whether there is a chance to do more. If you have any concerns, please share them here.

Antlera and others added 12 commits June 28, 2025 00:27
- Add ZenFlowCPUAdam and ZenFlowSelectiveAdamW for selective updates
- Implement ZenFlowZeroOptimizer and its parallel variant
- Support gradient offloading and communication overlap
- Implement (un)flatten ops for column-major layout

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
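
The column-major (un)flatten ops mentioned in the commit above make each column of a 2D parameter a contiguous slice of the flat buffer, so selected columns can be gathered and scattered cheaply. A minimal PyTorch sketch of the idea (not the actual ZenFlow implementation):

```python
import torch

def flatten_column_major(tensors):
    # Transpose each 2D tensor before flattening so that every column of the
    # original parameter becomes a contiguous chunk of the flat buffer.
    return torch.cat([t.t().contiguous().view(-1) for t in tensors])

def unflatten_column_major(flat, shapes):
    # Inverse operation: slice the flat buffer and restore row-major tensors.
    outs, offset = [], 0
    for rows, cols in shapes:
        n = rows * cols
        outs.append(flat[offset:offset + n].view(cols, rows).t().contiguous())
        offset += n
    return outs

w1 = torch.arange(6.).reshape(2, 3)
w2 = torch.arange(8.).reshape(4, 2)
flat = flatten_column_major([w1, w2])
r1, r2 = unflatten_column_major(flat, [(2, 3), (4, 2)])
assert torch.equal(r1, w1) and torch.equal(r2, w2)
```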
- Define ZenFlowConfig with support for selective update parameters
- Add validation for ZenFlow-related config fields

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Implement ZenFlow configuration and optimizer support in DeepSpeedEngine
- Introduce methods for configuring ZenFlow parameters and handling selective updates
- Enhance optimizer selection logic to accommodate ZenFlow optimizers
- Update step function to manage ZenFlow-specific behaviors during training

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Introduce tests to validate the behavior of DeepSpeedZeroConfig with various configurations for ZenFlowConfig, including stage enumeration and offload optimizer settings.
- Ensure proper coercion of dictionary inputs into ZenFlowConfig and validate error handling for incorrect types.
- Test combined usage of offload_optimizer and zenflow configurations under stage 2.

Signed-off-by: Tingfeng Lan <[email protected]>
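
The coercion and type-validation behavior exercised by these tests follows the usual pydantic pattern used by DeepSpeed's config models. Below is a self-contained sketch with toy models; the fields are illustrative, not the real ZenFlowConfig/DeepSpeedZeroConfig schema:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class ToyZenFlowConfig(BaseModel):
    select_strategy: str = "auto"     # illustrative field
    update_interval: int = 4          # illustrative field

class ToyZeroConfig(BaseModel):
    stage: int = 0
    zenflow: Optional[ToyZenFlowConfig] = None

# A plain dict is coerced into the nested config model ...
cfg = ToyZeroConfig(stage=2, zenflow={"select_strategy": "step"})
assert isinstance(cfg.zenflow, ToyZenFlowConfig)

# ... while a value of the wrong type is rejected with a validation error.
try:
    ToyZeroConfig(stage=2, zenflow=123)
except ValidationError as err:
    print("rejected:", err.errors()[0]["type"])
```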
- Fix initialization logic for ZenFlowCPUAdam
- Fix gradient update issues in ZenFlowSelectiveAdamW

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Introduce tests for ZenFlowSelectiveAdamW covering both offload and non-offload modes.
- Validate step and group_step behavior with selected index updates and temporary parameter storage.
- Ensure correct handling of 1D and 2D parameters, as well as proper gradient/state cleanup after updates.
- Verify state increment logic and compatibility with PyTorch's native AdamW for numerical correctness.

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
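
The numerical check described in the commit above can be approximated outside DeepSpeed: because AdamW's update is element-wise, updating only the selected columns should match running torch.optim.AdamW on a tensor holding just those columns. A hedged, self-contained sketch of that idea (not the actual unit test, and not ZenFlowSelectiveAdamW itself):

```python
import torch

torch.manual_seed(0)
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01

full = torch.randn(4, 6)                       # a toy 2D parameter
grad = torch.randn(4, 6)                       # its gradient
cols = torch.tensor([1, 3, 4])                 # columns selected for update

# Reference: native AdamW applied to a parameter holding only the selected columns.
ref = torch.nn.Parameter(full[:, cols].clone())
ref.grad = grad[:, cols].clone()
torch.optim.AdamW([ref], lr=lr, betas=(beta1, beta2), eps=eps, weight_decay=wd).step()

# Hand-rolled, column-selective AdamW step (first step, fresh state), applied in place.
p, g = full.clone(), grad[:, cols]
m = (1 - beta1) * g                            # exp_avg after one step
v = (1 - beta2) * g * g                        # exp_avg_sq after one step
m_hat, v_hat = m / (1 - beta1), v / (1 - beta2)
p[:, cols] = p[:, cols] * (1 - lr * wd) - lr * m_hat / (v_hat.sqrt() + eps)

assert torch.allclose(p[:, cols], ref.detach(), atol=1e-7)       # matches native AdamW
assert torch.equal(p[:, [0, 2, 5]], full[:, [0, 2, 5]])          # unselected columns untouched
```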
- Introduce a new tutorial for ZenFlow, detailing its configuration and usage in DeepSpeed.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Signed-off-by: Tingfeng Lan <[email protected]>
- Updated methods to accept communication_data_type as a parameter for better handling of IPG buckets.
- Removed debug print statements to clean up the code.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
- Move `_configure_zenflow` logic to a standalone `configure_zenflow()` function in `zenflow_utils.py`
- Refactor ZenFlow code placement to decouple it from ZeRO internals

Signed-off-by: Tingfeng Lan <[email protected]>
- Simplify the `_configure_zenflow` method by assigning it a lambda function that calls `configure_zenflow(self)`.
- Update the optimizer's selective learning rate synchronization to directly reference `self.optimizer._sync_selective_optimizer_lr()`.

Signed-off-by: Tingfeng Lan <[email protected]>
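
A self-contained toy sketch of the delegation pattern described in the commit above (toy classes only, not DeepSpeed's actual engine, config, or function signatures): the configuration logic lives in a standalone function, and the engine keeps only a thin lambda forwarder.

```python
class ToyZenFlowConfig:
    def __init__(self, select_strategy="auto", update_interval=4):
        self.select_strategy = select_strategy
        self.update_interval = update_interval

def configure_zenflow(engine):
    # Standalone configuration logic living outside the engine class
    # (mirroring the move into zenflow_utils.py described above).
    cfg = engine.zenflow_config
    engine.zenflow_select_strategy = cfg.select_strategy
    engine.zenflow_update_interval = cfg.update_interval

class ToyEngine:
    def __init__(self, zenflow_config):
        self.zenflow_config = zenflow_config
        # Thin forwarder, mirroring the lambda-based delegation in the commit.
        self._configure_zenflow = lambda: configure_zenflow(self)

engine = ToyEngine(ToyZenFlowConfig())
engine._configure_zenflow()
print(engine.zenflow_select_strategy, engine.zenflow_update_interval)
```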
tohtana and others added 4 commits June 30, 2025 13:29
- Fixed the invocation of `reduce_gradients` in ZenFlow + ZeRO Stage 1
- Corrected the reduction logic in `extra_large_grad_reduc` to handle gradient aggregation properly
- Fixed a bug where ZenFlow could not initialize if the user did not provide a dataset

Signed-off-by: Yusen Wu <[email protected]>
- Implemented single-GPU and distributed tests for ZenFlow with ZeRO Stage 1 and 2
- Covered various configurations of selective optimizer offloading, selection strategies (auto/step/epoch), update intervals, and warm-up rounds
- Ensured ZenFlow can initialize and train under different parameter combinations

Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Antlera commented Aug 10, 2025

@sfc-gh-truwase All copyright issues have been fixed.

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Guokai Ma <[email protected]>
@sfc-gh-truwase

@delock do you have additional concerns or can we merge this? Thanks

Antlera commented Aug 11, 2025

The cpu-torch-latest CI is failing because PyTorch has released version 2.8, while the workflow's pytest invocation still expects --torch_ver="2.7": `HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS --forked -n 4 unit/ --torch_ver="2.7"`.

Since the workflow installs the latest CPU build by default, it pulled 2.8.0+cpu, which caused the version check in tests/conftest.py to fail and exit.
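
For reference, a minimal sketch of the kind of guard that produces the _pytest.outcomes.Exit seen in the log below (assumed structure; not the actual tests/conftest.py):

```python
# conftest.py-style sketch; the option name mirrors the --torch_ver flag used by the workflow.
import pytest
import torch

def pytest_addoption(parser):
    parser.addoption("--torch_ver", default=None, type=str)

@pytest.fixture(scope="session", autouse=True)
def check_torch_version(request):
    expected = request.config.getoption("--torch_ver")
    if expected is None:
        return
    found = torch.__version__   # e.g. "2.8.0+cpu"
    if not found.startswith(expected):
        # Exits the whole test session, reported by pytest as _pytest.outcomes.Exit
        pytest.exit(f"expected torch version {expected} did not match found torch version {found}")
```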

=================================== FAILURES ===================================
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-3] ______
[gw3] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
___________________ TestNoSyncCtxt.test_zero_stage[0-dtype2] ___________________
[gw0] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-2] ______
[gw1] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu

@sfc-gh-truwase

@Antlera should be fixed by #7481

Antlera commented Aug 11, 2025

@sfc-gh-truwase Merged it to check against the new unit tests.

Antlera commented Aug 12, 2025

@sfc-gh-truwase Not sure what is happening with the new errors in the CI.

It says:

Run modal run -m ci.torch_latest
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ Token missing. Could not authenticate client. If you have token credentials, │
│ see modal.com/docs/reference/modal.config for setup help. If you are a new   │
│ user, register an account at modal.com, then run `modal token new`.          │
╰──────────────────────────────────────────────────────────────────────────────╯
Error: Process completed with exit code 1.

Antlera commented Aug 12, 2025

This might be related to #7289. Likely cause: the CI failures on forked PRs are due to Modal authentication.
`modal run -m ci.torch_latest` requires a token stored in the repository's secrets, but forked PRs cannot access those secrets, resulting in a "Token missing" error.

Antlera commented Aug 12, 2025

Merged to check the new CI. Maybe re-running it will solve the problem; I assume this will also bring the branch up to date.

@sfc-gh-truwase enabled auto-merge (squash) August 15, 2025 16:22
@sfc-gh-truwase merged commit 1d7b90a into deepspeedai:master Aug 15, 2025
10 of 12 checks passed
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
This PR adds a blog post and images for ZenFlow, introducing its design,
benefits, and usage. The blog explains how ZenFlow improves GPU
utilization by overlapping computation and communication during
offloaded training.

See also:
deepspeedai#7391 – core ZenFlow implementation.
deepspeedai/DeepSpeedExamples#982 – benchmarking and fine-tuning example.

---------

Signed-off-by: Tingfeng Lan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Signed-off-by: lym <[email protected]>
LYMDLUT pushed a commit to LYMDLUT/DeepSpeed that referenced this pull request Aug 20, 2025
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.

---------

Signed-off-by: Tingfeng Lan <[email protected]>
Signed-off-by: Yusen Wu <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Yusen Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Guokai Ma <[email protected]>
Signed-off-by: lym <[email protected]>