[WIP]Merge ulysses and ring-attention #5522
base: main
Conversation
# Conflicts: # swift/trainers/trainers.py
Summary of Changes
Hello @tastelikefeet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant refactoring and enhancement of the sequence parallelism framework. It unifies the previously separate Ulysses and Ring Attention implementations into a single, more flexible `SequenceParallel` class. The changes enable advanced distributed training strategies, including a new 'ring parallel' dimension and the integration of 'zigzag ring flash attention'. This consolidation simplifies the API for trainers and improves the overall efficiency and scalability of training large language models by optimizing how data and computations are distributed across devices.
Highlights
- Unified Sequence Parallelism: The separate `Ulysses` and `RingAttention` sequence parallel implementations have been unified into a single, more robust `SequenceParallel` class within `ulysses.py`. This streamlines the codebase and provides a single entry point for sequence parallelism configurations.
- Introduction of Ring Parallelism (RP): A new dimension of parallelism, 'Ring Parallelism' (RP), has been introduced, allowing for more complex and efficient distribution strategies, particularly for very long sequences. This is reflected in the updated device mesh creation and data handling methods.
- Advanced Attention Mechanism Integration: Specific implementations of 'Zigzag Ring Flash Attention' for both standard and variable-length sequences have been integrated, suggesting an advanced approach to attention computation in a distributed setting.
- Refactored Sequence Parallel API: The API for interacting with sequence parallelism has been simplified. Instead of multiple `init_sequence_parallel` and `prepare_model` calls, a single `sequence_parallel.prepare` method is now used, making setup more consistent across different training components.
- Comprehensive Trainer Adaptations: Various trainers (RLHF, SFT, DPO, GRPO) and core trainer mixins have been updated to seamlessly integrate with the new unified `SequenceParallel` class. This includes adaptations for data loading, loss computation, input preparation, and gradient accumulation, ensuring compatibility and leveraging the new parallelism features.
- Codebase Simplification: Redundant files like `swift/trainers/sequence_parallel/base.py` and `swift/trainers/sequence_parallel/ring_attention.py` have been removed, reducing code duplication and simplifying the project structure.
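To make the new 'ring parallel' dimension concrete, here is a minimal sketch of how a rank could map onto a (dp, ulysses, ring) device mesh when the sequence-parallel group is split into an Ulysses dimension and a ring dimension. The function name and the layout are illustrative assumptions, not the actual ms-swift API:

```python
# Hypothetical sketch: map a global rank to (dp, ulysses, ring) coordinates.
# The names and ordering are illustrative, not taken from the PR's code.

def mesh_coords(rank: int, world_size: int, ulysses_size: int, ring_size: int):
    """The sequence-parallel group size is ulysses_size * ring_size;
    the remaining ranks form the data-parallel dimension."""
    sp_size = ulysses_size * ring_size
    assert world_size % sp_size == 0, 'world size must be divisible by sp size'
    dp = rank // sp_size            # which sequence-parallel group this rank is in
    sp_rank = rank % sp_size        # position inside the sequence-parallel group
    ulysses = sp_rank // ring_size  # Ulysses (head-scatter) coordinate
    ring = sp_rank % ring_size      # ring-attention coordinate
    return dp, ulysses, ring
```

For example, with 8 GPUs, `ulysses_size=2`, and `ring_size=2`, ranks 0-3 form one sequence-parallel group and ranks 4-7 another, each internally split 2x2 between the Ulysses and ring dimensions.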
Code Review
This pull request introduces a significant refactoring by merging the `ulysses` and `ring-attention` sequence parallel implementations into a unified `SequenceParallel` class. This is a great improvement for maintainability. The changes also move sequence parallel-specific logic from a shared utility file into the respective trainer classes, which improves modularity.
I've found a couple of issues that need to be addressed:
- A critical bug in `swift/trainers/mixin.py` where `get_sp_dataloader` is called with incorrect arguments.
- A code duplication issue in `swift/trainers/utils.py` where a function is defined twice.
Please see my detailed comments below. Once these issues are resolved, this PR will be in much better shape.
```diff
         if eval_dataset is None and self.eval_dataset is None:
             raise ValueError('Trainer: evaluation requires an eval_dataset.')
         eval_dataset = eval_dataset if eval_dataset is not None else self.eval_dataset
-        dataloader = sequence_parallel.get_dataloader(self, eval_dataset, self.args.eval_batch_size)
+        dataloader = self.get_sp_dataloader(self, eval_dataset, self.args.eval_batch_size)
```
The call to `get_sp_dataloader` has incorrect arguments. The method's signature is `get_sp_dataloader(self, dataset, batch_size, skip_batches=0)`, but it is being called with `self` as the first explicit argument, which is then bound to `dataset`. This will lead to a runtime error. The correct call should not pass `self` explicitly.
```diff
-        dataloader = self.get_sp_dataloader(self, eval_dataset, self.args.eval_batch_size)
+        dataloader = self.get_sp_dataloader(eval_dataset, self.args.eval_batch_size)
```
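The argument-shifting pitfall behind this bug can be reproduced in isolation with a stub class (the class and data below are illustrative, not the actual trainer):

```python
# Minimal reproduction of the bug: calling a bound method with an explicit
# `self` shifts every argument one position to the right, so the trainer
# instance lands in `dataset` and the real dataset lands in `batch_size`.

class Trainer:
    def get_sp_dataloader(self, dataset, batch_size, skip_batches=0):
        # Return the received arguments so the shift is observable.
        return dataset, batch_size

t = Trainer()
data = ['a', 'b', 'c']

buggy = t.get_sp_dataloader(t, data, 8)  # dataset == t, batch_size == data
fixed = t.get_sp_dataloader(data, 8)     # dataset == data, batch_size == 8
```

With the buggy call, every downstream operation that expects a dataset receives the `Trainer` instance instead, which is why this fails at runtime rather than at call time.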
```python
import torch
import torch.nn.functional as F


def per_token_loss_func(outputs, labels, enable_dft_loss: bool = False, **kwargs):
    """Calculate the per-token cross-entropy loss.

    Args:
        outputs: The model outputs containing `logits`.
        labels: The labels.
        enable_dft_loss: Enable DFT loss.

    Returns:
        A tensor of per-token cross-entropy losses.
    """
    logits = outputs.logits
    # Upcast to float when computing the loss to avoid potential precision issues
    logits = logits.float()
    # Shift labels left by one so each logit predicts the next token
    labels = torch.roll(labels, shifts=-1, dims=-1).view(-1)

    # Flatten the tokens
    logits = logits.view(-1, logits.shape[-1])
    # Enable model parallelism
    labels = labels.to(logits.device)
    loss = F.cross_entropy(logits, labels, ignore_index=-100, reduction='none')
    if enable_dft_loss:
        with torch.no_grad():
            target_probs = torch.exp(-loss)
        loss *= target_probs
    return loss
```
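As an aside (not part of the PR's code), the DFT weighting in the branch above can be illustrated without torch: since the per-token loss is `-log p(target)`, `exp(-loss)` recovers `p(target)`, so the multiplication down-weights tokens the model already predicts confidently. The helper below is a hypothetical plain-Python sketch that takes pre-softmaxed probability distributions rather than logits:

```python
import math

def per_token_dft_loss(probs, targets):
    """Per-token cross entropy with DFT weighting, as in the snippet above:
    loss_i = -log p_i(target_i), then scaled by exp(-loss_i) == p_i(target_i).
    `probs` is a list of per-token probability distributions (illustrative,
    already softmaxed, unlike the raw logits the real function receives)."""
    losses = []
    for dist, t in zip(probs, targets):
        ce = -math.log(dist[t])            # standard cross entropy
        losses.append(ce * math.exp(-ce))  # DFT: weight by target probability
    return losses
```

For a uniform two-way distribution and target index 0, the cross entropy is `log 2` and the DFT-weighted loss is `0.5 * log 2`, showing how low-confidence tokens are attenuated less in absolute terms than they would be amplified under plain CE.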
PR type
PR information
Write the detailed information belonging to this PR.
Experiment results
Paste your experiment results here (if needed).