
Commit e833ed5

Merge branch 'main' into fix-pipeline-local-files
2 parents 228bd90 + bdee088 commit e833ed5


43 files changed: +770 -226 lines changed

docs/source/en/deepspeed.md

Lines changed: 102 additions & 0 deletions
@@ -368,6 +368,108 @@ The example ZeRO-3 and ZeRO-Infinity config below sets most of the parameter val
}
```

### Sequence Parallelism

DeepSpeed's ALST/Ulysses sequence parallelism enables training with very long sequences by splitting each sequence across multiple GPUs. This is particularly useful for training large language models on very long contexts.

Arctic Long Sequence Training (ALST) combines sharding inputs along the sequence dimension with attention head parallelism. With this approach, you can train Llama-8B with sequence lengths of up to 500K tokens on a single H100 GPU, 3.7M tokens on a single node, or 15M tokens on just four nodes. The implementation described here enables one component of the full ALST system. For additional optimizations like TiledMLP and activation checkpoint offloading, refer to the [DeepSpeed ALST tutorial](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/).

> [!TIP]
> For more detailed information about sequence parallelism, see the Accelerate [Sequence Parallelism](https://huggingface.co/docs/accelerate/concept_guides/sequence_parallelism) guide.

To enable ALST/Ulysses sequence parallelism with [`Trainer`], configure `parallelism_config` in [`TrainingArguments`]. Sequence parallelism is configured via Accelerate's `ParallelismConfig` and requires an Accelerate version higher than 1.12.0.

```py
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

# Example: 4 GPUs with sp_size=4, dp_replicate_size=1 (no data parallelism)
# Ensure total_size = dp_replicate_size * dp_shard_size * sp_size = 1 * 1 * 4 = 4 GPUs
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,  # Number of GPUs to split the sequence across
    dp_replicate_size=1,  # Explicit: no data parallelism
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="sdpa",
    ),
)

training_args = TrainingArguments(
    ...,
    deepspeed="path/to/deepspeed_config.json",
    parallelism_config=parallelism_config,
)
```

You can also configure sequence parallelism using an Accelerate config file.

```yaml
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: path/to/ds_config.json
machine_rank: 0
num_machines: 1
num_processes: 4  # Total number of processes
parallelism_config:
  parallelism_config_sp_size: 4  # Sequence parallel size
  parallelism_config_dp_replicate_size: 1  # Must satisfy: dp_replicate_size * dp_shard_size * sp_size = num_processes
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: sdpa
```

Important configuration parameters include the following.

* `sp_backend` must be set to `"deepspeed"` to use ALST/Ulysses sequence parallelism.
* `sp_size` is the degree of sequence parallelism. For example, `sp_size=4` means 4 GPUs process a single sequence in parallel. You need at least 2 GPUs to enable sequence parallelism. **Data feeding**: each rank receives a unique data stream from the DataLoader (as in DP). **Batch size calculation**: the effective `dp_world_size = world_size / sp_size`, so with 4 GPUs and `sp_size=4`, each of the 4 ranks gets different samples from the DataLoader, but `dp_world_size=1` is used for total batch size calculations (see the sketch after this list).
* `sp_seq_length_is_variable` determines how sequence lengths are handled. When set to `True` (recommended), the implementation adapts to varying sequence lengths between batches. When `False`, all sequences must be padded to a fixed length specified by `sp_seq_length`.
* `sp_attn_implementation` specifies the attention implementation to use. Supported values are `"sdpa"`, `"flash_attention_2"`, or `"flash_attention_3"`. Flash Attention is recommended for best performance, especially with multiple samples in a batch, because SDPA may incorrectly attend across sample boundaries.

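To make the batch size arithmetic concrete, here is a minimal sketch of the calculation described above. The variables are plain Python names used only for illustration, not `Trainer` attributes.

```py
# Batch size arithmetic under ALST/Ulysses sequence parallelism (illustration only).
world_size = 4                    # total number of GPUs
sp_size = 4                       # sequence parallel degree
per_device_train_batch_size = 1
gradient_accumulation_steps = 1

# Ranks in the same SP group cooperate on the same sequences, so only
# world_size / sp_size ranks count toward the data-parallel batch size.
dp_world_size = world_size // sp_size  # 4 // 4 = 1

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * dp_world_size
print(dp_world_size, global_batch_size)  # 1 1
```

With 8 GPUs and `sp_size=4`, the same arithmetic gives `dp_world_size=2`, which matches the 8-GPU example further below.
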
> [!WARNING]
> Sequence parallelism requires your model to use one of the supported attention implementations (`sdpa`, `flash_attention_2`, or `flash_attention_3`). The `eager` attention implementation is not supported because it doesn't properly handle `position_ids`.

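One way to satisfy this requirement is to request a supported implementation explicitly when loading the model. This is a minimal sketch; the checkpoint name is a placeholder, and the value should match what you configured in `sp_attn_implementation`.

```py
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; pick the implementation you configured in `sp_attn_implementation`.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="sdpa",
)
```
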
When using sequence parallelism, make sure your sequences are properly padded. Use `pad_to_multiple_of` in your data collator so that sequence lengths are divisible by `sp_size`. For example, with `sp_size=4`, set `pad_to_multiple_of=4` or higher.

```py
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=4,  # Ensure sequences are divisible by sp_size
)
```

When using `sp_size` with multiple GPUs, you **must** explicitly set `dp_replicate_size` or `dp_shard_size` so that `total_size = dp_replicate_size * dp_shard_size * sp_size` equals your total number of GPUs. For example, with 8 GPUs and `sp_size=4`, set `dp_replicate_size=2` (since 2 × 1 × 4 = 8):

```py
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_replicate_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
```

[`Trainer`] automatically handles the special requirements for sequence parallelism, including:

* Adapting the data loader via DeepSpeed's [`UlyssesSPDataLoaderAdapter`](https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/sequence_parallel/ulysses_sp.py) to shard sequences across GPUs. **Important**: unlike tensor parallelism (where all ranks must receive identical data), each rank with SP receives a unique data stream from the DataLoader (similar to DP). The adapter distributes sequence chunks across SP ranks internally, so your DataLoader should continue feeding different samples to each rank.
* Generating `position_ids` when not provided
* Creating `shift_labels` for causal language modeling (see the sketch after this list)
* Aggregating loss across sequence parallel ranks with proper masking for `-100` labels

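As a rough illustration of the `shift_labels` convention for causal language modeling (a toy sketch, not the actual [`Trainer`] internals):

```py
import torch

# Toy labels for one causal LM sample; -100 marks positions ignored by the loss.
labels = torch.tensor([[15, 27, 42, 8, -100, -100]])

# Each position is supervised by the *next* token, so the targets shift one step
# to the left and the final position (which has no next token) becomes -100.
shift_labels = torch.full_like(labels, -100)
shift_labels[:, :-1] = labels[:, 1:]

print(shift_labels)  # tensor([[  27,   42,    8, -100, -100, -100]])
```
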
You can launch training with sequence parallelism using the `accelerate launch` command.

```bash
accelerate launch --config_file alst_config.yaml your_training_script.py \
    --output_dir output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1
```

## Training features

DeepSpeed supports many training features that can be configured in the config file. This section describes some of the most important features.

src/transformers/core_model_loading.py

Lines changed: 5 additions & 2 deletions
@@ -359,7 +359,10 @@ def convert(self, layer_name: str, config=None, quantizer=None, missing_keys: Op
         return collected_tensors, misc


-GLOBAL_WORKERS = min(16, (os.cpu_count() or 8) * 2)  # NVMe: 8-16; HDD/NFS: 2-4
+# For I/O bound operations (i.e. here reading files), it is better to have fewer threads, e.g. 4 is a good default.
+# Having too many is actually harming performances quite a lot, i.e. using 16 can sometimes lead to taking TWICE
+# as much time to load the same model
+GLOBAL_WORKERS = min(4, os.cpu_count() or 4)


 def _materialize_copy(tensor, device=None, dtype=None):
@@ -610,7 +613,7 @@ def convert_and_load_state_dict_in_model(
     tp_plan = tp_plan or {}
     device_map = device_map or {"": "cpu"}
     device_map_regex = re.compile(
-        "|".join(rf"({k})" for k in sorted(device_map.keys(), key=lambda x: x.count("."), reverse=True))
+        "|".join(rf"({k})" for k in sorted(device_map.keys(), key=lambda x: (x.count("."), len(x)), reverse=True))
     )
     dtype_plan = dtype_plan or {}
     weight_mapping = weight_mapping or []
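
For context on the sort-key change above: the `device_map` keys are joined into a regex alternation, and Python's alternation matches left to right, so among keys at the same depth the longer (more specific) name must come first to win the match. A minimal sketch with a toy `device_map` (not the library's actual loading path):

```py
import re

# Toy device_map: "model.layers.1" and "model.layers.10" have the same depth (dot count),
# so the length tie-breaker is what puts the more specific key first in the alternation.
device_map = {"": "cpu", "model.layers.1": 0, "model.layers.10": 1}

device_map_regex = re.compile(
    "|".join(
        rf"({k})"
        for k in sorted(device_map, key=lambda x: (x.count("."), len(x)), reverse=True)
    )
)

# Without the length tie-breaker, "model.layers.1" could precede "model.layers.10"
# and match only a prefix of the parameter name below.
print(device_map_regex.match("model.layers.10.self_attn.q_proj.weight").group())  # model.layers.10
```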

src/transformers/masking_utils.py

Lines changed: 10 additions & 8 deletions
@@ -340,9 +340,6 @@ def sdpa_mask(
         allow_is_causal_skip (`bool`, optional):
             Whether to allow to return `None` for the mask under conditions where we can use the `is_causal` argument in
             `torch.sdpa` instead. Default to `True`.
-        allow_torch_fix (`bool`, optional):
-            Whether to update the mask in case a query is not attending to any tokens, to solve a bug in torch's older
-            versions. We need an arg to skip it when using eager. By default `True`.
         allow_is_bidirectional_skip (`bool`, optional):
             Whether to allow to return `None` for the mask under conditions where we do not have to add any bias,
             i.e. full attention without any padding. Default to `False`.
@@ -480,6 +477,7 @@ def eager_mask(
     mask_function: Callable = causal_mask_function,
     attention_mask: Optional[torch.Tensor] = None,
     dtype: torch.dtype = torch.float32,
+    allow_is_bidirectional_skip: bool = False,
     use_vmap: bool = False,
     **kwargs,
 ) -> torch.Tensor:
@@ -503,13 +501,15 @@ def eager_mask(
             The 2D attention mask corresponding to padded tokens of shape (batch_size, number_of_seen_tokens+q_length)
         dtype (`torch.dtype`, optional):
             The dtype to use for the mask. By default, `torch.float32`.
+        allow_is_bidirectional_skip (`bool`, optional):
+            Whether to allow to return `None` for the mask under conditions where we do not have to add any bias,
+            i.e. full attention without any padding. Default to `False`.
         use_vmap (`bool`, optional):
             Whether to use `vmap` during the mask construction or not. Allows powerful custom patterns that may not be
             index-based (for the cost of speed performance). By default `False`.
     """
     # The masks for eager attention are simply boolean mask from sdpa, casted to 0 and -inf
     _ = kwargs.pop("allow_is_causal_skip", None)
-    _ = kwargs.pop("allow_is_bidirectional_skip", None)
     _ = kwargs.pop("allow_torch_fix", None)
     mask = sdpa_mask(
         batch_size=batch_size,
@@ -519,14 +519,16 @@ def eager_mask(
         mask_function=mask_function,
         attention_mask=attention_mask,
         allow_is_causal_skip=False,
-        allow_is_bidirectional_skip=False,
+        allow_is_bidirectional_skip=allow_is_bidirectional_skip,
         allow_torch_fix=False,
         use_vmap=use_vmap,
         **kwargs,
     )
-    min_dtype = torch.finfo(dtype).min
-    # we need 0s where the tokens should be taken into account, and -inf otherwise (mask is already of boolean type)
-    mask = torch.where(mask, torch.tensor(0.0, device=mask.device, dtype=dtype), min_dtype)
+    # only bidirectional masks can be skipped, otherwise we convert bool -> float
+    if mask is not None:
+        min_dtype = torch.finfo(dtype).min
+        # we need 0s where the tokens should be taken into account, and -inf otherwise (mask is already of boolean type)
+        mask = torch.where(mask, torch.tensor(0.0, device=mask.device, dtype=dtype), min_dtype)
     return mask

src/transformers/modeling_flash_attention_utils.py

Lines changed: 8 additions & 2 deletions
@@ -24,6 +24,7 @@
     is_flash_attn_3_available,
     is_flash_attn_greater_or_equal_2_10,
     is_torch_npu_available,
+    is_torch_xpu_available,
     logging,
 )

@@ -45,7 +46,12 @@ def flash_attn_supports_top_left_mask():

 # TODO Deprecate when all models have the attention interface
 def is_flash_attn_available():
-    return is_flash_attn_3_available() or is_flash_attn_2_available() or is_torch_npu_available()
+    return (
+        is_flash_attn_3_available()
+        or is_flash_attn_2_available()
+        or is_torch_npu_available()
+        or is_torch_xpu_available()
+    )


 # `globals()` is not compatible with dynamo, hence we have do define them in global scope ourselves
@@ -97,7 +103,7 @@ def _lazy_imports(implementation: Optional[str]):
     if flash_attn_varlen_func is None or flash_attn_func is None:
         raise ValueError(
             f"Could not find the currently requested flash attention implementation at `{implementation}`."
-            f"Make sure that you request a valid kernel from the hub, e.g. `kernels-community/flash-attn`."
+            f"Make sure that you request a valid kernel from the hub, e.g. `kernels-community/flash-attn2`."
         )

     return flash_attn_func, flash_attn_varlen_func, pad_input, unpad_input

0 commit comments
