### Sequence Parallelism

DeepSpeed's ALST/Ulysses sequence parallelism enables training with very long sequences by splitting the sequence across multiple GPUs. This is particularly useful for training large language models with very long sequence lengths.

Arctic Long Sequence Training (ALST) uses a combination of sharding inputs along the sequence dimension and attention head parallelism. With this approach, you can train models with sequence lengths up to 500K tokens on a single H100 GPU, 3.7M on a single node, or 15M tokens on just four nodes with Llama-8B. The implementation described here enables one component of the full ALST system. For additional optimizations like TiledMLP and activation checkpoint offloading, refer to the [DeepSpeed ALST tutorial](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/).

> [!TIP]
> For more detailed information about sequence parallelism, see the Accelerate [Sequence Parallelism](https://huggingface.co/docs/accelerate/concept_guides/sequence_parallelism) guide.

To enable ALST/Ulysses sequence parallelism with [`Trainer`], configure `parallelism_config` in [`TrainingArguments`]. Sequence parallelism is configured via Accelerate's `ParallelismConfig` and requires an Accelerate version higher than 1.12.0.
```py
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

# Example: 4 GPUs with sp_size=4, dp_replicate_size=1 (no data parallelism)
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_replicate_size=1,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
```
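The resulting `parallelism_config` is then passed to [`TrainingArguments`]. Below is a minimal sketch of the wiring; the checkpoint name, the `ds_config.json` path, and `train_dataset` are placeholders for your own setup.

```py
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder checkpoint; any causal LM with a supported attention implementation works
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
)

training_args = TrainingArguments(
    output_dir="alst-sp-output",
    per_device_train_batch_size=1,
    deepspeed="ds_config.json",             # your existing ZeRO config (placeholder path)
    parallelism_config=parallelism_config,  # the ParallelismConfig created above
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,    # assumed to be tokenized elsewhere
    data_collator=data_collator,    # e.g. the padded collator shown later in this section
)
trainer.train()
```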
Important configuration parameters include the following.

* `sp_backend` must be set to `"deepspeed"` to use ALST/Ulysses sequence parallelism.
* `sp_size` is the degree of sequence parallelism. For example, `sp_size=4` means 4 GPUs process a single sequence in parallel. You need at least 2 GPUs to enable sequence parallelism. **Data feeding**: each rank receives a unique data stream from the DataLoader (like DP). **Batch size calculation**: the effective `dp_world_size = world_size / sp_size`. So with 4 GPUs and `sp_size=4`, each of the 4 ranks gets different samples from the DataLoader, but `dp_world_size=1` for total batch size calculations (see the short calculation after this list).
* `sp_seq_length_is_variable` determines how sequence lengths are handled. When set to `True` (recommended), the implementation adapts to varying sequence lengths between batches. When `False`, all sequences must be padded to a fixed length specified by `sp_seq_length`.
* `sp_attn_implementation` specifies the attention implementation to use. Supported values are `"sdpa"`, `"flash_attention_2"`, or `"flash_attention_3"`. Flash Attention is recommended for best performance, especially with multiple samples in a batch, because SDPA may incorrectly attend across sample boundaries.

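
The batch size arithmetic is worth spelling out. Here is a short illustrative calculation with hypothetical values (8 GPUs and `sp_size=4`), not a Trainer API call:

```py
# Illustrative arithmetic only (hypothetical values)
world_size = 8                      # total number of GPUs
sp_size = 4                         # sequence parallel degree
per_device_train_batch_size = 2
gradient_accumulation_steps = 1

dp_world_size = world_size // sp_size  # 8 // 4 = 2 data-parallel ranks
global_batch_size = dp_world_size * per_device_train_batch_size * gradient_accumulation_steps
print(dp_world_size, global_batch_size)  # 2 4
```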
> [!WARNING]
> Sequence parallelism requires your model to use one of the supported attention implementations (`sdpa`, `flash_attention_2`, or `flash_attention_3`). The `eager` attention implementation is not supported because it doesn't properly handle `position_ids`.

When using sequence parallelism, ensure your sequences are properly padded. Use `pad_to_multiple_of` in your data collator to ensure sequences are divisible by `sp_size`. For example, with `sp_size=4`, set `pad_to_multiple_of=4` or higher.

```py
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=4,  # Ensure sequences are divisible by sp_size
)
```

When using `sp_size` with multiple GPUs, you **must** explicitly set `dp_replicate_size` or `dp_shard_size` to ensure `total_size = dp_replicate_size * dp_shard_size * sp_size` equals your total number of GPUs. For example, with 8 GPUs and `sp_size=4`, you must set `dp_replicate_size=2` (since 2 × 1 × 4 = 8):
```py
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_replicate_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
```
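
If you want to catch a mismatch early, a small sanity check like the following can be run at the top of your training script. This is a hypothetical sketch that relies on the `WORLD_SIZE` environment variable set by the launcher:

```py
import os

# Hypothetical pre-flight check: the parallel degrees must multiply to the launch size.
dp_replicate_size, dp_shard_size, sp_size = 2, 1, 4
world_size = int(os.environ["WORLD_SIZE"])  # set by torchrun/accelerate launch
assert dp_replicate_size * dp_shard_size * sp_size == world_size, (
    "dp_replicate_size * dp_shard_size * sp_size must equal the number of GPUs"
)
```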
[`Trainer`] automatically handles the special requirements for sequence parallelism including:

* Adapting the data loader via DeepSpeed's [`UlyssesSPDataLoaderAdapter`](https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/sequence_parallel/ulysses_sp.py) to shard sequences across GPUs. **Important**: Unlike Tensor Parallelism (where all ranks must receive identical data), each rank with SP receives a unique data stream from the DataLoader (similar to DP). The adapter handles distributing sequence chunks across SP ranks internally, so your DataLoader should continue feeding different samples to each rank.
* Generating `position_ids` when not provided
* Creating `shift_labels` for causal language modeling
* Aggregating loss across sequence parallel ranks with proper masking for `-100` labels

You can launch training with sequence parallelism using the `accelerate launch` command.