### Sequence Parallelism

DeepSpeed's ALST/Ulysses sequence parallelism enables training with very long sequences by splitting the sequence across multiple GPUs. This is particularly useful for training large language models with very long sequence lengths.

Arctic Long Sequence Training (ALST) uses a combination of sharding inputs along the sequence dimension and attention head parallelism. With this approach, you can train models with sequence lengths up to 500K tokens on a single H100 GPU, 3.7M on a single node, or 15M tokens on just four nodes with Llama-8B. The implementation described here enables one component of the full ALST system. For additional optimizations like TiledMLP and activation checkpoint offloading, refer to the [DeepSpeed ALST tutorial](https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/).

> [!TIP]
> For more detailed information about sequence parallelism, see the Accelerate [Sequence Parallelism](https://huggingface.co/docs/accelerate/concept_guides/sequence_parallelism) guide.

To enable ALST/Ulysses sequence parallelism with [`Trainer`], configure `parallelism_config` in [`TrainingArguments`]. Sequence parallelism is configured via Accelerate's `ParallelismConfig` and requires an Accelerate version higher than 1.12.0.
```py
from accelerate.utils import ParallelismConfig, DeepSpeedSequenceParallelConfig

# Example: 4 GPUs with sp_size=4, dp_replicate_size=1 (no data parallelism)
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_replicate_size=1,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
```
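The resulting `parallelism_config` is then passed to [`TrainingArguments`]. Below is a minimal sketch of the wiring; the checkpoint name, the `ds_config.json` path, and `train_dataset` are placeholders for your own setup.

```py
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder checkpoint; any causal LM with a supported attention implementation works
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
)

training_args = TrainingArguments(
    output_dir="alst-sp-output",
    per_device_train_batch_size=1,
    deepspeed="ds_config.json",             # your existing ZeRO config (placeholder path)
    parallelism_config=parallelism_config,  # the ParallelismConfig created above
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,    # assumed to be tokenized elsewhere
    data_collator=data_collator,    # e.g. the padded collator shown later in this section
)
trainer.train()
```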
Important configuration parameters include the following.

* `sp_backend` must be set to `"deepspeed"` to use ALST/Ulysses sequence parallelism.
* `sp_size` is the degree of sequence parallelism. For example, `sp_size=4` means 4 GPUs process a single sequence in parallel. You need at least 2 GPUs to enable sequence parallelism. **Data feeding**: each rank receives a unique data stream from the DataLoader (like DP). **Batch size calculation**: the effective `dp_world_size = world_size / sp_size`. So with 4 GPUs and `sp_size=4`, each of the 4 ranks gets different samples from the DataLoader, but `dp_world_size=1` for total batch size calculations (see the short calculation after this list).
* `sp_seq_length_is_variable` determines how sequence lengths are handled. When set to `True` (recommended), the implementation adapts to varying sequence lengths between batches. When `False`, all sequences must be padded to a fixed length specified by `sp_seq_length`.
* `sp_attn_implementation` specifies the attention implementation to use. Supported values are `"sdpa"`, `"flash_attention_2"`, or `"flash_attention_3"`. Flash Attention is recommended for best performance, especially with multiple samples in a batch, because SDPA may incorrectly attend across sample boundaries.

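
The batch size arithmetic is worth spelling out. Here is a short illustrative calculation with hypothetical values (8 GPUs and `sp_size=4`), not a Trainer API call:

```py
# Illustrative arithmetic only (hypothetical values)
world_size = 8                      # total number of GPUs
sp_size = 4                         # sequence parallel degree
per_device_train_batch_size = 2
gradient_accumulation_steps = 1

dp_world_size = world_size // sp_size  # 8 // 4 = 2 data-parallel ranks
global_batch_size = dp_world_size * per_device_train_batch_size * gradient_accumulation_steps
print(dp_world_size, global_batch_size)  # 2 4
```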
> [!WARNING]
> Sequence parallelism requires your model to use one of the supported attention implementations (`sdpa`, `flash_attention_2`, or `flash_attention_3`). The `eager` attention implementation is not supported because it doesn't properly handle `position_ids`.

When using sequence parallelism, ensure your sequences are properly padded. Use `pad_to_multiple_of` in your data collator to ensure sequences are divisible by `sp_size`. For example, with `sp_size=4`, set `pad_to_multiple_of=4` or higher.

```py
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=4,  # Ensure sequences are divisible by sp_size
)
```

When using `sp_size` with multiple GPUs, you **must** explicitly set `dp_replicate_size` or `dp_shard_size` to ensure `total_size = dp_replicate_size * dp_shard_size * sp_size` equals your total number of GPUs. For example, with 8 GPUs and `sp_size=4`, you must set `dp_replicate_size=2` (since 2 × 1 × 4 = 8):
```py
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=4,
    dp_replicate_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
```
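
If you want to catch a mismatch early, a small sanity check like the following can be run at the top of your training script. This is a hypothetical sketch that relies on the `WORLD_SIZE` environment variable set by the launcher:

```py
import os

# Hypothetical pre-flight check: the parallel degrees must multiply to the launch size.
dp_replicate_size, dp_shard_size, sp_size = 2, 1, 4
world_size = int(os.environ["WORLD_SIZE"])  # set by torchrun/accelerate launch
assert dp_replicate_size * dp_shard_size * sp_size == world_size, (
    "dp_replicate_size * dp_shard_size * sp_size must equal the number of GPUs"
)
```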
[`Trainer`] automatically handles the special requirements for sequence parallelism including:

* Adapting the data loader via DeepSpeed's [`UlyssesSPDataLoaderAdapter`](https://github.com/deepspeedai/DeepSpeed/blob/master/deepspeed/runtime/sequence_parallel/ulysses_sp.py) to shard sequences across GPUs. **Important**: Unlike Tensor Parallelism (where all ranks must receive identical data), each rank with SP receives a unique data stream from the DataLoader (similar to DP). The adapter handles distributing sequence chunks across SP ranks internally, so your DataLoader should continue feeding different samples to each rank.
* Generating `position_ids` when not provided
* Creating `shift_labels` for causal language modeling
* Aggregating loss across sequence parallel ranks with proper masking for `-100` labels

You can launch training with sequence parallelism using the `accelerate launch` command.