# Distributing Training > [!WARNING] > Section under construction. Feel free to contribute! ## Multi-GPU Training with TRL The trainers in TRL use [🤗 Accelerate](https://github.com/huggingface/accelerate) to enable distributed training across multiple GPUs or nodes. To do so, first create an [🤗 Accelerate](https://github.com/huggingface/accelerate) config file by running ```bash accelerate config ``` and answering the questions according to your multi-GPU / multi-node setup. You can then launch distributed training by running: ```bash accelerate launch train.py ``` We also provide config files in the [examples folder](https://github.com/huggingface/trl/tree/main/examples/accelerate_configs) that can be used as templates. To use these templates, simply pass the path to the config file when launching a job, e.g.: ```shell accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml train.py ``` This automatically distributes the workload across all available GPUs. Under the hood, [🤗 Accelerate](https://github.com/huggingface/accelerate) creates one model per GPU. Each process: - Processes its own batch of data - Computes the loss and gradients for that batch - Shares gradient updates across all GPUs ![multi gpu](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/multi_gpu.png) The effective batch size is calculated as: $$ \text{Batch Size} = \text{per\_device\_train\_batch\_size} \times \text{num\_devices} \times \text{gradient\_accumulation\_steps} $$ To maintain a consistent batch size when scaling to multiple GPUs, make sure to update `per_device_train_batch_size` and `gradient_accumulation_steps` accordingly. Example, these configurations are equivalent, and should yield the same results: | Number of GPUs | Per device batch size | Gradient accumulation steps | Comments | | --- | --- | --- | --- | | 1 | 32 | 1 | Possibly high memory usage, but faster training | | 1 | 4 | 8 | Lower memory usage, slower training | | 8 | 4 | 1 | Multi-GPU to get the best of both worlds | > [!TIP] > Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details. ## Sequence Parallelism for Long Context Training Sequence Parallelism (also called Context Parallelism) is a parallelization technique that enables training with longer sequences by splitting the sequence dimension across multiple GPUs. Each GPU processes a portion of the sequence, allowing you to train with sequences longer than what would fit on a single GPU's memory. > [!NOTE] > **Terminology clarification:** This section describes parallelism techniques for splitting sequences to enable longer context training: > - **Context Parallelism (CP)**: Splits sequences across GPUs (implemented as Ring Attention with FSDP2) > - **Sequence Parallelism (SP)**: Another form of sequence splitting (implemented as ALST/Ulysses with DeepSpeed) > > Both CP and SP are different from traditional Sequence Parallelism used with Tensor Parallelism (TP+SP) to reduce activation memory. With the techniques here, parallelism dimensions multiply: `TP=2` and `CP=2` would require 4 GPUs (2×2), whereas traditional `TP+SP=2` only needs 2 GPUs as they share the same ranks. > > In Accelerate's `ParallelismConfig`: > - Use `cp_size` with `cp_backend="torch"` for Ring Attention (FSDP2) > - Use `sp_size` with `sp_backend="deepspeed"` for ALST/Ulysses (DeepSpeed) Sequence parallelism is particularly useful when: - You want to train with very long sequences (>32k tokens) - Single GPU memory is insufficient for your desired sequence length - You need to maintain sequence coherence across the full context ### Available Implementations TRL supports two sequence parallelism implementations, each with different characteristics: 1. **Ring Attention (FSDP2)** - Uses ring-based communication for memory-efficient processing of extremely long sequences 2. **ALST/Ulysses (DeepSpeed)** - Uses attention head parallelism for faster training with high-bandwidth interconnects > [!IMPORTANT] > **Sequence Length Terminology:** When using Context Parallelism, the sequence is split across GPUs, introducing two concepts: > - **Global sequence length**: The full sequence length before splitting across GPUs > - **Micro sequence length**: The sequence length per GPU after splitting > > In TRL, `max_seq_length` (or `max_length`) refers to the **global sequence length**. The framework automatically handles splitting into micro sequences: > - **Ring Attention (FSDP2)**: Uses `cp_size` to split sequences. With `max_seq_length=8192` and `cp_size=4`, each GPU processes 2048 tokens. > - **ALST/Ulysses (DeepSpeed)**: Uses `sp_size` (with `sp_backend="deepspeed"`) to split sequences. With `max_seq_length=8192` and `sp_size=2`, each GPU processes 4096 tokens. > > The Trainer automatically accounts for context parallelism when calculating batch sizes and training metrics. ### Choosing Between Ring Attention and Ulysses The comparison table below highlights the key differences between the two approaches: | Feature | Ring Attention (FSDP2) | ALST/Ulysses (DeepSpeed) | |---------|----------|-------------------------| | **Method** | Ring Self-Attention | Attention Head Parallelism | | **Backend** | PyTorch FSDP2 | DeepSpeed ZeRO | | **Attention** | SDPA only | Flash Attention 2 or SDPA | | **Minimum Accelerate** | 1.11.0+ | 1.12.0+ | | **Minimum DeepSpeed** | N/A | 0.18.1+ | | **Sequence Divisibility** | `cp_size * 2` | `sp_size` | | **Zero Stage** | N/A | ZeRO Stage 1/2/3 | **Ring Attention is better when:** - You need to handle extremely long sequences (1M+ tokens) - The model has limited attention heads (Ring Attention is not constrained by head count) - You want flexibility in scaling to any sequence length - Network topology is limited (Ring Attention works with simple P2P ring communication) **Ulysses is better when:** - You have high-bandwidth, low-latency interconnects (NVLink, InfiniBand) - The model has many attention heads that can be split across GPUs - You want lower communication volume - You want faster training speed for moderate sequence lengths (up to ~500k tokens) **Key Trade-offs:** - **Communication Volume:** Ulysses has lower communication volume, making it more efficient with good interconnects. Ring Attention has higher communication volume but is more flexible with different network topologies. - **Attention Head Constraints:** Ulysses is limited by the number of attention heads (requires `num_heads >= sp_size`). Ring Attention scales with sequence length regardless of model architecture. - **Network Sensitivity:** Ulysses all-to-all communication is sensitive to network latency. Ring Attention uses P2P ring communication which is more tolerant of varying network conditions. For a detailed comparison, see the [Ulysses and Ring Attention blog post](https://huggingface.co/blog/exploding-gradients/ulysses-ring-attention). ### Ring Attention Implementation (FSDP2) Ring Attention uses a ring-like communication pattern where each GPU processes a portion of the sequence and passes information to the next GPU in the ring. #### Requirements and Limitations 1. **Accelerate 1.11.0 or higher** is required for Ring Attention / Context Parallelism support 2. **FSDP2 (PyTorch FSDP v2)** is required as the distributed training backend 3. **SDPA attention** - Flash Attention is currently not supported 4. **Sequence length divisibility** - sequences must be divisible by `cp_size * 2`. This is automatically handled using the `pad_to_multiple_of` parameter in the data collator. #### Configuration ##### Accelerate Configuration Use one of the provided accelerate config files (e.g. [`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml) for 2 GPUs): ```yaml compute_environment: LOCAL_MACHINE debug: false distributed_type: FSDP downcast_bf16: 'no' enable_cpu_affinity: false fsdp_config: fsdp_activation_checkpointing: true # Enable activation checkpointing for memory efficiency fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP fsdp_cpu_ram_efficient_loading: true fsdp_offload_params: false fsdp_reshard_after_forward: true fsdp_state_dict_type: FULL_STATE_DICT fsdp_version: 2 machine_rank: 0 main_training_function: main mixed_precision: bf16 num_machines: 1 num_processes: 2 # Number of GPUs rdzv_backend: static same_network: true tpu_env: [] tpu_use_cluster: false tpu_use_sudo: false use_cpu: false parallelism_config: parallelism_config_dp_replicate_size: 1 parallelism_config_dp_shard_size: 1 parallelism_config_tp_size: 1 parallelism_config_cp_size: 2 # Context parallel size ``` ##### Training Configuration ```python from trl import SFTConfig training_args = SFTConfig( # required pad_to_multiple_of=4, # ensures divisibility by cp_size * 2 # to get the most out of CP max_length=16384, # long sequence length packing=True, # use packing to reduce padding use_liger_kernel=True, # compatible with CP gradient_checkpointing=False, # The activation_checkpointing in FSDP config and the gradient_checkpointing in training arg can't be set to True simultaneously per_device_train_batch_size=1, ... ) ``` Then, launch your training script with the appropriate accelerate config file: ```bash accelerate launch --config_file context_parallel_2gpu.yaml train.py ``` #### Best Practices 1. **Use the `pad_to_multiple_of` parameter** - This is now the recommended way to ensure sequence length divisibility: - For `cp_size=2`: use `pad_to_multiple_of=4` (since `cp_size * 2 = 4`) - For `cp_size=4`: use `pad_to_multiple_of=8` (since `cp_size * 2 = 8`) - The data collator automatically pads sequences to the required multiple, ensuring compatibility with CP 2. **Use packing with padding** - The default BFD (Best Fit Decreasing) strategy works perfectly: - Preserves sequence boundaries and maintains training quality - Works seamlessly with both `padding_free=True` and standard padding modes 3. **Combine with other memory optimizations** like Liger kernels, bfloat16, and gradient checkpointing 4. **Start with smaller context parallel sizes** (2-4 GPUs) before scaling up 5. **Monitor memory usage** across all GPUs to ensure balanced workload #### Benchmarking Ring Attention We benchmarked Ring Attention to highlight its potential improvements in training efficiency. Our experiments were conducted using **1, 2, 4, and 8 H100 GPUs**, though the results can be extended to larger clusters with more nodes and GPUs. For the setup, we fine-tuned an **8B model** ([Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)) using the provided accelerate configuration ([`context_parallel_2gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/context_parallel_2gpu.yaml)). We adjusted `num_processes` and `parallelism_config_cp_size` based on the number of GPUs for each run. Training was performed with the [sft.py](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) example script, combined with the parameters described above. The results below summarize the **maximum trainable sequence length** and **iterations per second** for different numbers of GPUs. A value marked as `OOM` indicates that the configuration ran out of memory and could not be trained. These results show that **Context Parallelism (CP) scales effectively with more GPUs**, enabling training on much longer sequences. With **8 GPUs**, context lengths of over **300k tokens** become feasible, unlocking training with extremely long contexts while maintaining reasonable throughput.
CP Max content length CP seconds/iteration
> [!TIP] > Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs. > > You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism). ### ALST/Ulysses Implementation (DeepSpeed) ALST (Arctic Long Sequence Training) / Ulysses uses attention head parallelism to split long sequences across GPUs, working with DeepSpeed's ZeRO optimizer. > [!NOTE] > **Technical Note on Parallelism Configuration:** > - **DeepSpeed ALST/Ulysses** uses `sp_size` with `sp_backend="deepspeed"` in both YAML and Python API > - **Ring Attention (FSDP2)** uses `cp_size` with `cp_backend="torch"` > > The Trainer automatically accounts for both CP and SP when calculating effective batch sizes and training metrics. #### Requirements and Limitations 1. **DeepSpeed 0.18.1 or higher** is required 2. **Accelerate 1.12.0 or higher** is required for ALST/Ulysses sequence parallelism support 3. **Attention implementation** - Flash Attention 2 recommended (clean output), SDPA works as fallback 4. **Sequence length divisibility** - sequences must be divisible by `sp_size`. Use `pad_to_multiple_of` in your training config. 5. **Parallelism configuration** - You must ensure `dp_replicate_size × dp_shard_size × sp_size = num_processes` #### Configuration ##### Accelerate Configuration Use the provided accelerate config file ([`alst_ulysses_4gpu.yaml`](https://github.com/huggingface/trl/blob/main/examples/accelerate_configs/alst_ulysses_4gpu.yaml)): ```yaml compute_environment: LOCAL_MACHINE debug: false deepspeed_config: zero_stage: 3 seq_parallel_communication_data_type: bf16 distributed_type: DEEPSPEED mixed_precision: bf16 num_machines: 1 num_processes: 4 # Number of GPUs parallelism_config: parallelism_config_dp_replicate_size: 1 parallelism_config_dp_shard_size: 2 # Enables 2D parallelism with SP parallelism_config_tp_size: 1 parallelism_config_sp_size: 2 # Sequence parallel size parallelism_config_sp_backend: deepspeed parallelism_config_sp_seq_length_is_variable: true parallelism_config_sp_attn_implementation: flash_attention_2 ``` ##### Training Configuration ```python from trl import SFTConfig training_args = SFTConfig( # required pad_to_multiple_of=2, # Must equal sp_size # to get the most out of SP max_seq_length=4096, packing=True, attn_implementation="flash_attention_2", per_device_train_batch_size=1, ... ) ``` Then, launch your training script with the appropriate accelerate config file: ```bash accelerate launch --config_file examples/accelerate_configs/alst_ulysses_4gpu.yaml train.py ``` #### 2D Parallelism The 4 GPU configuration above automatically enables 2D parallelism by combining Data Parallelism (DP) with Sequence Parallelism (SP). With `sp_size=2` and `dp_shard_size=2`, the 4 GPUs are organized as: - 2 sequence parallel groups (processing the same data split across sequences) - 2 data parallel groups (processing different data) To adjust the parallelism for different GPU counts, modify the YAML config: | GPUs | sp_size | dp_shard_size | Use Case | YAML Changes | |------|---------|---------------|----------|--------------| | 4 | 2 | 2 | Balanced - longer sequences + more data | `num_processes: 4`, `sp_size: 2`, `dp_shard_size: 2` | | 4 | 4 | 1 | Pure SP for maximum sequence length | `num_processes: 4`, `sp_size: 4`, `dp_shard_size: 1` | | 8 | 2 | 4 | Large-scale training | `num_processes: 8`, `sp_size: 2`, `dp_shard_size: 4` | #### Best Practices 1. **Use `pad_to_multiple_of`** to ensure sequences are divisible by `sp_size` 2. **Use Flash Attention 2** for clean output (SDPA works but shows packing warnings) 3. **Start with `sp_size=2`** before scaling to larger values 4. **Use DeepSpeed ZeRO Stage 3** for large models 5. **Combine with memory optimizations** like Liger kernels and gradient checkpointing 6. **Validate parallelism config**: Ensure `dp_replicate_size × dp_shard_size × sp_size = num_processes` #### Complete Example Here's how to run ALST/Ulysses training using the built-in [`sft.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py) script with 4 GPUs: ```bash accelerate launch --config_file examples/accelerate_configs/alst_ulysses_4gpu.yaml \ trl/scripts/sft.py \ --model_name_or_path Qwen/Qwen2-0.5B \ --dataset_name trl-lib/Capybara \ --learning_rate 2e-4 \ --max_steps 100 \ --max_seq_length 4096 \ --packing \ --packing_strategy wrapped \ --torch_dtype bfloat16 \ --attn_implementation flash_attention_2 \ --output_dir output-alst-4gpu \ --logging_steps 10 \ --report_to trackio ``` This command automatically: - Configures 2D parallelism (SP=2, DP=2) across 4 GPUs - Uses Flash Attention 2 for clean training - Enables packing with automatic padding to ensure sequence divisibility - Leverages DeepSpeed ZeRO Stage 3 for memory efficiency ### Further Reading #### General Resources - [Hugging Face Blog: Understanding Ulysses and Ring Attention](https://huggingface.co/blog/exploding-gradients/ulysses-ring-attention) - Detailed comparison of Ring Attention vs Ulysses approaches - [Accelerate: Context Parallelism Guide](https://huggingface.co/docs/accelerate/concept_guides/context_parallelism) - [Hugging Face Blog: Enabling Long-Context Training with Sequence Parallelism in Axolotl](https://huggingface.co/blog/axolotl-ai-co/long-context-with-sequence-parallelism-in-axolotl) #### Ring Attention (FSDP2) - [Ultrascale Playbook - Context Parallelism](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=context_parallelism) - [Accelerate Example: 128k Sequence Length](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#context-parallelism-128k-sequence-length) - [Accelerate ND-parallelism Guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism) #### ALST/Ulysses (DeepSpeed) - [DeepSpeed Sequence Parallelism Documentation](https://www.deepspeed.ai/tutorials/ds-sequence/) - [Snowflake Engineering Blog: Arctic Long Sequence Training (ALST)](https://www.snowflake.com/en/engineering-blog/arctic-long-sequence-training-multi-million-token-ai/) ## Multi-Node Training When a single machine doesn't have enough GPUs, TRL can scale training across multiple machines (nodes) using [🤗 Accelerate](https://huggingface.co/docs/accelerate/basic_tutorials/launch#multi-node-training). ### Accelerate Configuration Create an `accelerate` config file (e.g., `multi_node.yaml`) for multi-node training. Key fields: ```yaml compute_environment: LOCAL_MACHINE distributed_type: MULTI_GPU num_machines: 2 machine_rank: 0 # 0 for main node, 1 for second node main_process_ip: 10.0.0.1 # IP of rank 0 node main_process_port: 29500 num_processes: 16 # total processes across nodes mixed_precision: bf16 use_cpu: false same_network: true ``` Adjust `num_processes` to match the total number of GPUs across all nodes. > [!NOTE] > Replace `10.0.0.1` with the actual IP address of the rank 0 (main) node. ### Launching #### Option 1: Manual Launch (Non-HPC) Run the following on each node manually: ```bash # Node 0 (main node) accelerate launch --config_file multi_node.yaml --machine_rank 0 train.py # Node 1 accelerate launch --config_file multi_node.yaml --machine_rank 1 train.py ``` #### Option 2: SLURM Launch (HPC Clusters) For clusters using SLURM job scheduler, create a job script (e.g., `slurm_job.sh`): ```bash #!/bin/bash #SBATCH --nodes=2 #SBATCH --gpus-per-node=8 #SBATCH --job-name=trl_multi srun accelerate launch --config_file multi_node.yaml train.py ``` Then submit the job: ```bash sbatch slurm_job.sh ``` SLURM automatically distributes the training across all requested nodes and GPUs, and `srun` configures the necessary environment variables for multi-node communication. **Key SLURM directives:** - `--nodes=2`: Request 2 compute nodes - `--gpus-per-node=8`: Allocate 8 GPUs per node (16 total) - `--job-name`: Label for tracking in the job queue You can combine multi-node with DeepSpeed by setting `distributed_type: DEEPSPEED` and adding a `deepspeed_config` block. See the [DeepSpeed integration guide](https://huggingface.co/docs/trl/en/deepspeed_integration). ### Further Reading - [Accelerate: Launching Scripts](https://huggingface.co/docs/accelerate/basic_tutorials/launch) - [Accelerate: Example Zoo](https://huggingface.co/docs/accelerate/usage_guides/training_zoo) - [SLURM Workload Manager Documentation](https://slurm.schedmd.com/) - For cluster job scheduling