| .. _attention-implementation-override: |
|
|
| Attention Implementation Override |
| ================================== |
|
|
| Last updated: 10/31/2025. |
|
|
| By default, VERL's FSDP workers load models with ``flash_attention_2`` as the attention implementation for better performance. |
| You can override this setting to use a different attention implementation when your model, hardware, or debugging workflow requires it. |
|
|
| Supported Attention Implementations |
| ----------------------------------- |
|
|
| The following attention implementations are supported (subject to model and hardware compatibility): |
|
|
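| - ``flash_attention_2``: Fused, memory-efficient FlashAttention-2 kernels provided by the ``flash-attn`` package (default) |
| - ``eager``: The model's plain PyTorch attention, written with standard tensor operations |
| - ``sdpa``: PyTorch's native ``torch.nn.functional.scaled_dot_product_attention`` |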
|
|
| When to Override |
| ---------------- |
|
|
| You might want to override the attention implementation in the following scenarios: |
|
|
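| - **Debugging**: ``eager`` runs attention as ordinary PyTorch operations, so stack traces and intermediate tensors are easier to inspect |
| - **Compatibility**: Some models or hardware configurations may not support ``flash_attention_2`` |
| - **Memory constraints**: ``eager`` materializes the full attention score matrix, so its memory use grows quadratically with sequence length; ``flash_attention_2`` avoids storing that matrix |
| - **Performance tuning**: Benchmark the implementations on your own model and hardware to find the fastest one |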
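
To make the memory point concrete, the sketch below estimates the size of the attention score matrix that an ``eager`` implementation materializes for a single layer. The function name and the example shapes are illustrative assumptions, not VERL or model defaults; fused kernels such as FlashAttention-2 avoid storing this matrix in full.

```python
def eager_attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element=2):
    """Bytes needed for one layer's (batch, heads, seq, seq) attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


# Hypothetical shapes: batch size 1, 32 heads, bf16 (2 bytes per element).
for seq_len in (2048, 8192, 32768):
    gib = eager_attention_scores_bytes(1, 32, seq_len) / 2**30
    print(f"seq_len={seq_len:>6}: {gib:.2f} GiB per layer")
```

Because the cost scales with the square of the sequence length, quadrupling the sequence length multiplies this tensor's size by sixteen.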
|
|
| Configuration Examples |
| ----------------------- |
|
|
| PPO Training with Eager Attention |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| To override the attention implementation for the actor, rollout, and reference models: |
|
|
| .. code:: bash |
|
|
|     python3 ppo_trainer.py \ |
|         +actor_rollout_ref.model.override_config.attn_implementation=eager \ |
|         [other parameters...] |
|
|
| PPO Training with SDPA Attention |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| .. code:: bash |
|
|
|     python3 ppo_trainer.py \ |
|         +actor_rollout_ref.model.override_config.attn_implementation=sdpa \ |
|         [other parameters...] |
|
|
| Critic Model Override |
| ~~~~~~~~~~~~~~~~~~~~~ |
|
|
| For training configurations that include a critic model, you can also override its attention implementation: |
|
|
| .. code:: bash |
|
|
|     python3 ppo_trainer.py \ |
|         +actor_rollout_ref.model.override_config.attn_implementation=eager \ |
|         +critic.model.override_config.attn_implementation=eager \ |
|         [other parameters...] |
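
If you launch training from a Python wrapper script, the same overrides can be assembled programmatically. The helper below is a hypothetical convenience function, not part of VERL; it only builds the Hydra-style argument strings used in the commands above.

```python
def attn_override_args(attn_implementation, include_critic=False):
    """Return command-line overrides that select an attention implementation.

    Hypothetical helper: it only formats strings and does not validate
    that the chosen implementation is supported by the model or hardware.
    """
    key = "model.override_config.attn_implementation"
    args = [f"+actor_rollout_ref.{key}={attn_implementation}"]
    if include_critic:
        args.append(f"+critic.{key}={attn_implementation}")
    return args


# Print the overrides for a run that also trains a critic.
print(" ".join(attn_override_args("eager", include_critic=True)))
```

The returned list can be appended to the argument list passed to ``subprocess.run`` alongside your other training parameters.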
|
|
| YAML Configuration |
| ~~~~~~~~~~~~~~~~~~ |
|
|
| You can also specify the attention implementation in your YAML configuration file: |
|
|
| .. code:: yaml |
|
|
|     actor_rollout_ref: |
|       model: |
|         override_config: |
|           attn_implementation: eager |
|           # other overrides... |
|
|
|     critic:  # if using a critic model |
|       model: |
|         override_config: |
|           attn_implementation: eager |
|           # other overrides... |
|
|
| Important Notes |
| --------------- |
|
|
| **Backward Compatibility**: If you don't specify ``attn_implementation`` in the override config, |
| VERL will continue to use ``flash_attention_2`` by default, ensuring backward compatibility with existing configurations. |
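
The fallback behavior can be pictured as a dictionary lookup with a default. This is a minimal sketch of the idea, assuming the override config is available as a plain mapping; it is not VERL's actual implementation, and ``resolve_attn_implementation`` is a hypothetical name.

```python
DEFAULT_ATTN_IMPLEMENTATION = "flash_attention_2"


def resolve_attn_implementation(override_config=None):
    """Return the configured attention implementation, or the default if unset.

    Illustrative only: `override_config` stands in for the
    `model.override_config` section of the training configuration.
    """
    override_config = override_config or {}
    return override_config.get("attn_implementation", DEFAULT_ATTN_IMPLEMENTATION)


# An existing config without the key keeps the default.
print(resolve_attn_implementation({}))
# An explicit override takes precedence.
print(resolve_attn_implementation({"attn_implementation": "sdpa"}))
```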
|
|
| **Model Support**: Not all models support all attention implementations. Ensure your model is compatible |
| with the chosen attention implementation before training. |
|
|
| **Performance Impact**: Throughput and memory use vary between attention implementations. |
| ``flash_attention_2`` typically offers the best performance on supported GPUs, while ``eager`` is usually the slowest but the easiest to debug. |
|
|
| **Hardware Dependencies**: Some attention implementations have hardware requirements. For example, ``flash_attention_2`` depends on the |
| ``flash-attn`` package, which needs a supported GPU architecture and a compatible CUDA version. If you encounter compatibility issues, try using ``eager`` or ``sdpa``. |
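
A quick pre-flight check can save a failed launch. The sketch below suggests an implementation from two inputs: the GPU's CUDA compute capability and whether the ``flash_attn`` package is importable. It assumes FlashAttention-2 needs an Ampere-or-newer GPU (compute capability 8.0 or higher); confirm this against the flash-attn documentation for your installed version. ``suggest_attn_implementation`` is a hypothetical helper, not a VERL API.

```python
import importlib.util


def flash_attn_installed():
    """Return True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def suggest_attn_implementation(cuda_capability, has_flash_attn):
    """Suggest an attention implementation for the given environment.

    cuda_capability: (major, minor) tuple, or None if no CUDA GPU is present.
    Assumption: FlashAttention-2 requires compute capability >= 8.0.
    """
    if has_flash_attn and cuda_capability is not None and tuple(cuda_capability) >= (8, 0):
        return "flash_attention_2"
    return "sdpa"


# With PyTorch installed, the capability could be read with
# torch.cuda.get_device_capability(), which returns (major, minor).
# Here a pre-Ampere capability is hard-coded for illustration.
print(suggest_attn_implementation((7, 0), flash_attn_installed()))
```

The sketch falls back to ``sdpa`` rather than ``eager`` because ``sdpa`` needs no extra package; switch to ``eager`` if you are debugging.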
|
|
| Troubleshooting |
| --------------- |
|
|
| If you encounter errors when using a specific attention implementation: |
|
|
| 1. **Check model compatibility**: Verify that your model supports the chosen attention implementation |
| 2. **Try eager attention**: Use ``attn_implementation=eager`` as a fallback for debugging |
| 3. **Check hardware requirements**: Ensure your hardware supports the attention implementation |
| 4. **Review error messages**: Attention implementation errors often provide clear guidance on supported options |
|
|
| Example Error Resolution |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| If you see an error like "flash_attention_2 is not supported", you can resolve it by switching to eager attention: |
|
|
| .. code:: bash |
|
|
|     # Instead of the default flash_attention_2 |
|     python3 ppo_trainer.py +actor_rollout_ref.model.override_config.attn_implementation=eager |
|
|
| This override ensures your training can proceed while you investigate the flash attention compatibility issue. |
|
|