.. _attention-implementation-override:

Attention Implementation Override
==================================

Last updated: 10/31/2025.

By default, VERL's FSDP workers use ``flash_attention_2`` as the attention implementation for improved performance. However, you can now override this setting to use a different attention implementation based on your needs.

Supported Attention Implementations
-----------------------------------

The following attention implementations are supported (subject to model and hardware compatibility):

- ``flash_attention_2``: High-performance attention implementation (default)
- ``eager``: Standard PyTorch attention implementation
- ``sdpa``: Scaled Dot-Product Attention (PyTorch native)

When to Override
----------------

You might want to override the attention implementation in the following scenarios:

- **Debugging**: Use ``eager`` for easier debugging and better error messages
- **Compatibility**: Some models or hardware configurations may not support ``flash_attention_2``
- **Memory constraints**: Different implementations have different memory characteristics
- **Performance tuning**: Testing different implementations for optimal performance

Configuration Examples
-----------------------

PPO Training with Eager Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To override the attention implementation for the actor, rollout, and reference models:

.. code:: bash

   python3 ppo_trainer.py \
       +actor_rollout_ref.model.override_config.attn_implementation=eager \
       [other parameters...]

PPO Training with SDPA Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   python3 ppo_trainer.py \
       +actor_rollout_ref.model.override_config.attn_implementation=sdpa \
       [other parameters...]

Critic Model Override
~~~~~~~~~~~~~~~~~~~~~

For training configurations that include a critic model, you can also override its attention implementation:

.. code:: bash

   python3 ppo_trainer.py \
       +actor_rollout_ref.model.override_config.attn_implementation=eager \
       +critic.model.override_config.attn_implementation=eager \
       [other parameters...]

YAML Configuration
~~~~~~~~~~~~~~~~~~

You can also specify the attention implementation in your YAML configuration file:

.. code:: yaml

   actor_rollout_ref:
     model:
       override_config:
         attn_implementation: eager
         # other overrides...

   critic:  # if using a critic model
     model:
       override_config:
         attn_implementation: eager
         # other overrides...

Important Notes
---------------

**Backward Compatibility**: If you don't specify ``attn_implementation`` in the override config, VERL will continue to use ``flash_attention_2`` by default, ensuring backward compatibility with existing configurations.

**Model Support**: Not all models support all attention implementations. Ensure your model is compatible with the chosen attention implementation before training.

**Performance Impact**: Different attention implementations have varying performance characteristics. ``flash_attention_2`` typically offers the best performance, while ``eager`` provides better debugging capabilities.

**Hardware Dependencies**: Some attention implementations (like ``flash_attention_2``) may require specific hardware or CUDA versions. If you encounter compatibility issues, try using ``eager`` or ``sdpa``.
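Under the hood, overrides like this are typically forwarded to the Hugging Face ``from_pretrained`` call, which accepts an ``attn_implementation`` argument. The sketch below illustrates that pattern, including the default fallback described above; the ``override_config`` handling and the model name are illustrative, not VERL's exact code:

.. code:: python

   # Minimal sketch (not VERL's actual code): how an attn_implementation
   # override typically reaches the Hugging Face model constructor. The
   # override_config dict stands in for values parsed from the trainer config.
   from transformers import AutoModelForCausalLM

   override_config = {"attn_implementation": "eager"}

   model = AutoModelForCausalLM.from_pretrained(
       "Qwen/Qwen2.5-0.5B-Instruct",  # any HF causal-LM checkpoint; illustrative
       attn_implementation=override_config.get(
           "attn_implementation", "flash_attention_2"  # documented default
       ),
   )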
Troubleshooting
---------------

If you encounter errors when using a specific attention implementation:

1. **Check model compatibility**: Verify that your model supports the chosen attention implementation
2. **Try eager attention**: Use ``attn_implementation=eager`` as a fallback for debugging
3. **Check hardware requirements**: Ensure your hardware supports the attention implementation
4. **Review error messages**: Attention implementation errors often provide clear guidance on supported options

Example Error Resolution
~~~~~~~~~~~~~~~~~~~~~~~~

If you see an error like "flash_attention_2 is not supported", you can resolve it by switching to eager attention:

.. code:: bash

   # Instead of the default flash_attention_2
   python3 ppo_trainer.py +actor_rollout_ref.model.override_config.attn_implementation=eager

This override ensures your training can proceed while you investigate the flash attention compatibility issue.
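If you are unsure which implementation your environment can run, a small preflight check can pick a safe value before launching training. The helper below is a hypothetical convenience, not part of VERL; it assumes flash-attn 2's usual requirement of an Ampere-or-newer GPU (compute capability 8.0+):

.. code:: python

   # Hypothetical preflight helper (not part of VERL): choose a safe
   # attn_implementation based on what the current environment supports.
   import importlib.util

   import torch


   def pick_attn_implementation() -> str:
       has_flash = importlib.util.find_spec("flash_attn") is not None
       if has_flash and torch.cuda.is_available():
           major, _minor = torch.cuda.get_device_capability()
           if major >= 8:  # flash-attn 2 generally needs Ampere or newer
               return "flash_attention_2"
       # PyTorch-native SDPA is a reasonable fallback; use "eager" for debugging
       return "sdpa"


   print(pick_attn_implementation())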