.. _attention-implementation-override:

Attention Implementation Override
==================================

Last updated: 10/31/2025.

By default, VERL's FSDP workers use ``flash_attention_2`` as the attention implementation for improved performance. However, you can now override this setting to use a different attention implementation based on your needs.

Supported Attention Implementations
-----------------------------------

The following attention implementations are supported (subject to model and hardware compatibility):

- ``flash_attention_2``: High-performance attention implementation (default)
- ``eager``: Standard PyTorch attention implementation
- ``sdpa``: Scaled Dot-Product Attention (PyTorch native)

When to Override
----------------

You might want to override the attention implementation in the following scenarios:

- **Debugging**: Use ``eager`` for easier debugging and better error messages
- **Compatibility**: Some models or hardware configurations may not support ``flash_attention_2``
- **Memory constraints**: Different implementations have different memory characteristics
- **Performance tuning**: Testing different implementations for optimal performance

Configuration Examples
-----------------------

PPO Training with Eager Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To override the attention implementation for the actor, rollout, and reference models:

.. code:: bash

   python3 ppo_trainer.py \
       +actor_rollout_ref.model.override_config.attn_implementation=eager \
       [other parameters...]

PPO Training with SDPA Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: bash

   python3 ppo_trainer.py \
       +actor_rollout_ref.model.override_config.attn_implementation=sdpa \
       [other parameters...]

Critic Model Override
~~~~~~~~~~~~~~~~~~~~~

For training configurations that include a critic model, you can also override its attention implementation:

.. code:: bash

   python3 ppo_trainer.py \
       +actor_rollout_ref.model.override_config.attn_implementation=eager \
       +critic.model.override_config.attn_implementation=eager \
       [other parameters...]

YAML Configuration
~~~~~~~~~~~~~~~~~~

You can also specify the attention implementation in your YAML configuration file:

.. code:: yaml

   actor_rollout_ref:
     model:
       override_config:
         attn_implementation: eager
         # other overrides...

   critic:  # if using a critic model
     model:
       override_config:
         attn_implementation: eager
         # other overrides...

Important Notes
---------------

**Backward Compatibility**: If you don't specify ``attn_implementation`` in the override config, VERL will continue to use ``flash_attention_2`` by default, ensuring backward compatibility with existing configurations.

**Model Support**: Not all models support all attention implementations. Ensure your model is compatible with the chosen attention implementation before training.

**Performance Impact**: Different attention implementations have varying performance characteristics. ``flash_attention_2`` typically offers the best performance, while ``eager`` provides better debugging capabilities.

**Hardware Dependencies**: Some attention implementations (like ``flash_attention_2``) may require specific hardware or CUDA versions. If you encounter compatibility issues, try using ``eager`` or ``sdpa``.
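Under the hood, overrides like this are typically forwarded to the Hugging Face ``from_pretrained`` call, which accepts an ``attn_implementation`` argument. The sketch below illustrates that pattern, including the default fallback described above; the ``override_config`` handling and the model name are illustrative, not VERL's exact code:

.. code:: python

   # Minimal sketch (not VERL's actual code): how an attn_implementation
   # override typically reaches the Hugging Face model constructor. The
   # override_config dict stands in for values parsed from the trainer config.
   from transformers import AutoModelForCausalLM

   override_config = {"attn_implementation": "eager"}

   model = AutoModelForCausalLM.from_pretrained(
       "Qwen/Qwen2.5-0.5B-Instruct",  # any HF causal-LM checkpoint; illustrative
       attn_implementation=override_config.get(
           "attn_implementation", "flash_attention_2"  # documented default
       ),
   )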
Troubleshooting
---------------

If you encounter errors when using a specific attention implementation:

1. **Check model compatibility**: Verify that your model supports the chosen attention implementation
2. **Try eager attention**: Use ``attn_implementation=eager`` as a fallback for debugging
3. **Check hardware requirements**: Ensure your hardware supports the attention implementation
4. **Review error messages**: Attention implementation errors often provide clear guidance on supported options

Example Error Resolution
~~~~~~~~~~~~~~~~~~~~~~~~

If you see an error like "flash_attention_2 is not supported", you can resolve it by switching to eager attention:

.. code:: bash

   # Instead of the default flash_attention_2
   python3 ppo_trainer.py +actor_rollout_ref.model.override_config.attn_implementation=eager

This override ensures your training can proceed while you investigate the flash attention compatibility issue.
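If you are unsure which implementation your environment can run, a small preflight check can pick a safe value before launching training. The helper below is a hypothetical convenience, not part of VERL; it assumes flash-attn 2's usual requirement of an Ampere-or-newer GPU (compute capability 8.0+):

.. code:: python

   # Hypothetical preflight helper (not part of VERL): choose a safe
   # attn_implementation based on what the current environment supports.
   import importlib.util

   import torch


   def pick_attn_implementation() -> str:
       has_flash = importlib.util.find_spec("flash_attn") is not None
       if has_flash and torch.cuda.is_available():
           major, _minor = torch.cuda.get_device_capability()
           if major >= 8:  # flash-attn 2 generally needs Ampere or newer
               return "flash_attention_2"
       # PyTorch-native SDPA is a reasonable fallback; use "eager" for debugging
       return "sdpa"


   print(pick_attn_implementation())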