| # FP8 RL in verl |
|
|
| Last updated: 03/05/2026 |
|
|
| verl supports two FP8 modes for accelerating RL training: |
|
|
| | Mode | Training Precision | Rollout Precision | |
| |------|-------------------|-------------------| |
| | **FP8 Rollout Only** | BF16 | FP8 | |
| | **FP8 End-to-End** | FP8 (Megatron) | FP8 (vLLM) | |
|
|
| > [!TIP] |
| > For ready-to-run scripts, see the [low-precision recipe directory](https://github.com/verl-project/verl-recipe/low_precision). |
|
|
| --- |
|
|
| ## FP8 Rollout Only |
|
|
| FP8 rollout-only mode keeps training in BF16 and quantizes rollout inference to FP8. This reduces GPU memory during generation and speeds up rollout without affecting training precision. |
|
|
| ### Implementation |
|
|
| We monkey patch several vLLM functions to enable FP8 rollout for reinforcement learning: |
|
|
| 1. **Quantize weights**: Quantize model weights on-the-fly from higher-precision formats to FP8. |
| 2. **Process weights after loading**: For vLLM, we replace the `vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading` function to handle weight processing after quantization. For SGLang, this patch is not needed as it natively supports loading quantized weights. |
|
|
| ### Support Matrix |
|
|
| - FP8 blockwise quantization for rollout |
| - Used in Deepseek, which is 1x128 quantization for activations and 128x128 quantization for model weights |
| - Dense models and MoE models |
| - Async rollout interfaces |
| - vLLM 0.10.x & vLLM 0.11 & vLLM 0.12 & SGLang 0.5.5 |
| - FSDP and Megatron training backends |
|
|
| ### Usage |
|
|
| Enable in config file: |
|
|
| ```yaml |
| rollout: |
| quantization: "fp8" |
| ``` |
|
|
| Or via command line: |
|
|
| ```bash |
| actor_rollout_ref.rollout.quantization=fp8 |
| ``` |
|
|
| ### Experiments and Outcomes |
|
|
| #### Qwen3-8B-Base Dense Model |
|
|
| **Configuration** |
| - DAPO recipe. AIME24 online validation. |
| - vLLM(FP8 spmd rollout) + FSDP |
| - Note that SPMD rollout has been deprecated, so we removed the FP8 SPMD rollout. |
| - Prompt batch size 32, n=16. |
| - Rollout batch size: 32\*3*16 |
| - Train_batch_size & ppo_mini_batch_size 32 |
| - Max response length 20K |
| - Token-level TIS, C=2 |
| - 8*H100 |
| - vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9 |
| |
| **Accuracy** |
|  |
| *dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS* |
| |
| Results and observations: |
| - With TIS, FP8 rollout aligns with BF16 |
| - Obvious accuracy drop when TIS is not enabled |
| - Higher mismatch kl but within acceptable range throughout the training |
| |
| |
| **Performance** |
| |
|  |
| *green: BF16, orange: FP8 rollout + CUDA12.6 + DeepGemm, purple: FP8 rollout + CUDA 12.9 + DeepGemm* |
| |
| Results and observations: |
| - FP8 rollout leads to around ~12% rollout speedup with CUDA 12.6 + DeepGemm |
| - When upgrading to CUDA 12.9, speedup can be up to ~18% |
| |
| #### Qwen3-30B-A3B-Base MoE Model |
| |
| **Configuration** |
| - DAPO recipe. AIME24 online validation. |
| - FP8 async rollout, vLLM+FSDP |
| - Prompt batch size 32 |
| - Rollout batch size: 32\*3*16 |
| - Train_batch_size & ppo_mini_batch_size 32 |
| - Max response length 20K |
| - Token-level TIS, C=2 |
| - 2\*8*H100 |
| - vLLM 0.10.0+CUDA 12.6 |
|
|
| **Accuracy** |
|  |
| *grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS* |
|
|
| Results and observations: |
| - Rollout & training distribution mismatch is in general higher for MoE |
| - Rollout correction required even for BF16 |
| - FP8 rollout with token-level TIS aligns with BF16 |
|
|
|
|
| **Performance** |
|
|
|  |
| *grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS* |
|
|
| Results and observations: |
| - FP8 rollout : over 35% rollout speedup |
| - Expecting more perf gain with CUDA 12.9 |
|
|
| --- |
|
|
| ## FP8 End-to-End (Training + Rollout) |
|
|
| FP8 E2E applies FP8 to the entire RL pipeline: forward/backward passes via Transformer Engine, FP8 optimizer states, and FP8 rollout inference via vLLM. This maximizes memory savings and throughput. |
|
|
| ### Requirements |
|
|
| - **CUDA 12.9+** (required for block-wise FP8 scaling) |
| - **Transformer Engine** with block-wise FP8 support |
| - Environment variable: `NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1` |
|
|
| ### Key Configuration |
|
|
| ```yaml |
| # FP8 training via Transformer Engine |
| actor_rollout_ref.actor.megatron.override_transformer_config: |
| fp8: "hybrid" # FP8 forward + backward; also supports "e4m3" |
| fp8_recipe: "blockwise" # block-wise scaling |
| |
| # FP8 optimizer |
| actor_rollout_ref.actor.optim.override_optimizer_config: |
| fp8_recipe: "blockwise" |
| |
| # FP8 rollout inference (vLLM) |
| actor_rollout_ref.rollout: |
| quantization: fp8 |
| ``` |
|
|
| ### Support Matrix |
|
|
| - Megatron training backend (via Megatron-Bridge) |
| - Verified on Qwen3-30B-A3B and Qwen3-8B |
| - Block-wise FP8 scaling (`fp8_recipe: "blockwise"`) |
|
|
| ### Experiments and Results |
|
|
| #### Qwen3-30B-A3B MoE Model |
|
|
| **Configuration** |
| - DAPO recipe. AIME24 online validation. |
| - Megatron + Megatron-Bridge, FP8 async rollout with vLLM |
| - MoE router in BF16 for both vLLM and Megatron-Core |
| - Prompt batch size 128, n=16 |
| - Max response length 20K |
| - Token-level TIS, C=2 |
| - 2\*8*H100, CUDA 12.9 |
|
|
|  |
| *Orange: BF16, Green: FP8 E2E, Red: FP8 rollout + BF16 training* |
|
|
| Results and observations: |
| - FP8 E2E achieves comparable accuracy to the BF16 baseline, with the two curves closely aligned throughout training. |
| - The training/inference precision mismatch (measured by KL divergence) follows the ordering: FP8 rollout-only > FP8 E2E > BF16 E2E. This is expected, as FP8 E2E maintains consistent precision across both training and inference, resulting in lower distribution mismatch than the FP8 rollout-only setting where training remains in BF16. |
|
|
| --- |
|
|
| ## Citation |
|
|
| For more extensive experiments, ablation studies, and analysis on FP8 reinforcement learning, please refer to our technical report: |
|
|
| ```bibtex |
| @article{qiu2026fp8rl, |
| title={FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning}, |
| author={Qiu, Zhaopeng and Yu, Shuang and Zhang, Jingqi and Zhang, Shuai and Huang, Xue and Yang, Jingyi and Lai, Junjie}, |
| journal={arXiv preprint arXiv:2601.18150}, |
| year={2026}, |
| url={https://arxiv.org/abs/2601.18150} |
| } |
| ``` |
|
|