FP8 RL in verl
Last updated: 03/05/2026
verl supports two FP8 modes for accelerating RL training:
| Mode | Training Precision | Rollout Precision |
|---|---|---|
| FP8 Rollout Only | BF16 | FP8 |
| FP8 End-to-End | FP8 (Megatron) | FP8 (vLLM) |
For ready-to-run scripts, see the low-precision recipe directory.
FP8 Rollout Only
FP8 rollout-only mode keeps training in BF16 and quantizes rollout inference to FP8. This reduces GPU memory during generation and speeds up rollout without affecting training precision.
Implementation
We monkey patch several vLLM functions to enable FP8 rollout for reinforcement learning:
- Quantize weights: Quantize model weights on-the-fly from higher-precision formats to FP8.
- Process weights after loading: For vLLM, we replace the
vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loadingfunction to handle weight processing after quantization. For SGLang, this patch is not needed as it natively supports loading quantized weights.
Support Matrix
- FP8 blockwise quantization for rollout
- Used in Deepseek, which is 1x128 quantization for activations and 128x128 quantization for model weights
- Dense models and MoE models
- Async rollout interfaces
- vLLM 0.10.x & vLLM 0.11 & vLLM 0.12 & SGLang 0.5.5
- FSDP and Megatron training backends
Usage
Enable in config file:
rollout:
quantization: "fp8"
Or via command line:
actor_rollout_ref.rollout.quantization=fp8
Experiments and Outcomes
Qwen3-8B-Base Dense Model
Configuration
- DAPO recipe. AIME24 online validation.
- vLLM(FP8 spmd rollout) + FSDP
- Note that SPMD rollout has been deprecated, so we removed the FP8 SPMD rollout.
- Prompt batch size 32, n=16.
- Rollout batch size: 32*3*16
- Train_batch_size & ppo_mini_batch_size 32
- Max response length 20K
- Token-level TIS, C=2
- 8*H100
- vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9
Accuracy
dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS
Results and observations:
- With TIS, FP8 rollout aligns with BF16
- Obvious accuracy drop when TIS is not enabled
- Higher mismatch kl but within acceptable range throughout the training
Performance
green: BF16, orange: FP8 rollout + CUDA12.6 + DeepGemm, purple: FP8 rollout + CUDA 12.9 + DeepGemm
Results and observations:
- FP8 rollout leads to around ~12% rollout speedup with CUDA 12.6 + DeepGemm
- When upgrading to CUDA 12.9, speedup can be up to ~18%
Qwen3-30B-A3B-Base MoE Model
Configuration
- DAPO recipe. AIME24 online validation.
- FP8 async rollout, vLLM+FSDP
- Prompt batch size 32
- Rollout batch size: 32*3*16
- Train_batch_size & ppo_mini_batch_size 32
- Max response length 20K
- Token-level TIS, C=2
- 2*8*H100
- vLLM 0.10.0+CUDA 12.6
Accuracy
grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS
Results and observations:
- Rollout & training distribution mismatch is in general higher for MoE
- Rollout correction required even for BF16
- FP8 rollout with token-level TIS aligns with BF16
Performance
grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS
Results and observations:
- FP8 rollout : over 35% rollout speedup
- Expecting more perf gain with CUDA 12.9
FP8 End-to-End (Training + Rollout)
FP8 E2E applies FP8 to the entire RL pipeline: forward/backward passes via Transformer Engine, FP8 optimizer states, and FP8 rollout inference via vLLM. This maximizes memory savings and throughput.
Requirements
- CUDA 12.9+ (required for block-wise FP8 scaling)
- Transformer Engine with block-wise FP8 support
- Environment variable:
NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
Key Configuration
# FP8 training via Transformer Engine
actor_rollout_ref.actor.megatron.override_transformer_config:
fp8: "hybrid" # FP8 forward + backward; also supports "e4m3"
fp8_recipe: "blockwise" # block-wise scaling
# FP8 optimizer
actor_rollout_ref.actor.optim.override_optimizer_config:
fp8_recipe: "blockwise"
# FP8 rollout inference (vLLM)
actor_rollout_ref.rollout:
quantization: fp8
Support Matrix
- Megatron training backend (via Megatron-Bridge)
- Verified on Qwen3-30B-A3B and Qwen3-8B
- Block-wise FP8 scaling (
fp8_recipe: "blockwise")
Experiments and Results
Qwen3-30B-A3B MoE Model
Configuration
- DAPO recipe. AIME24 online validation.
- Megatron + Megatron-Bridge, FP8 async rollout with vLLM
- MoE router in BF16 for both vLLM and Megatron-Core
- Prompt batch size 128, n=16
- Max response length 20K
- Token-level TIS, C=2
- 2*8*H100, CUDA 12.9
Orange: BF16, Green: FP8 E2E, Red: FP8 rollout + BF16 training
Results and observations:
- FP8 E2E achieves comparable accuracy to the BF16 baseline, with the two curves closely aligned throughout training.
- The training/inference precision mismatch (measured by KL divergence) follows the ordering: FP8 rollout-only > FP8 E2E > BF16 E2E. This is expected, as FP8 E2E maintains consistent precision across both training and inference, resulting in lower distribution mismatch than the FP8 rollout-only setting where training remains in BF16.
Citation
For more extensive experiments, ablation studies, and analysis on FP8 reinforcement learning, please refer to our technical report:
@article{qiu2026fp8rl,
title={FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning},
author={Qiu, Zhaopeng and Yu, Shuang and Zhang, Jingqi and Zhang, Shuai and Huang, Xue and Yang, Jingyi and Lai, Junjie},
journal={arXiv preprint arXiv:2601.18150},
year={2026},
url={https://arxiv.org/abs/2601.18150}
}