FP8 RL in verl

Last updated: 03/05/2026

verl supports two FP8 modes for accelerating RL training:

Mode	Training Precision	Rollout Precision
FP8 Rollout Only	BF16	FP8
FP8 End-to-End	FP8 (Megatron)	FP8 (vLLM)

For ready-to-run scripts, see the low-precision recipe directory.

FP8 Rollout Only

FP8 rollout-only mode keeps training in BF16 and quantizes rollout inference to FP8. This reduces GPU memory during generation and speeds up rollout without affecting training precision.

Implementation

We monkey-patch several vLLM functions to enable FP8 rollout for reinforcement learning:

Quantize weights: Quantize model weights on-the-fly from higher-precision formats to FP8.
Process weights after loading: For vLLM, we replace the vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading function to handle weight processing after quantization. For SGLang, this patch is not needed as it natively supports loading quantized weights.
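
The patching approach can be illustrated with a minimal sketch. The class below is a local stand-in that only borrows the name of the vLLM class mentioned above. The replacement body and the `apply_patch` helper are hypothetical and do not reproduce the actual patch in verl.

```python
class Fp8LinearMethod:
    """Local stand-in for the vLLM class of the same name (not the real code)."""

    def process_weights_after_loading(self, layer: dict) -> dict:
        layer["processed_by"] = "original"
        return layer


def patched_process_weights_after_loading(self, layer: dict) -> dict:
    # Placeholder body: per the text above, the real replacement handles
    # weight processing after on-the-fly quantization.
    layer["processed_by"] = "patched"
    return layer


def apply_patch(cls):
    """Swap the method on the class and return the original so it can be restored."""
    original = cls.process_weights_after_loading
    cls.process_weights_after_loading = patched_process_weights_after_loading
    return original


original = apply_patch(Fp8LinearMethod)
result = Fp8LinearMethod().process_weights_after_loading({})
print(result["processed_by"])
```

Because the method is replaced on the class itself, every instance created before or after the patch picks up the new behavior. Keeping the returned `original` makes the patch reversible.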

Support Matrix

FP8 blockwise quantization for rollout
- This is the scheme used in DeepSeek: 1x128 blocks for activations and 128x128 blocks for model weights
Dense models and MoE models
Async rollout interfaces
vLLM 0.10.x, 0.11, and 0.12; SGLang 0.5.5
FSDP and Megatron training backends
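
To make the block shapes above concrete, here is a minimal pure-Python sketch of blockwise quantization. It is a simulation for illustration, not the kernel used by vLLM or verl. It assumes the FP8 E4M3 format (largest finite value 448) and a per-block scale of `amax / 448`, which is a common choice but is not stated in this document. All function names are hypothetical. The demo uses 2x2 blocks so the result is easy to inspect; passing `(128, 128)` or `(1, 128)` gives the weight and activation block shapes listed above.

```python
import math

E4M3_MAX = 448.0  # largest finite value of FP8 E4M3 (assumed format)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, ex = math.frexp(a)        # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)          # exponents below -6 fall in the subnormal range
    step = 2.0 ** (e - 3)        # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_blockwise(mat, block_rows, block_cols):
    """Quantize a 2D list block by block; return (quantized values, per-block scales)."""
    rows, cols = len(mat), len(mat[0])
    q = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for r0 in range(0, rows, block_rows):
        for c0 in range(0, cols, block_cols):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block_rows, rows))
                     for c in range(c0, min(c0 + block_cols, cols))]
            amax = max(abs(mat[r][c]) for r, c in cells)
            scale = amax / E4M3_MAX if amax > 0 else 1.0
            scales[(r0 // block_rows, c0 // block_cols)] = scale
            for r, c in cells:
                q[r][c] = round_to_e4m3(mat[r][c] / scale)
    return q, scales


def dequantize_blockwise(q, scales, block_rows, block_cols):
    """Multiply each quantized value by the scale of the block it belongs to."""
    return [[q[r][c] * scales[(r // block_rows, c // block_cols)]
             for c in range(len(q[0]))]
            for r in range(len(q))]


# Two 2x2 blocks whose magnitudes differ by 100x: each block gets its own scale.
mat = [[1.0, 2.0, 100.0, 200.0],
       [3.0, 4.0, 300.0, 400.0]]
q, scales = quantize_blockwise(mat, 2, 2)
deq = dequantize_blockwise(q, scales, 2, 2)
```

Because each block carries its own scale, the small-magnitude block on the left keeps the same relative precision as the large-magnitude block on the right, which a single per-tensor scale would not provide.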

Usage

Enable in config file:

rollout:
  quantization: "fp8"

Or via command line:

actor_rollout_ref.rollout.quantization=fp8
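
The command-line key is the full dotted path to the same `quantization` field, nested under `actor_rollout_ref`. A minimal sketch of that mapping follows. `apply_override` is a hypothetical helper written for illustration; it is not part of verl, which parses overrides with its own configuration system.

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override to a nested dict, creating levels as needed."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


config = apply_override({}, "actor_rollout_ref.rollout.quantization=fp8")
print(config)
# {'actor_rollout_ref': {'rollout': {'quantization': 'fp8'}}}
```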

Experiments and Results

Qwen3-8B-Base Dense Model

Configuration

DAPO recipe. AIME24 online validation.
vLLM (FP8 SPMD rollout) + FSDP
- Note: SPMD rollout has since been deprecated, so the FP8 SPMD rollout path has been removed.
Prompt batch size 32, n=16.
Rollout batch size: 32*3*16
train_batch_size and ppo_mini_batch_size: 32
Max response length 20K
Token-level TIS, C=2
8*H100
vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9

Accuracy

Dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS

Results and observations:

With TIS, FP8 rollout accuracy matches BF16.
Accuracy drops noticeably when TIS is not enabled.
The mismatch KL is higher than with BF16 rollout but stays within an acceptable range throughout training.
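
The role of token-level TIS (truncated importance sampling) with C=2 can be sketched as follows. This assumes the common formulation in which each token's importance ratio between the training policy and the rollout policy is capped at C. The function names are hypothetical, and the KL estimator shown is one simple choice; the exact metric logged by verl may differ.

```python
import math


def tis_weights(train_logprobs, rollout_logprobs, clip_c=2.0):
    """Per-token weight min(pi_train / pi_rollout, C), computed from log-probs."""
    return [min(math.exp(lt - lr), clip_c)
            for lt, lr in zip(train_logprobs, rollout_logprobs)]


def mismatch_kl(train_logprobs, rollout_logprobs):
    """Sample estimate of KL(rollout || train) over tokens drawn by the rollout engine."""
    diffs = [lr - lt for lt, lr in zip(train_logprobs, rollout_logprobs)]
    return sum(diffs) / len(diffs)


# Log-probs of the same three sampled tokens under each engine (made-up numbers).
train = [math.log(0.5), math.log(0.9), math.log(0.2)]
rollout = [math.log(0.5), math.log(0.3), math.log(0.4)]

weights = tis_weights(train, rollout, clip_c=2.0)
kl = mismatch_kl(train, rollout)
```

In this toy example the second token has a raw ratio of about 3, which the cap reduces to 2, while the other two weights are left unchanged. The cap bounds how much any single token can amplify the loss when the FP8 rollout distribution drifts from the training distribution.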

Performance

Green: BF16, orange: FP8 rollout + CUDA 12.6 + DeepGemm, purple: FP8 rollout + CUDA 12.9 + DeepGemm

Results and observations:

FP8 rollout gives a rollout speedup of about 12% with CUDA 12.6 + DeepGemm.
After upgrading to CUDA 12.9, the speedup reaches up to about 18%.

Qwen3-30B-A3B-Base MoE Model

Configuration

DAPO recipe. AIME24 online validation.
FP8 async rollout, vLLM+FSDP
Prompt batch size 32
Rollout batch size: 32*3*16
train_batch_size and ppo_mini_batch_size: 32
Max response length 20K
Token-level TIS, C=2
2*8*H100
vLLM 0.10.0+CUDA 12.6

Accuracy

Grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS

Results and observations:

The distribution mismatch between rollout and training is generally higher for MoE models.
Rollout correction is required even for BF16.
FP8 rollout with token-level TIS matches BF16.

Performance

grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS

Results and observations:

FP8 rollout gives a rollout speedup of over 35%.
A larger performance gain is expected with CUDA 12.9.

FP8 End-to-End (Training + Rollout)

FP8 E2E applies FP8 to the entire RL pipeline: forward/backward passes via Transformer Engine, FP8 optimizer states, and FP8 rollout inference via vLLM. This maximizes memory savings and throughput.

Requirements

CUDA 12.9+ (required for block-wise FP8 scaling)
Transformer Engine with block-wise FP8 support
Environment variable: NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
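
Two of these requirements can be verified before launching a job. The sketch below is a hypothetical preflight helper, not something verl provides. It takes the CUDA version as a string supplied by the caller and checks the environment variable named above. It does not check the Transformer Engine build, which has to be verified separately.

```python
import os

# Values taken from the requirements list above.
REQUIRED_ENV = {"NVTE_FP8_BLOCK_SCALING_FP32_SCALES": "1"}
MIN_CUDA = (12, 9)


def check_fp8_e2e_requirements(cuda_version: str, env=None) -> list:
    """Return a list of human-readable problems; an empty list means both checks pass."""
    env = os.environ if env is None else env
    problems = []

    # Compare only major.minor, e.g. "12.9.1" -> (12, 9).
    major_minor = tuple(int(part) for part in cuda_version.split(".")[:2])
    if major_minor < MIN_CUDA:
        problems.append(f"CUDA {cuda_version} is older than 12.9")

    for name, expected in REQUIRED_ENV.items():
        if env.get(name) != expected:
            problems.append(f"environment variable {name} must be set to {expected}")

    return problems


ok = check_fp8_e2e_requirements("12.9", {"NVTE_FP8_BLOCK_SCALING_FP32_SCALES": "1"})
bad = check_fp8_e2e_requirements("12.6", {})
```

Here `ok` is an empty list, while `bad` contains one entry for the CUDA version and one for the missing environment variable.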

Key Configuration

# FP8 training via Transformer Engine
actor_rollout_ref.actor.megatron.override_transformer_config:
  fp8: "hybrid"              # FP8 forward + backward; also supports "e4m3"
  fp8_recipe: "blockwise"    # block-wise scaling

# FP8 optimizer
actor_rollout_ref.actor.optim.override_optimizer_config:
  fp8_recipe: "blockwise"

# FP8 rollout inference (vLLM)
actor_rollout_ref.rollout:
  quantization: fp8

Support Matrix

Megatron training backend (via Megatron-Bridge)
Verified on Qwen3-30B-A3B and Qwen3-8B
Block-wise FP8 scaling (fp8_recipe: "blockwise")

Experiments and Results

Qwen3-30B-A3B MoE Model

Configuration

DAPO recipe. AIME24 online validation.
Megatron + Megatron-Bridge, FP8 async rollout with vLLM
MoE router in BF16 for both vLLM and Megatron-Core
Prompt batch size 128, n=16
Max response length 20K
Token-level TIS, C=2
2*8*H100, CUDA 12.9

Orange: BF16, Green: FP8 E2E, Red: FP8 rollout + BF16 training

Results and observations:

FP8 E2E achieves comparable accuracy to the BF16 baseline, with the two curves closely aligned throughout training.
The training/inference precision mismatch (measured by KL divergence) follows the ordering: FP8 rollout-only > FP8 E2E > BF16 E2E. This is expected, as FP8 E2E maintains consistent precision across both training and inference, resulting in lower distribution mismatch than the FP8 rollout-only setting where training remains in BF16.

Citation

For more extensive experiments, ablation studies, and analysis on FP8 reinforcement learning, please refer to our technical report:

@article{qiu2026fp8rl,
  title={FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning},
  author={Qiu, Zhaopeng and Yu, Shuang and Zhang, Jingqi and Zhang, Shuai and Huang, Xue and Yang, Jingyi and Lai, Junjie},
  journal={arXiv preprint arXiv:2601.18150},
  year={2026},
  url={https://arxiv.org/abs/2601.18150}
}