initial clean commit

1faccd4 about 1 month ago

6.55 kB

	# FP8 RL in verl

	Last updated: 03/05/2026

	verl supports two FP8 modes for accelerating RL training:

	\| Mode \| Training Precision \| Rollout Precision \|
	\|------\|-------------------\|-------------------\|
	\| FP8 Rollout Only \| BF16 \| FP8 \|
	\| FP8 End-to-End \| FP8 (Megatron) \| FP8 (vLLM) \|

	> [!TIP]
	> For ready-to-run scripts, see the [low-precision recipe directory](https://github.com/verl-project/verl-recipe/low_precision).

	---

	## FP8 Rollout Only

	FP8 rollout-only mode keeps training in BF16 and quantizes rollout inference to FP8. This reduces GPU memory during generation and speeds up rollout without affecting training precision.

	### Implementation

	We monkey patch several vLLM functions to enable FP8 rollout for reinforcement learning:

	1. Quantize weights: Quantize model weights on-the-fly from higher-precision formats to FP8.
	2. Process weights after loading: For vLLM, we replace the `vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading` function to handle weight processing after quantization. For SGLang, this patch is not needed as it natively supports loading quantized weights.

	### Support Matrix

	- FP8 blockwise quantization for rollout
	- Used in Deepseek, which is 1x128 quantization for activations and 128x128 quantization for model weights
	- Dense models and MoE models
	- Async rollout interfaces
	- vLLM 0.10.x & vLLM 0.11 & vLLM 0.12 & SGLang 0.5.5
	- FSDP and Megatron training backends

	### Usage

	Enable in config file:

	```yaml
	rollout:
	quantization: "fp8"
	```

	Or via command line:

	```bash
	actor_rollout_ref.rollout.quantization=fp8
	```

	### Experiments and Outcomes

	#### Qwen3-8B-Base Dense Model

	Configuration
	- DAPO recipe. AIME24 online validation.
	- vLLM(FP8 spmd rollout) + FSDP
	- Note that SPMD rollout has been deprecated, so we removed the FP8 SPMD rollout.
	- Prompt batch size 32, n=16.
	- Rollout batch size: 32\316
	- Train_batch_size & ppo_mini_batch_size 32
	- Max response length 20K
	- Token-level TIS, C=2
	- 8*H100
	- vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9

	Accuracy
	![Qwen3-8b-base_fp8_acc](
	https://github.com/Agoniii/verl/blob/xueh/fp8_pr_images/docs/advance/images/Qwen3-8b-base_fp8_acc.png?raw=true)
	dark green: BF16, orange: FP8 rollout + token-level TIS, light green: FP8 rollout without TIS

	Results and observations:
	- With TIS, FP8 rollout aligns with BF16
	- Obvious accuracy drop when TIS is not enabled
	- Higher mismatch kl but within acceptable range throughout the training


	Performance

	![Qwen3-8b-base_fp8_rollout_perf](
	https://github.com/Agoniii/verl/blob/xueh/fp8_pr_images/docs/advance/images/Qwen3-8b-base_fp8_rollout_perf.png?raw=true)
	green: BF16, orange: FP8 rollout + CUDA12.6 + DeepGemm, purple: FP8 rollout + CUDA 12.9 + DeepGemm

	Results and observations:
	- FP8 rollout leads to around ~12% rollout speedup with CUDA 12.6 + DeepGemm
	- When upgrading to CUDA 12.9, speedup can be up to ~18%

	#### Qwen3-30B-A3B-Base MoE Model

	Configuration
	- DAPO recipe. AIME24 online validation.
	- FP8 async rollout, vLLM+FSDP
	- Prompt batch size 32
	- Rollout batch size: 32\316
	- Train_batch_size & ppo_mini_batch_size 32
	- Max response length 20K
	- Token-level TIS, C=2
	- 2\8H100
	- vLLM 0.10.0+CUDA 12.6

	Accuracy
	![Qwen3-30b-a3b_fp8_acc](
	https://github.com/Agoniii/verl/blob/xueh/fp8_pr_images/docs/advance/images/Qwen3-30b-a3b_fp8_acc.png?raw=true)
	grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS

	Results and observations:
	- Rollout & training distribution mismatch is in general higher for MoE
	- Rollout correction required even for BF16
	- FP8 rollout with token-level TIS aligns with BF16


	Performance

	![Qwen3-30b-a3b_fp8_perf](
	https://github.com/Agoniii/verl/blob/xueh/fp8_pr_images/docs/advance/images/Qwen3-30b-a3b_fp8_perf.png?raw=true)
	grey: BF16 + token-level TIS, red: FP8 rollout + token-level TIS

	Results and observations:
	- FP8 rollout : over 35% rollout speedup
	- Expecting more perf gain with CUDA 12.9

	---

	## FP8 End-to-End (Training + Rollout)

	FP8 E2E applies FP8 to the entire RL pipeline: forward/backward passes via Transformer Engine, FP8 optimizer states, and FP8 rollout inference via vLLM. This maximizes memory savings and throughput.

	### Requirements

	- CUDA 12.9+ (required for block-wise FP8 scaling)
	- Transformer Engine with block-wise FP8 support
	- Environment variable: `NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1`

	### Key Configuration

	```yaml
	# FP8 training via Transformer Engine
	actor_rollout_ref.actor.megatron.override_transformer_config:
	fp8: "hybrid" # FP8 forward + backward; also supports "e4m3"
	fp8_recipe: "blockwise" # block-wise scaling

	# FP8 optimizer
	actor_rollout_ref.actor.optim.override_optimizer_config:
	fp8_recipe: "blockwise"

	# FP8 rollout inference (vLLM)
	actor_rollout_ref.rollout:
	quantization: fp8
	```

	### Support Matrix

	- Megatron training backend (via Megatron-Bridge)
	- Verified on Qwen3-30B-A3B and Qwen3-8B
	- Block-wise FP8 scaling (`fp8_recipe: "blockwise"`)

	### Experiments and Results

	#### Qwen3-30B-A3B MoE Model

	Configuration
	- DAPO recipe. AIME24 online validation.
	- Megatron + Megatron-Bridge, FP8 async rollout with vLLM
	- MoE router in BF16 for both vLLM and Megatron-Core
	- Prompt batch size 128, n=16
	- Max response length 20K
	- Token-level TIS, C=2
	- 2\8H100, CUDA 12.9

	![Qwen3-30b-a3b_fp8_e2e](https://github.com/user-attachments/assets/70fb1396-ec73-40d7-9a43-1d48553c0ad9)
	Orange: BF16, Green: FP8 E2E, Red: FP8 rollout + BF16 training

	Results and observations:
	- FP8 E2E achieves comparable accuracy to the BF16 baseline, with the two curves closely aligned throughout training.
	- The training/inference precision mismatch (measured by KL divergence) follows the ordering: FP8 rollout-only > FP8 E2E > BF16 E2E. This is expected, as FP8 E2E maintains consistent precision across both training and inference, resulting in lower distribution mismatch than the FP8 rollout-only setting where training remains in BF16.

	---

	## Citation

	For more extensive experiments, ablation studies, and analysis on FP8 reinforcement learning, please refer to our technical report:

	```bibtex
	@article{qiu2026fp8rl,
	title={FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning},
	author={Qiu, Zhaopeng and Yu, Shuang and Zhang, Jingqi and Zhang, Shuai and Huang, Xue and Yang, Jingyi and Lai, Junjie},
	journal={arXiv preprint arXiv:2601.18150},
	year={2026},
	url={https://arxiv.org/abs/2601.18150}
	}
	```