
FP8 rollout for verl

Last updated: 12/4/2025

This document introduces FP8 rollout in verl.

We monkey patch several vLLM functions to enable FP8 rollout for reinforcement learning:

  1. Quantize weights: Quantize model weights on-the-fly from higher-precision formats to FP8.
  2. Process weights after loading: For vLLM, we replace the vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod.process_weights_after_loading function to handle weight processing after quantization. For SGLang, this patch is not needed as it natively supports loading quantized weights.
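The patch pattern for step 2 can be sketched as follows. This is a minimal illustration only: the class below is a stand-in for `vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod`, and the replacement body is hypothetical; the real patch in verl performs the actual post-quantization weight processing.

```python
# Stand-in for vLLM's Fp8LinearMethod; in verl, the patch targets
# vllm.model_executor.layers.quantization.fp8.Fp8LinearMethod directly.
class Fp8LinearMethod:
    def process_weights_after_loading(self, layer):
        raise NotImplementedError  # original vLLM logic would run here


def patched_process_weights_after_loading(self, layer):
    # Illustrative body: weights were already quantized on the fly,
    # so the patched version can skip vLLM's re-quantization path.
    return layer


# Assigning on the class makes every existing and future instance
# pick up the patched method (the monkey-patch step).
Fp8LinearMethod.process_weights_after_loading = patched_process_weights_after_loading
```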

Support Matrix

  • FP8 blockwise quantization for rollout
    • Used in Deepseek, which is 1x128 quantization for activations and 128x128 quantization for model weights
  • Dense models and MoE models
  • Async rollout interfaces
  • vLLM 0.10.x, vLLM 0.11, and SGLang 0.5.5
  • FSDP and Megatron training backends

Experiments and Outcomes

Qwen3-8B-Base Dense Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • vLLM(FP8 spmd rollout) + FSDP
    • Note that SPMD rollout has been deprecated, so we removed the FP8 SPMD rollout.
  • Prompt batch size 32, n=16.
  • Rollout batch size: 32*3*16
  • Train_batch_size & ppo_mini_batch_size 32
  • Max response length 20K
  • Token-level TIS, C=2
  • 8*H100
  • vLLM 0.10.0+CUDA 12.6 vs vLLM 0.11.0+CUDA 12.9
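The token-level TIS (truncated importance sampling) correction listed above weights each token's loss by the ratio of training-policy to rollout-policy probability, truncated at C, so that the FP8 rollout / BF16 training mismatch does not bias the gradient. A minimal sketch, assuming per-token log-probabilities from both policies are available (function name is illustrative):

```python
import torch


def token_level_tis_weights(logp_train: torch.Tensor,
                            logp_rollout: torch.Tensor,
                            clip_c: float = 2.0) -> torch.Tensor:
    """Per-token truncated importance sampling weights.
    ratio = pi_train(token) / pi_rollout(token), clipped at clip_c
    so a few high-ratio tokens cannot dominate the gradient."""
    ratio = torch.exp(logp_train - logp_rollout)
    return ratio.clamp(max=clip_c).detach()


# Usage (illustrative): scale the per-token policy-gradient loss
#   loss = -(tis_w * advantages * logp_train).sum() / num_tokens
```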

Accuracy

Figure: Qwen3-8b-base_fp8_acc (dark green: BF16; orange: FP8 rollout + token-level TIS; light green: FP8 rollout without TIS)

Results and observations:

  • With TIS, FP8 rollout aligns with BF16
  • Obvious accuracy drop when TIS is not enabled
  • Higher mismatch KL, but within an acceptable range throughout training

Performance

Figure: Qwen3-8b-base_fp8_rollout_perf (green: BF16; orange: FP8 rollout + CUDA 12.6 + DeepGEMM; purple: FP8 rollout + CUDA 12.9 + DeepGEMM)

Results and observations:

  • FP8 rollout yields a ~12% rollout speedup with CUDA 12.6 + DeepGEMM
  • Upgrading to CUDA 12.9 increases the speedup to ~18%

Qwen3-30B-A3B-Base MoE Model

Configuration

  • DAPO recipe. AIME24 online validation.
  • FP8 async rollout, vLLM+FSDP
  • Prompt batch size 32
  • Rollout batch size: 32*3*16
  • Train_batch_size & ppo_mini_batch_size 32
  • Max response length 20K
  • Token-level TIS, C=2
  • 2*8*H100
  • vLLM 0.10.0+CUDA 12.6

Please refer to recipe/dapo/run_dapo_qwen3_moe_30b_vllm_fp8_rollout.sh

Accuracy

Figure: Qwen3-30b-a3b_fp8_acc (grey: BF16 + token-level TIS; red: FP8 rollout + token-level TIS)

Results and observations:

  • The rollout/training distribution mismatch is generally higher for MoE models
  • Rollout correction is required even for BF16
  • FP8 rollout with token-level TIS aligns with BF16

Performance

Figure: Qwen3-30b-a3b_fp8_perf (grey: BF16 + token-level TIS; red: FP8 rollout + token-level TIS)

Results and observations:

  • FP8 rollout: over 35% rollout speedup
  • More performance gain is expected with CUDA 12.9

Usage

FP8 can be enabled in the config file verl/trainer/config/ppo_megatron_trainer.yaml:

  rollout:
    quantization: "fp8"

Or it can be enabled by command line:

  • actor_rollout_ref.rollout.quantization=fp8

Please refer to recipe/dapo/run_dapo_qwen3_moe_30b_vllm_fp8_rollout.sh