# Algorithm Baselines
Last updated: 06/18/2025.
## Math-related datasets

### GSM8k
Assuming the GSM8k/MATH datasets are preprocessed via:

```bash
python3 examples/data_preprocess/*.py
```
Refer to the table below to reproduce RL training from different pre-trained checkpoints. Unless specified otherwise, scores are on the GSM8k test set. More comprehensive benchmark results are available in the recipe folder.
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | google/gemma-2-2b-it | hf checkpoint | 23.9 | Huggingface |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | command and logs |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | command and logs, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | hf checkpoint | 49.6 | Qwen blog |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | command and log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | command and logs |
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | SPPO script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | command and logs |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | Qwen Blog |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | 65.4 (GEO3k) | script, wandb |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | log |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | log |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | command and logs |
### DAPO math-17k
- Training DAPO math-17k dataset: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
- Testing: AIME'24: https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024
Note:
- For Qwen/Qwen2.5-Math-7B, we directly modify max_position_embeddings to 32768 so the model can be trained with longer responses; we observed no performance degradation from this change.
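If you want to apply the same change to a local checkpoint, one minimal way is to edit its `config.json` before training. The helper below is an illustrative sketch, not part of verl; the path in the usage example is hypothetical.

```python
import json

def set_max_position_embeddings(config_path: str, value: int = 32768) -> None:
    """Rewrite max_position_embeddings in a HF checkpoint's config.json."""
    with open(config_path) as f:
        config = json.load(f)
    config["max_position_embeddings"] = value
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```

Usage (path is illustrative): `set_max_position_embeddings("Qwen2.5-Math-7B/config.json")`.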
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | command, logs |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | DAPO + Code Interpreter | 40.0 | command |
## Coding-related datasets

Below are results on LeetCode unless specified otherwise.
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | script, swanlab |
## Notes
[1] During evaluation, we only extract answers that follow the "####" format. More flexible answer extraction, longer response lengths, and better prompt engineering may lead to higher scores.
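The strict "####" extraction described in [1] can be sketched as follows. The function name and regex are illustrative, not verl's actual evaluation code:

```python
import re
from typing import Optional

def extract_strict_answer(response: str) -> Optional[str]:
    """Extract a numeric answer only if it follows the '####' marker.

    Responses without the marker are counted as wrong, which is why a more
    flexible extraction could raise the reported scores.
    """
    match = re.search(r"####\s*(-?[0-9][0-9,\.]*)", response)
    if match is None:
        return None
    # Normalize thousands separators, e.g. "1,234" -> "1234".
    return match.group(1).replace(",", "")
```

For example, a response ending in "The answer is 72." with no "####" marker yields `None` under this scheme, even though the answer is recoverable.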
[2] Since verl 0.3.x (2025-05-30), the default value of actor_rollout_ref.actor.entropy_coeff is 0.0, which differs from previous versions.
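If you want to restore a nonzero entropy bonus, you can override the value on the command line. The entrypoint and value below are illustrative; use whichever trainer script and coefficient your earlier runs used:

```bash
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    ... # rest of your usual training arguments
```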