# Algorithm Baselines
Last updated: 06/18/2025.
## Math-related datasets

### GSM8k
Assuming the GSM8k/MATH datasets are preprocessed via:

```bash
python3 examples/data_preprocess/*.py
```
Refer to the table below to reproduce RL training from different pre-trained checkpoints. Unless specified otherwise, scores are on the GSM8k test set. More comprehensive benchmark results are available in the recipe folder.
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | google/gemma-2-2b-it | hf checkpoint | 23.9 | Huggingface |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | command and logs |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | command and logs, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | hf checkpoint | 49.6 | Qwen blog |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | command and log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | command and logs |
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | SPPO script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | command and logs |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | Qwen Blog |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | 65.4 (GEO3k) | script, wandb |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | log |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | log |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | command and logs |
### DAPO math-17k
- Training DAPO math-17k dataset: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
- Testing: AIME'24: https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024
Note:
- For Qwen/Qwen2.5-Math-7B, we directly modify max_position_embeddings to 32768 so the model can be trained with longer responses; we observed no performance degradation from this change.
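If you want to apply the same change to a local checkpoint, one minimal way is to edit its `config.json` before training. The helper below is an illustrative sketch, not part of verl; the path in the usage example is hypothetical.

```python
import json

def set_max_position_embeddings(config_path: str, value: int = 32768) -> None:
    """Rewrite max_position_embeddings in a HF checkpoint's config.json."""
    with open(config_path) as f:
        config = json.load(f)
    config["max_position_embeddings"] = value
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
```

Usage (path is illustrative): `set_max_position_embeddings("Qwen2.5-Math-7B/config.json")`.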
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | command, logs |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | DAPO + Code Interpreter | 40.0 | command |
## Coding-related datasets

Below are results on LeetCode unless specified otherwise.
| Hardware | Model | Method | Test score | Details |
|---|---|---|---|---|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | script, swanlab |
## Notes
[1] During evaluation, we only extract answers that follow the "####" format. More flexible answer extraction, longer response lengths, and better prompt engineering may lead to higher scores.
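The strict "####" extraction described in [1] can be sketched as follows. The function name and regex are illustrative, not verl's actual evaluation code:

```python
import re
from typing import Optional

def extract_strict_answer(response: str) -> Optional[str]:
    """Extract a numeric answer only if it follows the '####' marker.

    Responses without the marker are counted as wrong, which is why a more
    flexible extraction could raise the reported scores.
    """
    match = re.search(r"####\s*(-?[0-9][0-9,\.]*)", response)
    if match is None:
        return None
    # Normalize thousands separators, e.g. "1,234" -> "1234".
    return match.group(1).replace(",", "")
```

For example, a response ending in "The answer is 72." with no "####" marker yields `None` under this scheme, even though the answer is recoverable.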
[2] Since verl 0.3.x (2025-05-30), the default value of actor_rollout_ref.actor.entropy_coeff is 0.0, which differs from previous versions.
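If you want to restore a nonzero entropy bonus, you can override the value on the command line. The entrypoint and value below are illustrative; use whichever trainer script and coefficient your earlier runs used:

```bash
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    ... # rest of your usual training arguments
```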