
Algorithm Baselines

Last updated: 06/18/2025.

Math related datasets

GSM8k

Assuming the GSM8k/MATH dataset has been preprocessed via the corresponding script:

python3 examples/data_preprocess/*.py
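
The preprocessing scripts write parquet files that the trainer reads back. To verify the output before launching a run, a minimal sketch (the ~/data/gsm8k path is an assumed default output directory, not something this page specifies):

```python
import os

import pandas as pd

# Assumed default output location of the GSM8k preprocessing script;
# adjust to whatever --local_dir you used.
path = os.path.expanduser("~/data/gsm8k/train.parquet")
df = pd.read_parquet(path)
print(f"{len(df)} training examples")
print(df.columns.tolist())  # inspect the prompt/reward fields before training
print(df.iloc[0])
```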

Refer to the table below to reproduce RL training from different pre-trained checkpoints. Unless specified otherwise, scores are on the GSM8k test set. More comprehensive benchmark results are available in the recipe folder.

| Hardware | Model | Method | Test score | Details |
|----------|-------|--------|------------|---------|
| NVIDIA GPU | google/gemma-2-2b-it | hf checkpoint | 23.9 | Huggingface |
| NVIDIA GPU | google/gemma-2-2b-it | SFT | 52.06 | command and logs |
| NVIDIA GPU | google/gemma-2-2b-it | SFT + PPO | 64.02 | command and logs, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | hf checkpoint | 49.6 | Qwen blog |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | command and log |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | PRIME | 58.7 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-0.5B-Instruct | GRPO-LoRA | 54.3 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-1.5B-Instruct | GRPO-LoRA | 77.9 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-3B-Instruct | GRPO-LoRA | 86.1 | command and logs |
| NVIDIA GPU | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron) | 69.5 [1] | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO | 89 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (FSDP2) | 89.8 | log |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GRPO (Megatron) | 89.6 | log |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | ReMax | 97 | script, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPPO | 65.6 (MATH) | SPPO script |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | GRPO-LoRA | 93.4 | command and logs |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | Instruct model | 83.7 | Qwen Blog |
| NVIDIA GPU | Mixtral-8x22B-Instruct-v0.1 | RLOO (Megatron) | 92.3 | wandb |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | SPIN | 92 | script |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2-7B-Instruct | GPG (Megatron) | 88 | log, wandb |
| NVIDIA GPU | Qwen/Qwen2.5-VL-7B-Instruct | GRPO (Megatron) | 65.4 (GEO3k) | script, wandb |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | PPO | 70.5 [1] | log |
| AMD MI300 | deepseek-ai/deepseek-llm-7b-chat | GRPO | 71.4 [1] | log |
| NVIDIA GPU | Qwen/Qwen2.5-14B-Instruct | GRPO-LoRA | 94.6 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-32B-Instruct | GRPO-LoRA | 95.8 | command and logs |
| NVIDIA GPU | Qwen/Qwen2.5-72B-Instruct | GRPO-LoRA | 96.0 | command and logs |

DAPO math-17k

Note:

  • For Qwen/Qwen2.5-Math-7B, we directly modify max_position_embeddings to 32768, without observing performance degradation, in order to train with a longer response length. A minimal sketch of this change follows the table below.
| Hardware | Model | Method | Test score | Details |
|----------|-------|--------|------------|---------|
| NVIDIA GPU | Qwen/Qwen2.5-Math-7B (32k) | DAPO | 36.3 | command, logs |
| NVIDIA GPU | Qwen/Qwen2.5-7B-Instruct | DAPO + Code Interpreter | 40.0 | command |
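
The max_position_embeddings change mentioned in the note above amounts to editing the checkpoint's config before training. A minimal sketch using Hugging Face transformers, assuming a local copy of the checkpoint (the path is illustrative, and the recipe may apply the change differently):

```python
from transformers import AutoConfig

# Illustrative local path to a copy of Qwen/Qwen2.5-Math-7B.
ckpt_dir = "./Qwen2.5-Math-7B-32k"

config = AutoConfig.from_pretrained(ckpt_dir)
config.max_position_embeddings = 32768  # allow longer prompt + response lengths
config.save_pretrained(ckpt_dir)  # rewrites config.json in place
```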

Coding related datasets

Unless specified otherwise, scores below are on the LeetCode dataset.

| Hardware | Model | Method | Test score | Details |
|----------|-------|--------|------------|---------|
| NVIDIA GPU | PRIME-RL/Eurus-2-7B-SFT | PRIME | 36.1 | script, swanlab |

Notes

[1] During evaluation, we only extract answers that follow the "####" format. More flexible answer extraction, a longer response length, and better prompt engineering may lead to higher scores.
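
To make note [1] concrete, here is a sketch of the difference between strict "####" extraction and a more flexible fallback (illustrative only, not the evaluation code used for these runs):

```python
import re

def extract_strict(response: str) -> str | None:
    """Accept only answers that follow GSM8k's '#### <answer>' convention."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", response)
    return m.group(1).replace(",", "") if m else None

def extract_flexible(response: str) -> str | None:
    """Fallback: take the last number appearing anywhere in the response."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", response)
    return nums[-1].replace(",", "") if nums else None

print(extract_strict("... so the answer is #### 42"))  # '42'
print(extract_strict("... so the answer is 42."))      # None -> scored as wrong
print(extract_flexible("... so the answer is 42."))    # '42'
```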

[2] Since verl 0.3.x (2025-05-30), the default value of actor_rollout_ref.actor.entropy_coeff is 0.0, which differs from earlier versions.