GRPO: Group Relative Policy Optimization for Mathematical Reasoning

Project Overview

This project implements Dual-Pro GRPO (Group Relative Policy Optimization), a reinforcement learning framework for improving large language models' mathematical reasoning through dual process-reward models, progressive rewards, and correct-prefix truncation.

Key Innovation: Dual-Process Reward Models with Progressive Rewards and Correct Prefix Truncation


Current Status (January 22, 2026)

Latest Training Results ✅

  • Final Checkpoint: global_step_3360 (H200 training completed)
  • Training Reward: 0.420 (+28% improvement from initial 0.328)
  • Training Dataset: 1,800 math Olympiad problems (Math3to5)
  • Training Duration: ~48 hours on 4× H200 GPUs
  • Training Date: January 20-22, 2026

Evaluation Results (Step 3360)

| Test Set     | Samples | Accuracy | Baseline | Change    |
|--------------|---------|----------|----------|-----------|
| MATH-500     | 500     | 46.6%    | 59.8%    | -13.2% ⚠️ |
| MinervaMath  | 272     | 13.97%   | 20.2%    | -6.23% ⚠️ |
| GPQA Diamond | 198     | 5.56%    | 7.6%     | -2.04% ⚠️ |
| AMC 23       | 40      | 0.0%     | 0.0%     | 0%        |
| AIME 2024    | 30      | 0.0%     | 0.0%     | 0%        |

⚠️ Known Issue: Despite the +28% training reward improvement, test accuracy dropped by 2-13 percentage points across the evaluation sets. This points to overfitting or a mismatch between the reward signal and actual correctness. See the Challenges and Issues section for details.

Training Progress Timeline

| Step | Time             | Reward | Notes              |
|------|------------------|--------|--------------------|
| 100  | 2026-01-20 18:10 | 0.328  | Initial checkpoint |
| 500  | 2026-01-20 19:17 | -      | Early training     |
| 1000 | 2026-01-21 04:08 | -      | -                  |
| 2000 | 2026-01-21 15:13 | -      | Mid-training       |
| 3000 | 2026-01-21 21:20 | -      | -                  |
| 3360 | 2026-01-22 01:19 | 0.420  | Final checkpoint   |

Core Innovations

1. Dual-Verifier PRM (Process Reward Model)

Strategy: Uncertainty Arbitrator with Conservative Fusion

Models:

  • ReasonFlux-PRM-7B: GPU 2
  • Qwen2.5-Math-PRM-7B: GPU 3

Fusion Method:

# Conservative scoring: p_t = min(model1_score, model2_score)
p_t = min(ReasonFlux(C_t), QwenPRM(C_t))

# Prefix truncation: L = max{t | ∀k≤t, p_k ≥ τ_ok}
L = max_step_where_all_scores_above_threshold(tau_ok=0.7)

# Min-form value: v_t = min_{k=t}^{L} p_k
# Aggregation: R_proc = (1/L) * Σ_{t=1}^{L} v_t

Accuracy: 92.22% on validation set
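
The following is a minimal runnable sketch of the conservative fusion, prefix truncation, and min-form aggregation described above. The per-step PRM scores are passed in as plain lists, so the PRM model calls themselves are omitted; function names and the example scores are illustrative only.

def process_reward(scores_reasonflux, scores_qwen_prm, tau_ok=0.7):
    """scores_*: per-step scores p_t in [0, 1] from each PRM for one response."""
    # Conservative per-step fusion: p_t = min of the two PRM scores
    p = [min(a, b) for a, b in zip(scores_reasonflux, scores_qwen_prm)]

    # Prefix truncation: L = longest prefix whose scores all stay >= tau_ok
    L = 0
    for score in p:
        if score < tau_ok:
            break
        L += 1
    if L == 0:
        return 0.0, 0

    # Min-form value v_t = min_{k=t..L} p_k, aggregated as R_proc = mean(v_t)
    values = [min(p[t:L]) for t in range(L)]
    return sum(values) / L, L

# Example: the third step drops below tau_ok, so only the first two steps count.
r_proc, valid_len = process_reward([0.95, 0.88, 0.45, 0.90], [0.90, 0.82, 0.60, 0.70])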

Fusion Strategy Rankings (from 3540-problem experiment):

| Strategy               | Accuracy | F1 Score | Recommendation      |
|------------------------|----------|----------|---------------------|
| uncertainty_arbitrator | 0.9222   | 0.9579   | Best                |
| weighted_0.3_0.7       | 0.9193   | 0.9565   | ✅ Good             |
| geometric_mean         | 0.9174   | 0.9556   | ✅ Good             |
| min                    | 0.6818   | 0.8010   | ❌ Too conservative |

Uncertainty Arbitrator Algorithm:

# p1, p2: scores for the same step from ReasonFlux-PRM and Qwen2.5-Math-PRM
std = standard_deviation([p1, p2])

if std < 0.1:  # High consistency
    p_t = (p1 + p2) / 2  # Arithmetic mean
elif std > 0.3:  # Low consistency (high uncertainty)
    # Conservative weighting: bias toward the lower score
    p_t = 0.3 * max(p1, p2) + 0.7 * min(p1, p2)
else:  # Medium consistency
    # Linear interpolation between the conservative fusion and the arithmetic mean
    conservative_fused = 0.3 * max(p1, p2) + 0.7 * min(p1, p2)
    arithmetic_mean = (p1 + p2) / 2
    p_t = alpha * conservative_fused + (1 - alpha) * arithmetic_mean  # alpha: interpolation weight

Consensus Truncation (recommended):

# Only truncate if BOTH PRMs judge the step to be wrong
if p1 < 0.4 and p2 < 0.4:
    truncate_at_step(t)
else:
    keep_step(t)  # Avoid false positives

2. Progressive Reward System

Core Idea: Reward based on target answer probability increments during reasoning

Algorithm:

# Calculate probability increment at each step
Δq_t = q_t - q_{t-1}

# Segmented reward function
if Δq_t > ε:  # Effective progress
    r_t = α × Δq_t
elif |Δq_t| ≤ ε:  # Stagnation
    r_t = -β
else:  # Regression (Δq_t < -ε)
    r_t = -γ × |Δq_t|

Parameters:

  • ε = 0.01 (progress threshold)
  • α = 0.5 (progress reward coefficient)
  • β = 0.005 (stagnation penalty)
  • γ = 0.1 (regression penalty coefficient)

Optimization: KV Cache enabled for faster probability computation
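
A minimal runnable sketch of the segmented reward above, plugging in the default parameters; q is assumed to be the sequence of target-answer probabilities q_0, q_1, ..., q_T measured after each reasoning step (the probability computation itself, where the KV cache matters, is not shown):

def progressive_rewards(q, eps=0.01, alpha=0.5, beta=0.005, gamma=0.1):
    """q: target-answer probabilities after each step, starting with the prompt-only value."""
    rewards = []
    for t in range(1, len(q)):
        dq = q[t] - q[t - 1]          # probability increment at step t
        if dq > eps:                  # effective progress
            rewards.append(alpha * dq)
        elif abs(dq) <= eps:          # stagnation
            rewards.append(-beta)
        else:                         # regression (dq < -eps)
            rewards.append(-gamma * abs(dq))
    return rewards

# Example: progress, stagnation, then regression.
print(progressive_rewards([0.10, 0.30, 0.31, 0.20]))  # ≈ [0.1, -0.005, -0.011]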

3. Correct Prefix Truncation

Strategy: Truncate at first error step identified by PRM

Implementation:

  • Treat correct prefixes as positive examples
  • Penalize continued generation after the truncation point
  • Prevents the model from learning from incorrect reasoning chains
  • Do not place an EOS token at the truncation point, so the model is not taught to treat the truncated, unfinished reasoning as a complete answer (see the sketch below)
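
A hedged sketch of the truncation logic referenced above; the step segmentation and helper names are assumptions for illustration, not the project's exact implementation:

def truncate_at_first_error(steps, fused_scores, threshold=0.4):
    """steps: reasoning-step strings; fused_scores: fused PRM score per step."""
    for t, score in enumerate(fused_scores):
        if score < threshold:
            kept = steps[:t]      # correct prefix, treated as a positive example
            wasted = steps[t:]    # continued generation past this point is penalized
            return kept, wasted
    return steps, []              # no error found: keep the whole chain

# Note: no EOS token is appended to the kept prefix, so the model is not taught
# to treat the truncated, unfinished reasoning as a complete answer.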

4. Auxiliary Rewards

  • Format Reward (r_fmt): Check for \boxed{} format (weight: 0.1)
  • Waste Penalty (r_waste): Penalize generation after truncation (weight: 0.05)
  • Calibration Penalty (r_cal): Penalize overconfidence (weight: 0.05)

5. Total Reward Formula

R_total = w_proc * R_proc + w_prog * R_prog + w_final * R_final
        + w_fmt * R_fmt + w_waste * R_waste + w_cal * R_cal

Default weights:
- w_final = 1.0   (final answer correctness)
- w_proc  = 0.5   (process reward)
- w_prog  = 0.2   (progressive reward)
- w_fmt   = 0.1   (format reward)
- w_waste = 0.05  (waste penalty)
- w_cal   = 0.05  (calibration penalty)
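
As a sketch, composing the total reward with the default weights could look like this; component values are whatever the individual reward modules produce, and names follow the formula above (this is not the project's reward_manager implementation):

DEFAULT_WEIGHTS = {"final": 1.0, "proc": 0.5, "prog": 0.2,
                   "fmt": 0.1, "waste": 0.05, "cal": 0.05}

def total_reward(components, weights=DEFAULT_WEIGHTS):
    """components: dict such as {"final": 1.0, "proc": 0.82, "prog": 0.05, ...}."""
    return sum(w * components.get(name, 0.0) for name, w in weights.items())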

Project Structure

GRPO/
├── grpo-pro/                         # Main implementation directory
│   ├── configs/                      # Configuration files
│   │   ├── dual_pro_v3.yaml         # Main training config (v3.0)
│   │   └── test_small.yaml          # Small dataset test config
│   ├── src/                         # Source code
│   │   ├── rewards/                 # Reward system (core)
│   │   │   ├── dual_prm.py          # Dual-PRM implementation
│   │   │   ├── progressive.py       # Progressive reward system
│   │   │   ├── reward_manager.py    # Reward orchestration
│   │   │   └── auxiliary.py         # Auxiliary rewards
│   │   ├── data/                    # Data processing
│   │   │   └── math_dataset.py      # Data loading pipeline
│   │   └── metrics/                 # Monitoring metrics
│   │       └── monitoring.py        # Training monitoring
│   ├── scripts/                     # Running scripts
│   │   ├── train_dual_pro.sh        # Single-node training script
│   │   └── train_h200.sh            # H200 training script
│   ├── data/                        # Data directory
│   │   ├── processed/               # Processed datasets
│   │   │   ├── math_valid_2000.parquet  # 1,800 training samples
│   │   │   ├── math_olympiads_aime_train.parquet
│   │   │   └── test/                # Test datasets
│   │   │       ├── Math-500/
│   │   │       ├── AIME_2024/
│   │   │       ├── AMC23/
│   │   │       ├── GPQA_diamond/
│   │   │       └── MinervaMath/
│   │   ├── train/                   # Training data
│   │   └── val/                     # Validation data
│   ├── outputs/                     # Training outputs
│   │   ├── h200_single_node/        # H200 training outputs
│   │   │   └── global_step_*/       # Checkpoints (every 100 steps)
│   │   │       └── global_step_3360/  # Final checkpoint
│   │   └── hf_models/               # HuggingFace format models
│   │       └── global_step_3360/    # Final model in HF format
│   ├── evaluation_results/          # Evaluation summaries
│   │   ├── base_model/              # Baseline model results
│   │   ├── base_model_h100/         # Baseline (H100) results
│   │   ├── step_3360_final/         # Final checkpoint results
│   │   │   └── checkpoint_global_step_3360_accelerated/
│   │   │       ├── Math-500/        # 46.6% accuracy
│   │   │       ├── AIME_2024/       # 0% accuracy
│   │   │       ├── MinervaMath/     # 13.97% accuracy
│   │   │       └── GPQA_diamond/    # 5.56% accuracy
│   ├── logs/                        # Training logs
│   ├── 训练记录_h200/               # H200 training records (Chinese)
│   ├── tensorboard_log/             # TensorBoard logs
│   ├── 训练曲线/                    # Training curve visualizations
│   ├── 训练记录/                    # General training records
│   ├── models/                      # Model directory
│   │   ├── Qwen2.5-7B-Instruct/     # Policy model
│   │   ├── ReasonFlux-7B/           # PRM model 1
│   │   └── Qwen2.5-Math-PRM-7B/     # PRM model 2
│   ├── train_dual_pro.py            # Training entry point
│   ├── train_h200_direct.py         # H200-specific training
│   ├── verify_setup.py              # Environment verification
│   ├── README_DUAL_PRO.md           # Detailed Dual-Pro documentation
│   ├── H200_TRAINING_STATUS.md      # H200 training status report
│   └── 训练指标说明.md              # Training metrics explanation
├── verl/                            # VERL framework (RL framework)
│   ├── verl/
│   │   ├── trainer/                 # GRPO/PPO trainers
│   │   │   └── ppo/                 # PPO/GRPO implementation
│   │   ├── workers/                 # Actor, Rollout, Reward workers
│   │   ├── protocols/               # Data protocols
│   │   └── utils/                   # Utility functions
│   └── examples/                    # Example configurations
├── Math-Verify/                     # Mathematical evaluation framework
│   ├── eval/
│   │   ├── math/                   # Math evaluation
│   │   └── format_utils.py         # Format checking
│   └── test_sets/                   # Test datasets
├── QUICKSTART.md                    # Quick start guide
└── README.md                        # This file

Training Configuration

Hardware

Single Node Training (H200):

  • GPUs: 4× H200 (or A100)
  • Memory: 256 GB RAM
  • Partition: accelerated-h200
  • Time Limit: 48 hours

GPU Allocation:

GPU 0-1: Actor training (FSDP2, 2 GPUs)
GPU 2:   ReasonFlux-PRM-7B
GPU 3:   Qwen2.5-Math-PRM-7B
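
As an illustration of this allocation, the two PRMs could be pinned to their assigned GPUs roughly as below. This is a hedged sketch using the local model paths from the Project Structure section, not the exact loading code used by the training scripts (which go through Ray/VERL workers):

import torch
from transformers import AutoModel

# Load each PRM on its dedicated GPU (cf. allocation above).
reasonflux_prm = AutoModel.from_pretrained(
    "models/ReasonFlux-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:2").eval()

qwen_math_prm = AutoModel.from_pretrained(
    "models/Qwen2.5-Math-PRM-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:3").eval()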

Multi-Node Training (2× H100):

  • Nodes: 2 H100 nodes
  • Total GPUs: 8 (4 per node)
  • Time Limit: 48 hours per session
  • Auto-resume: Yes (from latest checkpoint)

Model Configuration

model:
  base_model: "Qwen/Qwen2.5-7B-Instruct"
  dtype: "bfloat16"
  enable_gradient_checkpointing: true
  use_ema_reference: true
  ema_decay: 0.999

Training Hyperparameters

training:
  learning_rate: 1.0e-6
  min_learning_rate: 1.0e-7
  warmup_ratio: 0.1
  weight_decay: 0.01
  batch_size_per_device: 4
  gradient_accumulation_steps: 4
  total_epochs: 30  # H200 training
  max_grad_norm: 1.0

generation:
  num_generations: 16  # Group Size (G)
  temperature: 0.6
  top_p: 0.95
  max_new_tokens: 1024
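
For orientation, GRPO's advantage is computed relative to the group of num_generations = 16 responses sampled for the same prompt. The project relies on VERL's trainer for this; the snippet below is only an illustrative sketch of the standard group-relative normalization:

import statistics

def group_relative_advantages(group_rewards, eps=1e-6):
    """group_rewards: scalar total rewards for the G responses to one prompt."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]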

Reward Weights

rewards:
  weights:
    proc: 0.5      # Process reward weight
    prog: 0.2      # Progressive reward weight
    final: 1.0     # Final answer reward (dominant)
    fmt: 0.1       # Format reward
    waste: 0.05    # Waste penalty
    cal: 0.05      # Calibration penalty

PRM Configuration

rewards:
  prm:
    fusion_strategy: "uncertainty_arbitrator"  # Best strategy (0.9222 accuracy)
    tau_ok: 0.7                                # Threshold for valid step
    use_vllm: false                            # Use HuggingFace
    truncation_mode: "consensus"               # Consensus truncation
    consensus_threshold: 0.4                   # Both PRMs < 0.4 to truncate

    # Uncertainty Arbitrator parameters
    uncertainty_low_threshold: 0.1             # std < 0.1 → high consistency
    uncertainty_high_threshold: 0.3            # std > 0.3 → low consistency
    conservative_weight: 0.3                   # Conservative fusion weight

Datasets

Training Data Summary (Total: 901,676 samples)

| Dataset       | Samples | Source        | Difficulty  | Usage              |
|---------------|---------|---------------|-------------|--------------------|
| NuminaMath    | 863,760 | AI-MO         | Mixed       | Pretraining        |
| Math3to5      | 18,328  | chenggong1995 | Level 3-5   | Main Training      |
| MATH          | 5,000   | hendrycks     | Level 1-5   | Standard benchmark |
| OlympiadBench | 1,389   | OlympiadBench | Competition | Competition level  |
| GSM8K         | 8,792   | openai        | Easy        | Grade school math  |
| AMC23         | 40      | math-ai       | Hard        | Competition        |
| AIME          | 30      | math-ai       | Very Hard   | Competition        |

Current Training Set

  • File: data/processed/math_valid_2000.parquet
  • Samples: 1,800 problems (filtered from Math3to5)
  • Format: Parquet with columns: problem, solution, answer, uid, source, level, type
  • Source: Math Olympiad problems (Level 3-5)
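
To sanity-check the file and the columns listed above, a quick pandas read (an illustrative snippet; path and column names as stated above):

import pandas as pd

df = pd.read_parquet("data/processed/math_valid_2000.parquet")
print(len(df))                                          # expected: 1,800 rows
print(df[["problem", "answer", "level", "source"]].head())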

Test Sets

Located in data/processed/test/:

  • Math-500/: MATH benchmark (500 samples) - Standard evaluation
  • AIME_2024/: AIME 2024 problems (30 samples) - Competition level
  • math-ai--amc23/: AMC 23 problems (40 samples) - Competition
  • MinervaMath/: Minerva math test set (272 samples) - Research benchmark
  • GPQA_diamond/: GPQA diamond level (198 samples) - Expert level
  • OlympiadBench_maths_en/: Olympiad math problems - Competition

Training Pipeline

1. Data Processing

# Raw JSONL → Preprocessing → Parquet → Train/Val Split
python scripts/preprocess_data.py \
    --input data/raw/math3to5.jsonl \
    --output data/processed/math_valid_2000.parquet \
    --format parquet

2. Training Submission

Single Node (H200):

cd grpo-pro
sbatch scripts/train_h200.sh

Multi-Node (2× H100):

cd grpo-pro
# Submit training job (auto-resume from checkpoint)
sbatch scripts/train_dual_pro_2nodes.sh

3. Training Monitoring

# Check job status
squeue -u $USER

# Real-time training log
tail -f outputs/h200_single_node/train.log

# Check GPU utilization
watch -n 1 nvidia-smi

# View detailed metrics
tail -f 训练记录_h200/training_details_step_*.jsonl

# TensorBoard
tensorboard --logdir tensorboard_log/ --port 6006

4. Evaluation

# Evaluate checkpoint
python scripts/evaluate_math_model.py \
    --checkpoint outputs/hf_models/global_step_3360 \
    --test_file data/processed/test/Math-500 \
    --output evaluation_results/step_3360_final/

# Or use the wrapper script
bash scripts/evaluate_model.sh --checkpoint outputs/hf_models/global_step_3360

Monitored Metrics

Key Metrics

| Metric                       | Meaning                      | Target     | Current Value |
|------------------------------|------------------------------|------------|---------------|
| Pass@1                       | Single-shot accuracy         | ↑ Increase | -             |
| VLR (Valid Length Ratio)     | Effective reasoning length   | ↑ Increase | -             |
| ISR (Ineffective Step Ratio) | Low-quality step ratio       | ↓ Decrease | -             |
| Self-Deception               | Gap: abs(q_final - Pass@1)   | ↓ Decrease | -             |
| R_proc_avg                   | Average process reward       | ↑ Increase | 0.420 (final) |
| R_prog_avg                   | Average progressive reward   | ↑ Increase | -             |
| PRM Agreement                | Correlation between PRMs     | Monitor    | -             |

TensorBoard Visualization

tensorboard --logdir grpo-pro/tensorboard_log --port 6006

Panels:

  • training/loss: Actor loss curve
  • training/reward: Average reward
  • reward/r_proc: Process reward
  • reward/r_prog: Progressive reward
  • metrics/pass_at_1: Pass rate
  • metrics/vlr: Valid length ratio
  • truncation/ratio: Truncation ratio

Challenges and Issues

1. Performance Regression ⚠️ Critical

Problem: Training reward increased by 28%, but test accuracy dropped by 2-13 percentage points across the evaluation sets

Current Status:

  • Training reward: 0.328 → 0.420 (+28%)
  • MATH-500: 59.8% → 46.6% (-13.2%)
  • MinervaMath: 20.2% → 13.97% (-6.23%)
  • GPQA Diamond: 7.6% → 5.56% (-2.04%)

Possible Causes:

  1. Overfitting: Model memorized training patterns
  2. Reward Signal Mismatch: PRM scores don't correlate with actual correctness
  3. Data Leakage: Train/test set overlap
  4. Temperature Mismatch: Training uses temp=0.6, evaluation uses temp=0.0

Investigation Needed:

  • Analyze reward-answer correlation
  • Verify train/test set separation
  • Evaluate with matching temperature (0.6)
  • Check PRM quality on validation set
  • Perform ablation study on reward components
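
For the first item above, a hedged sketch of the reward-answer correlation check, assuming per-sample training rewards and a 0/1 verified-correctness flag have been logged:

import numpy as np

def reward_answer_correlation(rewards, correct):
    """rewards: per-sample total reward; correct: 1 if the final answer verified correct, else 0."""
    rewards = np.asarray(rewards, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.corrcoef(rewards, correct)[0, 1])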

2. Resource Constraints

Issues:

  • OOM (Out of Memory) during training
  • PRM model loading failures
  • Flashinfer compatibility issues
  • Multi-node coordination challenges

Solutions Applied:

  • Reduced batch size and gradient accumulation
  • Disabled flashinfer (use native PyTorch sampling)
  • CPU offloading for optimizer states
  • FSDP2 for efficient memory management
  • Environment variable fixes for Ray/vLLM

3. Training Efficiency

Current: 3-4 days for full dataset training

Optimization Opportunities:

  • Multi-GPU PRM inference
  • Better caching strategies
  • Mixed precision training optimization
  • Distributed data loading

Technical Stack

Frameworks

  • VERL: VolcEngine Reinforcement Learning framework

  • PyTorch: Deep learning framework

    • Version: 2.0+
    • Distributed training (NCCL)
    • bfloat16 mixed precision
  • vLLM: High-throughput LLM inference

    • Version: 0.6.6+
    • Multi-GPU support
    • KV cache optimization
    • Flashinfer disabled (compatibility issues)
  • Transformers: HuggingFace transformers

    • Model loading and tokenization
    • Chat template support
  • Ray: Distributed computing

    • Actor/Rollout worker coordination
    • GPU resource management
    • Multi-node training support

Environment

# Conda environment
conda activate grpo

# Key dependencies
- torch==2.0.1
- transformers>=4.40.0
- verl (custom build from source)
- vllm==0.6.6
- ray>=2.9.0
- pandas
- pyarrow
- numpy
- wandb (logging)
- tensorboard

Evaluation Framework

Math-Verify Integration

Located in Math-Verify/ directory:

Features:

  • Exact match evaluation
  • Format checking (\boxed{})
  • Numerical answer extraction
  • LaTeX expression parsing
  • Support for multiple test formats

Usage:

from MathVerify import eval_math

results = eval_math(
    model_path="outputs/hf_models/global_step_3360",
    test_file="data/processed/test/Math-500",
    batch_size=32,
    max_length=4096,
    temperature=0.0
)

Test Coverage:

  • MATH-500 (standard benchmark)
  • AIME 2024 (competition)
  • AMC 23 (competition)
  • MinervaMath (research benchmark)
  • GPQA Diamond (expert level)

Key Files Reference

Configuration Files

Training Scripts

Source Code

Documentation


Reproducing the Results

Step 1: Environment Setup

# Create conda environment
conda create -n grpo python=3.11 -y
conda activate grpo

# Install dependencies
pip install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install "transformers>=4.40.0"
pip install vllm==0.6.6
pip install "ray>=2.9.0"
pip install tensordict
pip install hydra-core omegaconf
pip install datasets pandas pyarrow
pip install flash-attn --no-build-isolation
pip install wandb tensorboard

# Install VERL framework
cd verl
pip install -e .
cd ..

Step 2: Download Models

# Download base model
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
    --local-dir grpo-pro/models/Qwen2.5-7B-Instruct

# Download PRM models
huggingface-cli download ReasonFlux/ReasonFlux-7B \
    --local-dir grpo-pro/models/ReasonFlux-7B

huggingface-cli download Qwen/Qwen2.5-Math-PRM-7B \
    --local-dir grpo-pro/models/Qwen2.5-Math-PRM-7B

Step 3: Prepare Data

cd grpo-pro
# Data is already processed in data/processed/
# Verify files exist
ls -lh data/processed/math_valid_2000.parquet

Step 4: Run Training

# Submit to SLURM (H200)
sbatch scripts/train_h200.sh

# Or run locally for testing
python train_h200_direct.py --config configs/test_small.yaml

Step 5: Evaluate

# After training completes
python scripts/evaluate_math_model.py \
    --checkpoint outputs/hf_models/global_step_3360 \
    --test_sets Math-500 AIME_2024

Future Work

High Priority

  1. Investigate Performance Regression ⚠️

    • Analyze reward-answer correlation
    • Fix temperature mismatch
    • Verify data integrity
    • Perform ablation studies
  2. Improve Reward Signal

    • Align PRM scores with actual accuracy
    • Add diversity rewards
    • Implement dynamic reward weighting
    • Experiment with different fusion strategies
  3. Regularization

    • Add dropout
    • Implement early stopping
    • Data augmentation
    • Weight decay tuning

Medium Priority

  1. Scale Training

    • Multi-node training improvements
    • Larger batch sizes
    • More training data (full NuminaMath: 863k samples)
    • Curriculum learning
  2. Evaluation

    • More test sets
    • Ablation studies
    • Comparison with baselines
    • Human evaluation

Low Priority

  1. Optimization
    • Faster PRM inference
    • Better caching
    • Gradient checkpointing improvements
    • Memory optimization

Troubleshooting

Common Issues

Q1: OOM during training

# Reduce batch size in config
batch_size_per_device: 2
# Enable CPU offloading
export FSDP_OFFLOAD_OPTIMIZER=1
# Reduce GPU memory utilization
gpu_memory_utilization: 0.3

Q2: PRM loading fails

# Check model paths
ls -lh models/ReasonFlux-7B/
ls -lh models/Qwen2.5-Math-PRM-7B/
# Verify with verify_setup.py
python verify_setup.py
# Force HuggingFace backend
use_vllm: false

Q3: Flashinfer errors

# Disable flashinfer
export VLLM_USE_FLASHINFER_SAMPLER=0
# Or patch with disable_flashinfer.py
python disable_flashinfer.py

Q4: Ray GPU visibility issues

# Allow all processes to see all GPUs
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
# Clear ROCR_VISIBLE_DEVICES
unset ROCR_VISIBLE_DEVICES

Q5: Training not converging

# Adjust learning rate
actor:
  optim:
    lr: 5e-7  # Lower learning rate
# Adjust reward weights
rewards:
  weights:
    final: 2.0  # Increase final answer weight
    proc: 0.3   # Decrease process reward weight

Q6: Multi-node job pending

# Check queue status
squeue -p accelerated-h100 --format="%10i %9P %20j %8u %2t %10M %6D %R"
# Check node resources
sinfo -p accelerated-h100
# Submit during off-peak hours (nights/weekends)

Q7: Resume from checkpoint

# Just resubmit the same script
sbatch scripts/train_dual_pro_2nodes.sh
# Script automatically detects latest checkpoint
ls -lht outputs/dual_pro_2nodes/global_step_*/

Citation

If you use this code or implementation, please cite:

@software{grpo_math_reasoning,
  title={Dual-Pro GRPO: Group Relative Policy Optimization for Mathematical Reasoning},
  author={GRPO Team},
  year={2026},
  url={https://github.com/your-repo/grpo}
}

License and Acknowledgments

  • VERL Framework: VolcEngine Reinforcement Learning (https://github.com/volcengine/verl)
  • Base Model: Qwen2.5-7B-Instruct (Alibaba)
  • PRM Models: ReasonFlux-7B, Qwen2.5-Math-PRM-7B
  • Evaluation: Math-Verify framework
  • Inspiration: DeepSeek-Math, OpenAI's Process Reward Models

Contact

Project Maintainer: tum_fmp0582@hk-project

Project Location:

/hkfs/work/workspace/scratch/tum_fmp0582-my_rl_ws/tum_fmp0582-dndworkspace-1766456703/GRPO

Last Updated: January 22, 2026


Appendix: Training Commands Quick Reference

# === Environment ===
conda activate grpo
cd grpo-pro

# === Training ===
sbatch scripts/train_h200.sh                    # Submit H200 training
sbatch scripts/train_dual_pro_2nodes.sh         # Submit 2-node training
squeue -u $USER                                 # Check job status

# === Monitoring ===
tail -f outputs/h200_single_node/train.log     # Monitor training
tail -f 训练记录_h200/training_details_step_*.jsonl  # Detailed logs
tensorboard --logdir tensorboard_log/           # TensorBoard
watch -n 1 nvidia-smi                          # GPU usage

# === Evaluation ===
bash scripts/evaluate_model.sh --base           # Evaluate baseline
bash scripts/evaluate_model.sh --checkpoint outputs/hf_models/global_step_3360

# === Debug ===
python verify_setup.py                         # Verify environment
bash check_training.sh                         # Check training status

Note: This project represents ongoing research into improving mathematical reasoning in LLMs through reinforcement learning. The performance regression observed in the latest training run highlights the complexity of aligning reward signals with actual task performance and is an active area of investigation. Contributions and suggestions are welcome!
