GRPO: Group Relative Policy Optimization for Mathematical Reasoning
Project Overview
This project implements Dual-Pro GRPO (Group Relative Policy Optimization), an advanced reinforcement learning framework designed to enhance large language models' mathematical reasoning capabilities through innovative reward mechanisms and training strategies.
Key Innovation: Dual-Process Reward Models with Progressive Rewards and Correct Prefix Truncation
Current Status (January 22, 2026)
Latest Training Results ✅
- Final Checkpoint: global_step_3360 (H200 training completed)
- Training Reward: 0.420 (+28% improvement over the initial 0.328)
- Training Dataset: 1,800 math Olympiad problems (Math3to5)
- Training Duration: ~48 hours on 4× H200 GPUs
- Training Date: January 20-22, 2026
Evaluation Results (Step 3360)
| Test Set | Samples | Accuracy | Baseline | Change |
|---|---|---|---|---|
| MATH-500 | 500 | 46.6% | 59.8% | -13.2% ⚠️ |
| MinervaMath | 272 | 13.97% | 20.2% | -6.23% ⚠️ |
| GPQA Diamond | 198 | 5.56% | 7.6% | -2.04% ⚠️ |
| AMC 23 | 40 | 0.0% | 0.0% | 0% |
| AIME 2024 | 30 | 0.0% | 0.0% | 0% |
⚠️ Known Issue: Despite the +28% improvement in training reward, test accuracy dropped by roughly 2 to 13 percentage points across benchmarks. This points to potential overfitting or a reward-signal mismatch. See the Challenges section for details.
Training Progress Timeline
| Step | Time | Reward | Notes |
|---|---|---|---|
| 100 | 2026-01-20 18:10 | 0.328 | Initial checkpoint |
| 500 | 2026-01-20 19:17 | - | Early training |
| 1000 | 2026-01-21 04:08 | - | - |
| 2000 | 2026-01-21 15:13 | - | Mid-training |
| 3000 | 2026-01-21 21:20 | - | - |
| 3360 | 2026-01-22 01:19 | 0.420 | Final checkpoint |
Core Innovations
1. Dual-Verifier PRM (Process Reward Model)
Strategy: Uncertainty Arbitrator with Conservative Fusion
Models:
- ReasonFlux-PRM-7B: GPU 2
- Qwen2.5-Math-PRM-7B: GPU 3
Fusion Method:
# Conservative scoring: p_t = min(model1_score, model2_score)
p_t = min(ReasonFlux(C_t), QwenPRM(C_t))
# Prefix truncation: L = max{t | ∀k≤t, p_k ≥ τ_ok}
L = max_step_where_all_scores_above_threshold(tau_ok=0.7)
# Min-form value: v_t = min_{k=t}^{L} p_k
# Aggregation: R_proc = (1/L) * Σ_{t=1}^{L} v_t
Accuracy: 92.22% on validation set
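The conservative fusion, prefix truncation, and min-form aggregation above can be written compactly. Below is a minimal sketch assuming the two PRMs have already produced per-step scores; the function name and inputs are illustrative, not the project's actual API:

```python
def process_reward(scores_rf, scores_qw, tau_ok=0.7):
    """Conservative Dual-PRM process reward: min fusion, prefix truncation,
    min-form step values, and mean aggregation over the valid prefix."""
    # Conservative per-step scores: p_t = min(ReasonFlux(C_t), QwenPRM(C_t))
    p = [min(a, b) for a, b in zip(scores_rf, scores_qw)]

    # Longest prefix whose steps all clear the threshold:
    # L = max{t | for all k <= t, p_k >= tau_ok}
    L = 0
    for score in p:
        if score >= tau_ok:
            L += 1
        else:
            break
    if L == 0:
        return 0.0  # no valid prefix, no process reward

    # Min-form value v_t = min_{k=t}^{L} p_k, then R_proc = (1/L) * sum_t v_t
    v = [min(p[t:L]) for t in range(L)]
    return sum(v) / L

# Example: two PRMs score a 5-step chain; step 4 falls below tau_ok
r_proc = process_reward([0.9, 0.85, 0.8, 0.5, 0.9], [0.95, 0.8, 0.75, 0.6, 0.9])
```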
Fusion Strategy Rankings (from 3540-problem experiment):
| Strategy | Accuracy | F1 Score | Recommendation |
|---|---|---|---|
| uncertainty_arbitrator | 0.9222 | 0.9579 | ✅ Best |
| weighted_0.3_0.7 | 0.9193 | 0.9565 | ✅ Good |
| geometric_mean | 0.9174 | 0.9556 | ✅ Good |
| min | 0.6818 | 0.8010 | ❌ Too conservative |
Uncertainty Arbitrator Algorithm:
std = standard_deviation([p1, p2])      # disagreement between the two PRM scores
if std < 0.1:      # high consistency
    p_t = (p1 + p2) / 2                 # arithmetic mean
elif std > 0.3:    # low consistency (high uncertainty)
    # conservative weighting: bias toward the lower score
    p_t = 0.3 * max(p1, p2) + 0.7 * min(p1, p2)
else:              # medium consistency
    # linear interpolation between the conservative fusion and the mean,
    # with alpha increasing as the disagreement grows (e.g. alpha = (std - 0.1) / 0.2)
    conservative = 0.3 * max(p1, p2) + 0.7 * min(p1, p2)
    arithmetic = (p1 + p2) / 2
    p_t = alpha * conservative + (1 - alpha) * arithmetic
Consensus Truncation (recommended):
# Only truncate if BOTH PRMs judge the step as wrong
if p1 < 0.4 and p2 < 0.4:
    truncate_at_step(t)
else:
    keep_step(t)   # avoid false-positive truncation
2. Progressive Reward System
Core Idea: Reward based on target answer probability increments during reasoning
Algorithm:
# Calculate probability increment at each step
Δq_t = q_t - q_{t-1}
# Segmented reward function
if Δq_t > ε: # Effective progress
r_t = α × Δq_t
elif |Δq_t| ≤ ε: # Stagnation
r_t = -β
else: # Regression (Δq_t < -ε)
r_t = -γ × |Δq_t|
Parameters:
- ε = 0.01 (progress threshold)
- α = 0.5 (progress reward coefficient)
- β = 0.005 (stagnation penalty)
- γ = 0.1 (regression penalty coefficient)
Optimization: KV Cache enabled for faster probability computation
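A minimal sketch of this segmented reward with the parameters above (the function name is illustrative):

```python
def progressive_reward(q_prev, q_curr, eps=0.01, alpha=0.5, beta=0.005, gamma=0.1):
    """Segmented reward on the change in target-answer probability."""
    delta = q_curr - q_prev
    if delta > eps:            # effective progress
        return alpha * delta
    elif abs(delta) <= eps:    # stagnation
        return -beta
    else:                      # regression (delta < -eps)
        return -gamma * abs(delta)

# Example: probability of the target answer rises from 0.30 to 0.38
r_t = progressive_reward(0.30, 0.38)   # 0.5 * 0.08 = 0.04
```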
3. Correct Prefix Truncation
Strategy: Truncate at first error step identified by PRM
Implementation:
- Treat correct prefixes as positive examples
- Penalize continued generation after truncation point
- Prevents model from learning from incorrect reasoning chains
- Avoid placing EOS token at truncation point (model learns "unfinished" state)
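One way these rules can be applied at reward time is to mask everything after the first PRM-flagged error step and leave the sequence without an EOS at the cut. The following is a simplified sketch under that assumption, not the project's actual training-loop code:

```python
def truncation_mask(step_token_spans, first_error_step, seq_len):
    """Keep tokens of the correct prefix; mask tokens after the truncation point.

    step_token_spans: list of (start, end) token indices for each reasoning step.
    first_error_step: index of the first step flagged wrong by the PRMs (or None).
    """
    mask = [1] * seq_len
    if first_error_step is not None:
        cut = step_token_spans[first_error_step][0]   # truncate at the start of the bad step
        for i in range(cut, seq_len):
            mask[i] = 0   # tokens after the cut are excluded from the positive signal
        # Note: no EOS is inserted at `cut`, so the prefix is treated as unfinished.
    return mask
```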
4. Auxiliary Rewards
- Format Reward (r_fmt): Check for \boxed{} format (weight: 0.1)
- Waste Penalty (r_waste): Penalize generation after truncation (weight: 0.05)
- Calibration Penalty (r_cal): Penalize overconfidence (weight: 0.05)
5. Total Reward Formula
R_total = w_proc * R_proc + w_prog * R_prog + w_final * R_final
+ w_fmt * R_fmt + w_waste * R_waste + w_cal * R_cal
Default weights:
- w_final = 1.0 (final answer correctness)
- w_proc = 0.5 (process reward)
- w_prog = 0.2 (progressive reward)
- w_fmt = 0.1 (format reward)
- w_waste = 0.05 (waste penalty)
- w_cal = 0.05 (calibration penalty)
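With these weights, the total reward is a plain weighted sum; the dictionary keys below mirror the config's weight names, and the component values are illustrative:

```python
WEIGHTS = {"final": 1.0, "proc": 0.5, "prog": 0.2, "fmt": 0.1, "waste": 0.05, "cal": 0.05}

def total_reward(components, weights=WEIGHTS):
    """R_total = sum_k w_k * R_k over the reward components."""
    return sum(weights[k] * components[k] for k in weights)

# Example with illustrative component values
r = total_reward({"final": 1.0, "proc": 0.75, "prog": 0.04,
                  "fmt": 1.0, "waste": -0.2, "cal": 0.0})
```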
Project Structure
GRPO/
├── grpo-pro/ # Main implementation directory
│ ├── configs/ # Configuration files
│ │ ├── dual_pro_v3.yaml # Main training config (v3.0)
│ │ └── test_small.yaml # Small dataset test config
│ ├── src/ # Source code
│ │ ├── rewards/ # Reward system (core)
│ │ │ ├── dual_prm.py # Dual-PRM implementation
│ │ │ ├── progressive.py # Progressive reward system
│ │ │ ├── reward_manager.py # Reward orchestration
│ │ │ └── auxiliary.py # Auxiliary rewards
│ │ ├── data/ # Data processing
│ │ │ └── math_dataset.py # Data loading pipeline
│ │ └── metrics/ # Monitoring metrics
│ │ └── monitoring.py # Training monitoring
│ ├── scripts/ # Running scripts
│ │ ├── train_dual_pro.sh # Single-node training script
│ │ └── train_h200.sh # H200 training script
│ ├── data/ # Data directory
│ │ ├── processed/ # Processed datasets
│ │ │ ├── math_valid_2000.parquet # 1,800 training samples
│ │ │ ├── math_olympiads_aime_train.parquet
│ │ │ └── test/ # Test datasets
│ │ │ ├── Math-500/
│ │ │ ├── AIME_2024/
│ │ │ ├── AMC23/
│ │ │ ├── GPQA_diamond/
│ │ │ └── MinervaMath/
│ │ ├── train/ # Training data
│ │ └── val/ # Validation data
│ ├── outputs/ # Training outputs
│ │ ├── h200_single_node/ # H200 training outputs
│ │ │ └── global_step_*/ # Checkpoints (every 100 steps)
│ │ │ └── global_step_3360/ # Final checkpoint
│ │ └── hf_models/ # HuggingFace format models
│ │ └── global_step_3360/ # Final model in HF format
│ ├── evaluation_results/ # Evaluation summaries
│ │ ├── base_model/ # Baseline model results
│ │ ├── base_model_h100/ # Baseline (H100) results
│ │ ├── step_3360_final/ # Final checkpoint results
│ │ │ └── checkpoint_global_step_3360_accelerated/
│ │ │ ├── Math-500/ # 46.6% accuracy
│ │ │ ├── AIME_2024/ # 0% accuracy
│ │ │ ├── MinervaMath/ # 13.97% accuracy
│ │ │ └── GPQA_diamond/ # 5.56% accuracy
│ ├── logs/ # Training logs
│ ├── 训练记录_h200/ # H200 training records (Chinese)
│ ├── tensorboard_log/ # TensorBoard logs
│ ├── 训练曲线/ # Training curve visualizations
│ ├── 训练记录/ # General training records
│ ├── models/ # Model directory
│ │ ├── Qwen2.5-7B-Instruct/ # Policy model
│ │ ├── ReasonFlux-7B/ # PRM model 1
│ │ └── Qwen2.5-Math-PRM-7B/ # PRM model 2
│ ├── train_dual_pro.py # Training entry point
│ ├── train_h200_direct.py # H200-specific training
│ ├── verify_setup.py # Environment verification
│ ├── README_DUAL_PRO.md # Detailed Dual-Pro documentation
│ ├── H200_TRAINING_STATUS.md # H200 training status report
│ └── 训练指标说明.md # Training metrics explanation
├── verl/ # VERL framework (RL framework)
│ ├── verl/
│ │ ├── trainer/ # GRPO/PPO trainers
│ │ │ └── ppo/ # PPO/GRPO implementation
│ │ ├── workers/ # Actor, Rollout, Reward workers
│ │ ├── protocols/ # Data protocols
│ │ └── utils/ # Utility functions
│ └── examples/ # Example configurations
├── Math-Verify/ # Mathematical evaluation framework
│ ├── eval/
│ │ ├── math/ # Math evaluation
│ │ └── format_utils.py # Format checking
│ └── test_sets/ # Test datasets
├── QUICKSTART.md # Quick start guide
└── README.md # This file
Training Configuration
Hardware
Single Node Training (H200):
- GPUs: 4× H200 (or A100)
- Memory: 256 GB RAM
- Partition: accelerated-h200
- Time Limit: 48 hours
GPU Allocation:
GPU 0-1: Actor training (FSDP2, 2 GPUs)
GPU 2: ReasonFlux-PRM-7B
GPU 3: Qwen2.5-Math-PRM-7B
Multi-Node Training (2× H100):
- Nodes: 2 H100 nodes
- Total GPUs: 8 (4 per node)
- Time Limit: 48 hours per session
- Auto-resume: Yes (from latest checkpoint)
Model Configuration
model:
base_model: "Qwen/Qwen2.5-7B-Instruct"
dtype: "bfloat16"
enable_gradient_checkpointing: true
use_ema_reference: true
ema_decay: 0.999
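The use_ema_reference / ema_decay settings suggest the reference policy is maintained as an exponential moving average of the actor. A minimal sketch of such an update, assuming that interpretation (not the exact VERL implementation):

```python
import torch

@torch.no_grad()
def update_ema_reference(ref_model, policy_model, decay=0.999):
    """theta_ref <- decay * theta_ref + (1 - decay) * theta_policy."""
    for p_ref, p_pol in zip(ref_model.parameters(), policy_model.parameters()):
        p_ref.mul_(decay).add_(p_pol, alpha=1.0 - decay)
```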
Training Hyperparameters
training:
learning_rate: 1.0e-6
min_learning_rate: 1.0e-7
warmup_ratio: 0.1
weight_decay: 0.01
batch_size_per_device: 4
gradient_accumulation_steps: 4
total_epochs: 30 # H200 training
max_grad_norm: 1.0
generation:
num_generations: 16 # Group Size (G)
temperature: 0.6
top_p: 0.95
max_new_tokens: 1024
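num_generations is the GRPO group size G: each prompt is rolled out 16 times and each completion's reward is normalized within its own group. A sketch of the standard group-relative advantage (the project's std floor and other details may differ):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each completion's reward within its prompt group."""
    r = np.asarray(rewards, dtype=np.float64)   # shape (G,): rewards for one prompt
    return (r - r.mean()) / (r.std() + eps)

# Example with a group of 4 completions (G is 16 in the actual config)
adv = group_relative_advantages([1.47, 0.32, 0.95, 0.10])
```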
Reward Weights
rewards:
weights:
proc: 0.5 # Process reward weight
prog: 0.2 # Progressive reward weight
final: 1.0 # Final answer reward (dominant)
fmt: 0.1 # Format reward
waste: 0.05 # Waste penalty
cal: 0.05 # Calibration penalty
PRM Configuration
rewards:
prm:
fusion_strategy: "uncertainty_arbitrator" # Best strategy (0.9222 accuracy)
tau_ok: 0.7 # Threshold for valid step
use_vllm: false # Use HuggingFace
truncation_mode: "consensus" # Consensus truncation
consensus_threshold: 0.4 # Both PRMs < 0.4 to truncate
# Uncertainty Arbitrator parameters
uncertainty_low_threshold: 0.1 # std < 0.1 → high consistency
uncertainty_high_threshold: 0.3 # std > 0.3 → low consistency
conservative_weight: 0.3 # Conservative fusion weight
Datasets
Training Data Summary (Total: 901,676 samples)
| Dataset | Samples | Source | Difficulty | Usage |
|---|---|---|---|---|
| NuminaMath | 863,760 | AI-MO | Mixed | Pretraining |
| Math3to5 | 18,328 | chenggong1995 | Level 3-5 | Main Training |
| MATH | 5,000 | hendrycks | Level 1-5 | Standard benchmark |
| OlympiadBench | 1,389 | OlympiadBench | Competition | Competition level |
| GSM8K | 8,792 | openai | Easy | Grade school math |
| AMC23 | 40 | math-ai | Hard | Competition |
| AIME | 30 | math-ai | Very Hard | Competition |
Current Training Set
- File: data/processed/math_valid_2000.parquet
- Samples: 1,800 problems (filtered from Math3to5)
- Format: Parquet with columns: problem, solution, answer, uid, source, level, type
- Source: Math Olympiad problems (Level 3-5)
Test Sets
Located in data/processed/test/:
- Math-500/: MATH benchmark (500 samples) - Standard evaluation
- AIME_2024/: AIME 2024 problems (30 samples) - Competition level
- math-ai--amc23/: AMC 23 problems (40 samples) - Competition
- MinervaMath/: Minerva math test set (272 samples) - Research benchmark
- GPQA_diamond/: GPQA diamond level (198 samples) - Expert level
- OlympiadBench_maths_en/: Olympiad math problems - Competition
Training Pipeline
1. Data Processing
# Raw JSONL → Preprocessing → Parquet → Train/Val Split
python scripts/preprocess_data.py \
--input data/raw/math3to5.jsonl \
--output data/processed/math_valid_2000.parquet \
--format parquet
2. Training Submission
Single Node (H200):
cd grpo-pro
sbatch scripts/train_h200.sh
Multi-Node (2× H100):
cd grpo-pro
# Submit training job (auto-resume from checkpoint)
sbatch scripts/train_dual_pro_2nodes.sh
3. Training Monitoring
# Check job status
squeue -u $USER
# Real-time training log
tail -f outputs/h200_single_node/train.log
# Check GPU utilization
watch -n 1 nvidia-smi
# View detailed metrics
tail -f 训练记录_h200/training_details_step_*.jsonl
# TensorBoard
tensorboard --logdir tensorboard_log/ --port 6006
4. Evaluation
# Evaluate checkpoint
python scripts/evaluate_math_model.py \
--checkpoint outputs/hf_models/global_step_3360 \
--test_file data/processed/test/Math-500 \
--output evaluation_results/step_3360_final/
# Or use the wrapper script
bash scripts/evaluate_model.sh --checkpoint outputs/hf_models/global_step_3360
Monitored Metrics
Key Metrics
| Metric | Meaning | Target | Current Value |
|---|---|---|---|
| Pass@1 | Single-shot accuracy | ↑ Increase | - |
| VLR (Valid Length Ratio) | Effective reasoning length | ↑ Increase | - |
| ISR (Ineffective Step Ratio) | Low-quality step ratio | ↓ Decrease | - |
| Self-Deception | Gap: \|q_final - Pass@1\| | ↓ Decrease | - |
| R_proc_avg | Average process reward | ↑ Increase | 0.420 (final) |
| R_prog_avg | Average progressive reward | ↑ Increase | - |
| PRM Agreement | Correlation between PRMs | → Monitor | - |
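As a reference point, the length and calibration metrics above reduce to simple ratios and gaps; a sketch under the definitions in the table (the exact formulas in src/metrics/monitoring.py may differ):

```python
def compute_metrics(valid_len, total_len, n_ineffective, n_steps, q_final, pass_at_1):
    """Length and calibration metrics for a batch of trajectories."""
    return {
        "vlr": valid_len / total_len,                 # Valid Length Ratio
        "isr": n_ineffective / n_steps,               # Ineffective Step Ratio
        "self_deception": abs(q_final - pass_at_1),   # confidence vs. actual accuracy gap
    }
```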
TensorBoard Visualization
tensorboard --logdir grpo-pro/tensorboard_log --port 6006
Panels:
- training/loss: Actor loss curve
- training/reward: Average reward
- reward/r_proc: Process reward
- reward/r_prog: Progressive reward
- metrics/pass_at_1: Pass rate
- metrics/vlr: Valid length ratio
- truncation/ratio: Truncation ratio
Challenges and Issues
1. Performance Regression ⚠️ Critical
Problem: Training reward increased by 28%, but test accuracy dropped by roughly 2 to 13 percentage points
Current Status:
- Training reward: 0.328 → 0.420 (+28%)
- MATH-500: 59.8% → 46.6% (-13.2%)
- MinervaMath: 20.2% → 13.97% (-6.23%)
- GPQA Diamond: 7.6% → 5.56% (-2.04%)
Possible Causes:
- Overfitting: Model memorized training patterns
- Reward Signal Mismatch: PRM scores don't correlate with actual correctness
- Data Leakage: Train/test set overlap
- Temperature Mismatch: Training uses temp=0.6, evaluation uses temp=0.0
Investigation Needed:
- Analyze reward-answer correlation
- Verify train/test set separation
- Evaluate with matching temperature (0.6)
- Check PRM quality on validation set
- Perform ablation study on reward components
2. Resource Constraints
Issues:
- OOM (Out of Memory) during training
- PRM model loading failures
- flashinfer compatibility issues
- Multi-node coordination challenges
Solutions Applied:
- Reduced batch size and gradient accumulation
- Disabled flashinfer (use native PyTorch sampling)
- CPU offloading for optimizer states
- FSDP2 for efficient memory management
- Environment variable fixes for Ray/vLLM
3. Training Efficiency
Current: 3-4 days for full dataset training
Optimization Opportunities:
- Multi-GPU PRM inference
- Better caching strategies
- Mixed precision training optimization
- Distributed data loading
Technical Stack
Frameworks
VERL: VolcEngine Reinforcement Learning framework
- Custom GRPO trainer implementation
- FSDP2 distributed training
- vLLM for fast inference
- GitHub: https://github.com/volcengine/verl
PyTorch: Deep learning framework
- Version: 2.0+
- Distributed training (NCCL)
- bfloat16 mixed precision
vLLM: High-throughput LLM inference
- Version: 0.6.6+
- Multi-GPU support
- KV cache optimization
- Flashinfer disabled (compatibility issues)
Transformers: HuggingFace transformers
- Model loading and tokenization
- Chat template support
Ray: Distributed computing
- Actor/Rollout worker coordination
- GPU resource management
- Multi-node training support
Environment
# Conda environment
conda activate grpo
# Key dependencies
- torch==2.0.1
- transformers>=4.40.0
- verl (custom build from source)
- vllm==0.6.6
- ray>=2.9.0
- pandas
- pyarrow
- numpy
- wandb (logging)
- tensorboard
Evaluation Framework
Math-Verify Integration
Located in Math-Verify/ directory:
Features:
- Exact match evaluation
- Format checking (\boxed{})
- Numerical answer extraction
- LaTeX expression parsing
- Support for multiple test formats
Usage:
from MathVerify import eval_math
results = eval_math(
model_path="outputs/hf_models/global_step_3360",
test_file="data/processed/test/Math-500",
batch_size=32,
max_length=4096,
temperature=0.0
)
Test Coverage:
- MATH-500 (standard benchmark)
- AIME 2024 (competition)
- AMC 23 (competition)
- MinervaMath (research benchmark)
- GPQA Diamond (expert level)
Key Files Reference
Configuration Files
- grpo-pro/configs/dual_pro_v3.yaml - Main training configuration
- grpo-pro/configs/test_small.yaml - Small dataset test
Training Scripts
- grpo-pro/train_dual_pro.py - Main training entry point
- grpo-pro/train_h200_direct.py - H200-specific training
- grpo-pro/scripts/train_dual_pro.sh - SLURM submission script
Source Code
- grpo-pro/src/rewards/dual_prm.py - Dual-PRM implementation
- grpo-pro/src/rewards/progressive.py - Progressive reward system
- grpo-pro/src/rewards/reward_manager.py - Reward orchestration
Documentation
- grpo-pro/README_DUAL_PRO.md - Dual-Pro system documentation
- grpo-pro/H200_TRAINING_STATUS.md - H200 training status
- grpo-pro/训练指标说明.md - Training metrics explanation (Chinese)
Reproducing the Results
Step 1: Environment Setup
# Create conda environment
conda create -n grpo python=3.11 -y
conda activate grpo
# Install dependencies
pip install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install "transformers>=4.40.0"
pip install vllm==0.6.6
pip install "ray>=2.9.0"
pip install tensordict
pip install hydra-core omegaconf
pip install datasets pandas pyarrow
pip install flash-attn --no-build-isolation
pip install wandb tensorboard
# Install VERL framework
cd verl
pip install -e .
cd ..
Step 2: Download Models
# Download base model
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
--local-dir grpo-pro/models/Qwen2.5-7B-Instruct
# Download PRM models
huggingface-cli download ReasonFlux/ReasonFlux-7B \
--local-dir grpo-pro/models/ReasonFlux-7B
huggingface-cli download Qwen/Qwen2.5-Math-PRM-7B \
--local-dir grpo-pro/models/Qwen2.5-Math-PRM-7B
Step 3: Prepare Data
cd grpo-pro
# Data is already processed in data/processed/
# Verify files exist
ls -lh data/processed/math_valid_2000.parquet
Step 4: Run Training
# Submit to SLURM (H200)
sbatch scripts/train_h200.sh
# Or run locally for testing
python train_h200_direct.py --config configs/test_small.yaml
Step 5: Evaluate
# After training completes
python scripts/evaluate_math_model.py \
--checkpoint outputs/hf_models/global_step_3360 \
--test_sets Math-500 AIME_2024
Future Work
High Priority
Investigate Performance Regression ⚠️
- Analyze reward-answer correlation
- Fix temperature mismatch
- Verify data integrity
- Perform ablation studies
Improve Reward Signal
- Align PRM scores with actual accuracy
- Add diversity rewards
- Implement dynamic reward weighting
- Experiment with different fusion strategies
Regularization
- Add dropout
- Implement early stopping
- Data augmentation
- Weight decay tuning
Medium Priority
Scale Training
- Multi-node training improvements
- Larger batch sizes
- More training data (full NuminaMath: 863k samples)
- Curriculum learning
Evaluation
- More test sets
- Ablation studies
- Comparison with baselines
- Human evaluation
Low Priority
- Optimization
- Faster PRM inference
- Better caching
- Gradient checkpointing improvements
- Memory optimization
Troubleshooting
Common Issues
Q1: OOM during training
# Reduce batch size in config
batch_size_per_device: 2
# Enable CPU offloading
export FSDP_OFFLOAD_OPTIMIZER=1
# Reduce GPU memory utilization
gpu_memory_utilization: 0.3
Q2: PRM loading fails
# Check model paths
ls -lh models/ReasonFlux-7B/
ls -lh models/Qwen2.5-Math-PRM-7B/
# Verify with verify_setup.py
python verify_setup.py
# Force HuggingFace backend
use_vllm: false
Q3: Flashinfer errors
# Disable flashinfer
export VLLM_USE_FLASHINFER_SAMPLER=0
# Or patch with disable_flashinfer.py
python disable_flashinfer.py
Q4: Ray GPU visibility issues
# Allow all processes to see all GPUs
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
# Clear ROCR_VISIBLE_DEVICES
unset ROCR_VISIBLE_DEVICES
Q5: Training not converging
# Adjust learning rate
actor:
optim:
lr: 5e-7 # Lower learning rate
# Adjust reward weights
rewards:
weights:
final: 2.0 # Increase final answer weight
proc: 0.3 # Decrease process reward weight
Q6: Multi-node job pending
# Check queue status
squeue -p accelerated-h100 --format="%10i %9P %20j %8u %2t %10M %6D %R"
# Check node resources
sinfo -p accelerated-h100
# Submit during off-peak hours (nights/weekends)
Q7: Resume from checkpoint
# Just resubmit the same script
sbatch scripts/train_dual_pro_2nodes.sh
# Script automatically detects latest checkpoint
ls -lht outputs/dual_pro_2nodes/global_step_*/
Citation
If you use this code or implementation, please cite:
@software{grpo_math_reasoning,
title={Dual-Pro GRPO: Group Relative Policy Optimization for Mathematical Reasoning},
author={GRPO Team},
year={2026},
url={https://github.com/your-repo/grpo}
}
License and Acknowledgments
- VERL Framework: VolcEngine Reinforcement Learning (https://github.com/volcengine/verl)
- Base Model: Qwen2.5-7B-Instruct (Alibaba)
- PRM Models: ReasonFlux-7B, Qwen2.5-Math-PRM-7B
- Evaluation: Math-Verify framework
- Inspiration: DeepSeek-Math, OpenAI's Process Reward Models
Contact
Project Maintainer: tum_fmp0582@hk-project
Project Location:
/hkfs/work/workspace/scratch/tum_fmp0582-my_rl_ws/tum_fmp0582-dndworkspace-1766456703/GRPO
Last Updated: January 22, 2026
Appendix: Training Commands Quick Reference
# === Environment ===
conda activate grpo
cd grpo-pro
# === Training ===
sbatch scripts/train_h200.sh # Submit H200 training
sbatch scripts/train_dual_pro_2nodes.sh # Submit 2-node training
squeue -u $USER # Check job status
# === Monitoring ===
tail -f outputs/h200_single_node/train.log # Monitor training
tail -f 训练记录_h200/training_details_step_*.jsonl # Detailed logs
tensorboard --logdir tensorboard_log/ # TensorBoard
watch -n 1 nvidia-smi # GPU usage
# === Evaluation ===
bash scripts/evaluate_model.sh --base # Evaluate baseline
bash scripts/evaluate_model.sh --checkpoint outputs/hf_models/global_step_3360
# === Debug ===
python verify_setup.py # Verify environment
bash check_training.sh # Check training status
Note: This project represents ongoing research into improving mathematical reasoning in LLMs through reinforcement learning. The performance regression observed in the latest training run highlights the complexity of aligning reward signals with actual task performance and is an active area of investigation. Contributions and suggestions are welcome!