GRPO: Group Relative Policy Optimization for Mathematical Reasoning
Project Overview
This project implements Dual-Pro GRPO (Group Relative Policy Optimization), an advanced reinforcement learning framework designed to enhance large language models' mathematical reasoning capabilities through innovative reward mechanisms and training strategies.
Key Innovation: Dual-Process Reward Models with Progressive Rewards and Correct Prefix Truncation
Current Status (January 22, 2026)
Latest Training Results ✅
- Final Checkpoint: global_step_3360 (H200 training completed)
- Training Reward: 0.420 (+28% improvement over the initial 0.328)
- Training Dataset: 1,800 math Olympiad problems (Math3to5)
- Training Duration: ~48 hours on 4× H200 GPUs
- Training Date: January 20-22, 2026
Evaluation Results (Step 3360)
| Test Set | Samples | Accuracy | Baseline | Change |
|---|---|---|---|---|
| MATH-500 | 500 | 46.6% | 59.8% | -13.2% ⚠️ |
| MinervaMath | 272 | 13.97% | 20.2% | -6.23% ⚠️ |
| GPQA Diamond | 198 | 5.56% | 7.6% | -2.04% ⚠️ |
| AMC 23 | 40 | 0.0% | 0.0% | 0% |
| AIME 2024 | 30 | 0.0% | 0.0% | 0% |
⚠️ Known Issue: Despite the +28% improvement in training reward, test accuracy dropped by roughly 2 to 13 percentage points across benchmarks. This points to potential overfitting or a reward-signal mismatch. See the Challenges section for details.
Training Progress Timeline
| Step | Time | Reward | Notes |
|---|---|---|---|
| 100 | 2026-01-20 18:10 | 0.328 | Initial checkpoint |
| 500 | 2026-01-20 19:17 | - | Early training |
| 1000 | 2026-01-21 04:08 | - | - |
| 2000 | 2026-01-21 15:13 | - | Mid-training |
| 3000 | 2026-01-21 21:20 | - | - |
| 3360 | 2026-01-22 01:19 | 0.420 | Final checkpoint |
Core Innovations
1. Dual-Verifier PRM (Process Reward Model)
Strategy: Uncertainty Arbitrator with Conservative Fusion
Models:
- ReasonFlux-PRM-7B: GPU 2
- Qwen2.5-Math-PRM-7B: GPU 3
Fusion Method:
# Conservative scoring: p_t = min(model1_score, model2_score)
p_t = min(ReasonFlux(C_t), QwenPRM(C_t))
# Prefix truncation: L = max{t | ∀k≤t, p_k ≥ τ_ok}
L = max_step_where_all_scores_above_threshold(tau_ok=0.7)
# Min-form value: v_t = min_{k=t}^{L} p_k
# Aggregation: R_proc = (1/L) * Σ_{t=1}^{L} v_t
Accuracy: 92.22% on validation set
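The conservative fusion, prefix truncation, and min-form aggregation above can be written compactly. Below is a minimal sketch assuming the two PRMs have already produced per-step scores; the function name and inputs are illustrative, not the project's actual API:

```python
def process_reward(scores_rf, scores_qw, tau_ok=0.7):
    """Conservative Dual-PRM process reward: min fusion, prefix truncation,
    min-form step values, and mean aggregation over the valid prefix."""
    # Conservative per-step scores: p_t = min(ReasonFlux(C_t), QwenPRM(C_t))
    p = [min(a, b) for a, b in zip(scores_rf, scores_qw)]

    # Longest prefix whose steps all clear the threshold:
    # L = max{t | for all k <= t, p_k >= tau_ok}
    L = 0
    for score in p:
        if score >= tau_ok:
            L += 1
        else:
            break
    if L == 0:
        return 0.0  # no valid prefix, no process reward

    # Min-form value v_t = min_{k=t}^{L} p_k, then R_proc = (1/L) * sum_t v_t
    v = [min(p[t:L]) for t in range(L)]
    return sum(v) / L

# Example: two PRMs score a 5-step chain; step 4 falls below tau_ok
r_proc = process_reward([0.9, 0.85, 0.8, 0.5, 0.9], [0.95, 0.8, 0.75, 0.6, 0.9])
```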
Fusion Strategy Rankings (from 3540-problem experiment):
| Strategy | Accuracy | F1 Score | Recommendation |
|---|---|---|---|
| uncertainty_arbitrator | 0.9222 | 0.9579 | ✅ Best |
| weighted_0.3_0.7 | 0.9193 | 0.9565 | ✅ Good |
| geometric_mean | 0.9174 | 0.9556 | ✅ Good |
| min | 0.6818 | 0.8010 | ❌ Too conservative |
Uncertainty Arbitrator Algorithm:
std = standard_deviation([p1, p2])      # disagreement between the two PRM scores
if std < 0.1:      # high consistency
    p_t = (p1 + p2) / 2                 # arithmetic mean
elif std > 0.3:    # low consistency (high uncertainty)
    # conservative weighting: bias toward the lower score
    p_t = 0.3 * max(p1, p2) + 0.7 * min(p1, p2)
else:              # medium consistency
    # linear interpolation between the conservative fusion and the mean,
    # with alpha increasing as the disagreement grows (e.g. alpha = (std - 0.1) / 0.2)
    conservative = 0.3 * max(p1, p2) + 0.7 * min(p1, p2)
    arithmetic = (p1 + p2) / 2
    p_t = alpha * conservative + (1 - alpha) * arithmetic
Consensus Truncation (recommended):
# Only truncate if BOTH PRMs judge the step as wrong
if p1 < 0.4 and p2 < 0.4:
    truncate_at_step(t)
else:
    keep_step(t)   # avoid false-positive truncation
2. Progressive Reward System
Core Idea: Reward based on target answer probability increments during reasoning
Algorithm:
# Calculate probability increment at each step
Δq_t = q_t - q_{t-1}
# Segmented reward function
if Δq_t > ε: # Effective progress
r_t = α × Δq_t
elif |Δq_t| ≤ ε: # Stagnation
r_t = -β
else: # Regression (Δq_t < -ε)
r_t = -γ × |Δq_t|
Parameters:
- ε = 0.01 (progress threshold)
- α = 0.5 (progress reward coefficient)
- β = 0.005 (stagnation penalty)
- γ = 0.1 (regression penalty coefficient)
Optimization: KV Cache enabled for faster probability computation
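A minimal sketch of this segmented reward with the parameters above (the function name is illustrative):

```python
def progressive_reward(q_prev, q_curr, eps=0.01, alpha=0.5, beta=0.005, gamma=0.1):
    """Segmented reward on the change in target-answer probability."""
    delta = q_curr - q_prev
    if delta > eps:            # effective progress
        return alpha * delta
    elif abs(delta) <= eps:    # stagnation
        return -beta
    else:                      # regression (delta < -eps)
        return -gamma * abs(delta)

# Example: probability of the target answer rises from 0.30 to 0.38
r_t = progressive_reward(0.30, 0.38)   # 0.5 * 0.08 = 0.04
```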
3. Correct Prefix Truncation
Strategy: Truncate at first error step identified by PRM
Implementation:
- Treat correct prefixes as positive examples
- Penalize continued generation after truncation point
- Prevents model from learning from incorrect reasoning chains
- Avoid placing EOS token at truncation point (model learns "unfinished" state)
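One way these rules can be applied at reward time is to mask everything after the first PRM-flagged error step and leave the sequence without an EOS at the cut. The following is a simplified sketch under that assumption, not the project's actual training-loop code:

```python
def truncation_mask(step_token_spans, first_error_step, seq_len):
    """Keep tokens of the correct prefix; mask tokens after the truncation point.

    step_token_spans: list of (start, end) token indices for each reasoning step.
    first_error_step: index of the first step flagged wrong by the PRMs (or None).
    """
    mask = [1] * seq_len
    if first_error_step is not None:
        cut = step_token_spans[first_error_step][0]   # truncate at the start of the bad step
        for i in range(cut, seq_len):
            mask[i] = 0   # tokens after the cut are excluded from the positive signal
        # Note: no EOS is inserted at `cut`, so the prefix is treated as unfinished.
    return mask
```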
4. Auxiliary Rewards
- Format Reward (r_fmt): Check for \boxed{} format (weight: 0.1)
- Waste Penalty (r_waste): Penalize generation after truncation (weight: 0.05)
- Calibration Penalty (r_cal): Penalize overconfidence (weight: 0.05)
5. Total Reward Formula
R_total = w_proc * R_proc + w_prog * R_prog + w_final * R_final
+ w_fmt * R_fmt + w_waste * R_waste + w_cal * R_cal
Default weights:
- w_final = 1.0 (final answer correctness)
- w_proc = 0.5 (process reward)
- w_prog = 0.2 (progressive reward)
- w_fmt = 0.1 (format reward)
- w_waste = 0.05 (waste penalty)
- w_cal = 0.05 (calibration penalty)
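With these weights, the total reward is a plain weighted sum; the dictionary keys below mirror the config's weight names, and the component values are illustrative:

```python
WEIGHTS = {"final": 1.0, "proc": 0.5, "prog": 0.2, "fmt": 0.1, "waste": 0.05, "cal": 0.05}

def total_reward(components, weights=WEIGHTS):
    """R_total = sum_k w_k * R_k over the reward components."""
    return sum(weights[k] * components[k] for k in weights)

# Example with illustrative component values
r = total_reward({"final": 1.0, "proc": 0.75, "prog": 0.04,
                  "fmt": 1.0, "waste": -0.2, "cal": 0.0})
```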
Project Structure
GRPO/
├── grpo-pro/ # Main implementation directory
│ ├── configs/ # Configuration files
│ │ ├── dual_pro_v3.yaml # Main training config (v3.0)
│ │ └── test_small.yaml # Small dataset test config
│ ├── src/ # Source code
│ │ ├── rewards/ # Reward system (core)
│ │ │ ├── dual_prm.py # Dual-PRM implementation
│ │ │ ├── progressive.py # Progressive reward system
│ │ │ ├── reward_manager.py # Reward orchestration
│ │ │ └── auxiliary.py # Auxiliary rewards
│ │ ├── data/ # Data processing
│ │ │ └── math_dataset.py # Data loading pipeline
│ │ └── metrics/ # Monitoring metrics
│ │ └── monitoring.py # Training monitoring
│ ├── scripts/ # Running scripts
│ │ ├── train_dual_pro.sh # Single-node training script
│ │ └── train_h200.sh # H200 training script
│ ├── data/ # Data directory
│ │ ├── processed/ # Processed datasets
│ │ │ ├── math_valid_2000.parquet # 1,800 training samples
│ │ │ ├── math_olympiads_aime_train.parquet
│ │ │ └── test/ # Test datasets
│ │ │ ├── Math-500/
│ │ │ ├── AIME_2024/
│ │ │ ├── AMC23/
│ │ │ ├── GPQA_diamond/
│ │ │ └── MinervaMath/
│ │ ├── train/ # Training data
│ │ └── val/ # Validation data
│ ├── outputs/ # Training outputs
│ │ ├── h200_single_node/ # H200 training outputs
│ │ │ └── global_step_*/ # Checkpoints (every 100 steps)
│ │ │ └── global_step_3360/ # Final checkpoint
│ │ └── hf_models/ # HuggingFace format models
│ │ └── global_step_3360/ # Final model in HF format
│ ├── evaluation_results/ # Evaluation summaries
│ │ ├── base_model/ # Baseline model results
│ │ ├── base_model_h100/ # Baseline (H100) results
│ │ ├── step_3360_final/ # Final checkpoint results
│ │ │ └── checkpoint_global_step_3360_accelerated/
│ │ │ ├── Math-500/ # 46.6% accuracy
│ │ │ ├── AIME_2024/ # 0% accuracy
│ │ │ ├── MinervaMath/ # 13.97% accuracy
│ │ │ └── GPQA_diamond/ # 5.56% accuracy
│ ├── logs/ # Training logs
│ ├── 训练记录_h200/ # H200 training records (Chinese)
│ ├── tensorboard_log/ # TensorBoard logs
│ ├── 训练曲线/ # Training curve visualizations
│ ├── 训练记录/ # General training records
│ ├── models/ # Model directory
│ │ ├── Qwen2.5-7B-Instruct/ # Policy model
│ │ ├── ReasonFlux-7B/ # PRM model 1
│ │ └── Qwen2.5-Math-PRM-7B/ # PRM model 2
│ ├── train_dual_pro.py # Training entry point
│ ├── train_h200_direct.py # H200-specific training
│ ├── verify_setup.py # Environment verification
│ ├── README_DUAL_PRO.md # Detailed Dual-Pro documentation
│ ├── H200_TRAINING_STATUS.md # H200 training status report
│ └── 训练指标说明.md # Training metrics explanation
├── verl/ # VERL framework (RL framework)
│ ├── verl/
│ │ ├── trainer/ # GRPO/PPO trainers
│ │ │ └── ppo/ # PPO/GRPO implementation
│ │ ├── workers/ # Actor, Rollout, Reward workers
│ │ ├── protocols/ # Data protocols
│ │ └── utils/ # Utility functions
│ └── examples/ # Example configurations
├── Math-Verify/ # Mathematical evaluation framework
│ ├── eval/
│ │ ├── math/ # Math evaluation
│ │ └── format_utils.py # Format checking
│ └── test_sets/ # Test datasets
├── QUICKSTART.md # Quick start guide
└── README.md # This file
Training Configuration
Hardware
Single Node Training (H200):
- GPUs: 4× H200 (or A100)
- Memory: 256 GB RAM
- Partition: accelerated-h200
- Time Limit: 48 hours
GPU Allocation:
GPU 0-1: Actor training (FSDP2, 2 GPUs)
GPU 2: ReasonFlux-PRM-7B
GPU 3: Qwen2.5-Math-PRM-7B
Multi-Node Training (2× H100):
- Nodes: 2 H100 nodes
- Total GPUs: 8 (4 per node)
- Time Limit: 48 hours per session
- Auto-resume: Yes (from latest checkpoint)
Model Configuration
model:
base_model: "Qwen/Qwen2.5-7B-Instruct"
dtype: "bfloat16"
enable_gradient_checkpointing: true
use_ema_reference: true
ema_decay: 0.999
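The use_ema_reference / ema_decay settings suggest the reference policy is maintained as an exponential moving average of the actor. A minimal sketch of such an update, assuming that interpretation (not the exact VERL implementation):

```python
import torch

@torch.no_grad()
def update_ema_reference(ref_model, policy_model, decay=0.999):
    """theta_ref <- decay * theta_ref + (1 - decay) * theta_policy."""
    for p_ref, p_pol in zip(ref_model.parameters(), policy_model.parameters()):
        p_ref.mul_(decay).add_(p_pol, alpha=1.0 - decay)
```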
Training Hyperparameters
training:
learning_rate: 1.0e-6
min_learning_rate: 1.0e-7
warmup_ratio: 0.1
weight_decay: 0.01
batch_size_per_device: 4
gradient_accumulation_steps: 4
total_epochs: 30 # H200 training
max_grad_norm: 1.0
generation:
num_generations: 16 # Group Size (G)
temperature: 0.6
top_p: 0.95
max_new_tokens: 1024
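num_generations is the GRPO group size G: each prompt is rolled out 16 times and each completion's reward is normalized within its own group. A sketch of the standard group-relative advantage (the project's std floor and other details may differ):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each completion's reward within its prompt group."""
    r = np.asarray(rewards, dtype=np.float64)   # shape (G,): rewards for one prompt
    return (r - r.mean()) / (r.std() + eps)

# Example with a group of 4 completions (G is 16 in the actual config)
adv = group_relative_advantages([1.47, 0.32, 0.95, 0.10])
```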
Reward Weights
rewards:
weights:
proc: 0.5 # Process reward weight
prog: 0.2 # Progressive reward weight
final: 1.0 # Final answer reward (dominant)
fmt: 0.1 # Format reward
waste: 0.05 # Waste penalty
cal: 0.05 # Calibration penalty
PRM Configuration
rewards:
prm:
fusion_strategy: "uncertainty_arbitrator" # Best strategy (0.9222 accuracy)
tau_ok: 0.7 # Threshold for valid step
use_vllm: false # Use HuggingFace
truncation_mode: "consensus" # Consensus truncation
consensus_threshold: 0.4 # Both PRMs < 0.4 to truncate
# Uncertainty Arbitrator parameters
uncertainty_low_threshold: 0.1 # std < 0.1 → high consistency
uncertainty_high_threshold: 0.3 # std > 0.3 → low consistency
conservative_weight: 0.3 # Conservative fusion weight
Datasets
Training Data Summary (Total: 901,676 samples)
| Dataset | Samples | Source | Difficulty | Usage |
|---|---|---|---|---|
| NuminaMath | 863,760 | AI-MO | Mixed | Pretraining |
| Math3to5 | 18,328 | chenggong1995 | Level 3-5 | Main Training |
| MATH | 5,000 | hendrycks | Level 1-5 | Standard benchmark |
| OlympiadBench | 1,389 | OlympiadBench | Competition | Competition level |
| GSM8K | 8,792 | openai | Easy | Grade school math |
| AMC23 | 40 | math-ai | Hard | Competition |
| AIME | 30 | math-ai | Very Hard | Competition |
Current Training Set
- File: data/processed/math_valid_2000.parquet
- Samples: 1,800 problems (filtered from Math3to5)
- Format: Parquet with columns: problem, solution, answer, uid, source, level, type
- Source: Math Olympiad problems (Level 3-5)
Test Sets
Located in data/processed/test/:
- Math-500/: MATH benchmark (500 samples) - Standard evaluation
- AIME_2024/: AIME 2024 problems (30 samples) - Competition level
- math-ai--amc23/: AMC 23 problems (40 samples) - Competition
- MinervaMath/: Minerva math test set (272 samples) - Research benchmark
- GPQA_diamond/: GPQA diamond level (198 samples) - Expert level
- OlympiadBench_maths_en/: Olympiad math problems - Competition
Training Pipeline
1. Data Processing
# Raw JSONL → Preprocessing → Parquet → Train/Val Split
python scripts/preprocess_data.py \
--input data/raw/math3to5.jsonl \
--output data/processed/math_valid_2000.parquet \
--format parquet
2. Training Submission
Single Node (H200):
cd grpo-pro
sbatch scripts/train_h200.sh
Multi-Node (2× H100):
cd grpo-pro
# Submit training job (auto-resume from checkpoint)
sbatch scripts/train_dual_pro_2nodes.sh
3. Training Monitoring
# Check job status
squeue -u $USER
# Real-time training log
tail -f outputs/h200_single_node/train.log
# Check GPU utilization
watch -n 1 nvidia-smi
# View detailed metrics
tail -f 训练记录_h200/training_details_step_*.jsonl
# TensorBoard
tensorboard --logdir tensorboard_log/ --port 6006
4. Evaluation
# Evaluate checkpoint
python scripts/evaluate_math_model.py \
--checkpoint outputs/hf_models/global_step_3360 \
--test_file data/processed/test/Math-500 \
--output evaluation_results/step_3360_final/
# Or use the wrapper script
bash scripts/evaluate_model.sh --checkpoint outputs/hf_models/global_step_3360
Monitored Metrics
Key Metrics
| Metric | Meaning | Target | Current Value |
|---|---|---|---|
| Pass@1 | Single-shot accuracy | ↑ Increase | - |
| VLR (Valid Length Ratio) | Effective reasoning length | ↑ Increase | - |
| ISR (Ineffective Step Ratio) | Low-quality step ratio | ↓ Decrease | - |
| Self-Deception | Gap: \|q_final - Pass@1\| | ↓ Decrease | - |
| R_proc_avg | Average process reward | ↑ Increase | 0.420 (final) |
| R_prog_avg | Average progressive reward | ↑ Increase | - |
| PRM Agreement | Correlation between PRMs | → Monitor | - |
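As a reference point, the length and calibration metrics above reduce to simple ratios and gaps; a sketch under the definitions in the table (the exact formulas in src/metrics/monitoring.py may differ):

```python
def compute_metrics(valid_len, total_len, n_ineffective, n_steps, q_final, pass_at_1):
    """Length and calibration metrics for a batch of trajectories."""
    return {
        "vlr": valid_len / total_len,                 # Valid Length Ratio
        "isr": n_ineffective / n_steps,               # Ineffective Step Ratio
        "self_deception": abs(q_final - pass_at_1),   # confidence vs. actual accuracy gap
    }
```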
TensorBoard Visualization
tensorboard --logdir grpo-pro/tensorboard_log --port 6006
Panels:
- training/loss: Actor loss curve
- training/reward: Average reward
- reward/r_proc: Process reward
- reward/r_prog: Progressive reward
- metrics/pass_at_1: Pass rate
- metrics/vlr: Valid length ratio
- truncation/ratio: Truncation ratio
Challenges and Issues
1. Performance Regression ⚠️ Critical
Problem: Training reward increased by 28%, but test accuracy dropped by roughly 2 to 13 percentage points
Current Status:
- Training reward: 0.328 → 0.420 (+28%)
- MATH-500: 59.8% → 46.6% (-13.2%)
- MinervaMath: 20.2% → 13.97% (-6.23%)
- GPQA Diamond: 7.6% → 5.56% (-2.04%)
Possible Causes:
- Overfitting: Model memorized training patterns
- Reward Signal Mismatch: PRM scores don't correlate with actual correctness
- Data Leakage: Train/test set overlap
- Temperature Mismatch: Training uses temp=0.6, evaluation uses temp=0.0
Investigation Needed:
- Analyze reward-answer correlation
- Verify train/test set separation
- Evaluate with matching temperature (0.6)
- Check PRM quality on validation set
- Perform ablation study on reward components
2. Resource Constraints
Issues:
- OOM (Out of Memory) during training
- PRM model loading failures
- flashinfer compatibility issues
- Multi-node coordination challenges
Solutions Applied:
- Reduced batch size and gradient accumulation
- Disabled flashinfer (use native PyTorch sampling)
- CPU offloading for optimizer states
- FSDP2 for efficient memory management
- Environment variable fixes for Ray/vLLM
3. Training Efficiency
Current: 3-4 days for full dataset training
Optimization Opportunities:
- Multi-GPU PRM inference
- Better caching strategies
- Mixed precision training optimization
- Distributed data loading
Technical Stack
Frameworks
VERL: VolcEngine Reinforcement Learning framework
- Custom GRPO trainer implementation
- FSDP2 distributed training
- vLLM for fast inference
- GitHub: https://github.com/volcengine/verl
PyTorch: Deep learning framework
- Version: 2.0+
- Distributed training (NCCL)
- bfloat16 mixed precision
vLLM: High-throughput LLM inference
- Version: 0.6.6+
- Multi-GPU support
- KV cache optimization
- Flashinfer disabled (compatibility issues)
Transformers: HuggingFace transformers
- Model loading and tokenization
- Chat template support
Ray: Distributed computing
- Actor/Rollout worker coordination
- GPU resource management
- Multi-node training support
Environment
# Conda environment
conda activate grpo
# Key dependencies
- torch==2.0.1
- transformers>=4.40.0
- verl (custom build from source)
- vllm==0.6.6
- ray>=2.9.0
- pandas
- pyarrow
- numpy
- wandb (logging)
- tensorboard
Evaluation Framework
Math-Verify Integration
Located in Math-Verify/ directory:
Features:
- Exact match evaluation
- Format checking (\boxed{})
- Numerical answer extraction
- LaTeX expression parsing
- Support for multiple test formats
Usage:
from MathVerify import eval_math
results = eval_math(
model_path="outputs/hf_models/global_step_3360",
test_file="data/processed/test/Math-500",
batch_size=32,
max_length=4096,
temperature=0.0
)
Test Coverage:
- MATH-500 (standard benchmark)
- AIME 2024 (competition)
- AMC 23 (competition)
- MinervaMath (research benchmark)
- GPQA Diamond (expert level)
Key Files Reference
Configuration Files
- grpo-pro/configs/dual_pro_v3.yaml - Main training configuration
- grpo-pro/configs/test_small.yaml - Small dataset test
Training Scripts
- grpo-pro/train_dual_pro.py - Main training entry point
- grpo-pro/train_h200_direct.py - H200-specific training
- grpo-pro/scripts/train_dual_pro.sh - SLURM submission script
Source Code
- grpo-pro/src/rewards/dual_prm.py - Dual-PRM implementation
- grpo-pro/src/rewards/progressive.py - Progressive reward system
- grpo-pro/src/rewards/reward_manager.py - Reward orchestration
Documentation
- grpo-pro/README_DUAL_PRO.md - Dual-Pro system documentation
- grpo-pro/H200_TRAINING_STATUS.md - H200 training status
- grpo-pro/训练指标说明.md - Training metrics explanation (Chinese)
Reproducing the Results
Step 1: Environment Setup
# Create conda environment
conda create -n grpo python=3.11 -y
conda activate grpo
# Install dependencies
pip install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install "transformers>=4.40.0"
pip install vllm==0.6.6
pip install "ray>=2.9.0"
pip install tensordict
pip install hydra-core omegaconf
pip install datasets pandas pyarrow
pip install flash-attn --no-build-isolation
pip install wandb tensorboard
# Install VERL framework
cd verl
pip install -e .
cd ..
Step 2: Download Models
# Download base model
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
--local-dir grpo-pro/models/Qwen2.5-7B-Instruct
# Download PRM models
huggingface-cli download ReasonFlux/ReasonFlux-7B \
--local-dir grpo-pro/models/ReasonFlux-7B
huggingface-cli download Qwen/Qwen2.5-Math-PRM-7B \
--local-dir grpo-pro/models/Qwen2.5-Math-PRM-7B
Step 3: Prepare Data
cd grpo-pro
# Data is already processed in data/processed/
# Verify files exist
ls -lh data/processed/math_valid_2000.parquet
Step 4: Run Training
# Submit to SLURM (H200)
sbatch scripts/train_h200.sh
# Or run locally for testing
python train_h200_direct.py --config configs/test_small.yaml
Step 5: Evaluate
# After training completes
python scripts/evaluate_math_model.py \
--checkpoint outputs/hf_models/global_step_3360 \
--test_sets Math-500 AIME_2024
Future Work
High Priority
Investigate Performance Regression ⚠️
- Analyze reward-answer correlation
- Fix temperature mismatch
- Verify data integrity
- Perform ablation studies
Improve Reward Signal
- Align PRM scores with actual accuracy
- Add diversity rewards
- Implement dynamic reward weighting
- Experiment with different fusion strategies
Regularization
- Add dropout
- Implement early stopping
- Data augmentation
- Weight decay tuning
Medium Priority
Scale Training
- Multi-node training improvements
- Larger batch sizes
- More training data (full NuminaMath: 863k samples)
- Curriculum learning
Evaluation
- More test sets
- Ablation studies
- Comparison with baselines
- Human evaluation
Low Priority
- Optimization
- Faster PRM inference
- Better caching
- Gradient checkpointing improvements
- Memory optimization
Troubleshooting
Common Issues
Q1: OOM during training
# Reduce batch size in config
batch_size_per_device: 2
# Enable CPU offloading
export FSDP_OFFLOAD_OPTIMIZER=1
# Reduce GPU memory utilization
gpu_memory_utilization: 0.3
Q2: PRM loading fails
# Check model paths
ls -lh models/ReasonFlux-7B/
ls -lh models/Qwen2.5-Math-PRM-7B/
# Verify with verify_setup.py
python verify_setup.py
# Force HuggingFace backend
use_vllm: false
Q3: Flashinfer errors
# Disable flashinfer
export VLLM_USE_FLASHINFER_SAMPLER=0
# Or patch with disable_flashinfer.py
python disable_flashinfer.py
Q4: Ray GPU visibility issues
# Allow all processes to see all GPUs
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1
# Clear ROCR_VISIBLE_DEVICES
unset ROCR_VISIBLE_DEVICES
Q5: Training not converging
# Adjust learning rate
actor:
optim:
lr: 5e-7 # Lower learning rate
# Adjust reward weights
rewards:
weights:
final: 2.0 # Increase final answer weight
proc: 0.3 # Decrease process reward weight
Q6: Multi-node job pending
# Check queue status
squeue -p accelerated-h100 --format="%10i %9P %20j %8u %2t %10M %6D %R"
# Check node resources
sinfo -p accelerated-h100
# Submit during off-peak hours (nights/weekends)
Q7: Resume from checkpoint
# Just resubmit the same script
sbatch scripts/train_dual_pro_2nodes.sh
# Script automatically detects latest checkpoint
ls -lht outputs/dual_pro_2nodes/global_step_*/
Citation
If you use this code or implementation, please cite:
@software{grpo_math_reasoning,
title={Dual-Pro GRPO: Group Relative Policy Optimization for Mathematical Reasoning},
author={GRPO Team},
year={2026},
url={https://github.com/your-repo/grpo}
}
License and Acknowledgments
- VERL Framework: VolcEngine Reinforcement Learning (https://github.com/volcengine/verl)
- Base Model: Qwen2.5-7B-Instruct (Alibaba)
- PRM Models: ReasonFlux-7B, Qwen2.5-Math-PRM-7B
- Evaluation: Math-Verify framework
- Inspiration: DeepSeek-Math, OpenAI's Process Reward Models
Contact
Project Maintainer: tum_fmp0582@hk-project
Project Location:
/hkfs/work/workspace/scratch/tum_fmp0582-my_rl_ws/tum_fmp0582-dndworkspace-1766456703/GRPO
Last Updated: January 22, 2026
Appendix: Training Commands Quick Reference
# === Environment ===
conda activate grpo
cd grpo-pro
# === Training ===
sbatch scripts/train_h200.sh # Submit H200 training
sbatch scripts/train_dual_pro_2nodes.sh # Submit 2-node training
squeue -u $USER # Check job status
# === Monitoring ===
tail -f outputs/h200_single_node/train.log # Monitor training
tail -f 训练记录_h200/training_details_step_*.jsonl # Detailed logs
tensorboard --logdir tensorboard_log/ # TensorBoard
watch -n 1 nvidia-smi # GPU usage
# === Evaluation ===
bash scripts/evaluate_model.sh --base # Evaluate baseline
bash scripts/evaluate_model.sh --checkpoint outputs/hf_models/global_step_3360
# === Debug ===
python verify_setup.py # Verify environment
bash check_training.sh # Check training status
Note: This project represents ongoing research into improving mathematical reasoning in LLMs through reinforcement learning. The performance regression observed in the latest training run highlights the complexity of aligning reward signals with actual task performance and is an active area of investigation. Contributions and suggestions are welcome!