This is the post-warmup Phase B (iter1) checkpoint, used as the starting point for the GRPO compression run at https://huggingface.co/LauraGG/qwen25math-7b-abstract-cot-grpo
Trained with 1 policy-iteration round (Phase A + Phase B) on 3k examples from Dolci-Think-SFT-7B, 1 epoch each, with full fine-tuning on 1× H100. Training took about 70 minutes.
Post-warmup cold-start probe on MATH-500 (HuggingFaceH4/MATH-500): 15.6% (n=32, temperature T=0.7).
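To put the probe number in context, here is a minimal sketch of the arithmetic behind it. It assumes the 15.6% is an accuracy over n=32 scored responses, which would correspond to 5 correct (5/32 = 15.625%). The count of 5 and the Wilson interval are illustrative additions, not figures reported for this checkpoint. With a sample this small, the interval is wide, so the probe is best read as a rough sanity check.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


n = 32          # probe size, from the card
successes = 5   # assumption: the only count out of 32 that rounds to 15.6%

accuracy = successes / n
low, high = wilson_interval(successes, n)

print(f"accuracy = {accuracy:.1%}")
print(f"95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

Under these assumptions the interval spans tens of percentage points, so a larger evaluation set would be needed to compare checkpoints reliably.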