allenai/ai2_arc
Viewer • Updated • 7.79k • 463k • 337
Multi-benchmark GRPO hardening with 6-signal reward stack and self-healing callback.
| Benchmark | Dataset | Reward Type |
|---|---|---|
| GPQA Diamond | Wanfq/gpqa (198 PhD-level MCQs) |
Letter-match MCQ |
| ARC-Challenge | allenai/ai2_arc (1172 MCQs) |
Letter-match MCQ |
| IFEval | google/IFEval (541 IF prompts) |
Rule-based constraint verification |
| PHYBench | Eureka-Lab/PHYBench (500 physics) |
Numeric/symbolic match |
| Adversarial | Generated (injection + IF + tools) | Injection defense + graceful recovery |
Plateau → β×1.5 + LR×0.5 | Collapse → checkpoint + stop
pip install trl transformers torch datasets accelerate trackio
python benchmark_hardening.py