arxiv:2605.14040

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Published on May 13 · Submitted by Shan Yang on May 18


Authors:

Shan Yang

Abstract

Three undocumented issues in multimodal-physics evaluation—train-eval contamination, translation drift, and MCQ saturation—are identified through comprehensive auditing, revealing significant distortions in vision-language reasoning measurement.

AI-generated summary

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Sonnet 4.5 shows a 17-pp accuracy delta between the two language versions of 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) Identical Sonnet weights show a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
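
To make the first audit stage concrete, here is a minimal sketch of a 5-gram Jaccard overlap check between a training pool and an eval set. It is an illustration only, not the paper's pipeline: the word-level tokenization, the `threshold` value, and the names `word_ngrams`, `jaccard`, and `stage1_hits` are all assumptions, and the embedding-cosine and LLM-judge stages are not shown.

```python
import re


def word_ngrams(text, n=5):
    """Set of word-level n-grams (assumed tokenization: lowercase \\w+ runs)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def stage1_hits(train_items, eval_items, threshold=0.5, n=5):
    """Return (train_idx, eval_idx, score) for pairs at or above the threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    eval_grams = [word_ngrams(e, n) for e in eval_items]
    hits = []
    for ti, t in enumerate(train_items):
        tg = word_ngrams(t, n)
        for ei, eg in enumerate(eval_grams):
            score = jaccard(tg, eg)
            if score >= threshold:
                hits.append((ti, ei, score))
    return hits


if __name__ == "__main__":
    eval_set = ["A block of mass 2 kg slides down a frictionless incline of angle 30 degrees"]
    train_pool = [
        "A block of mass 2 kg slides down a frictionless incline of angle 30 degrees",
        "A 2 kg block slips down a smooth ramp tilted at 30 degrees",  # paraphrase
    ]
    print(stage1_hits(train_pool, eval_set))
```

In this toy example the verbatim copy is flagged while the paraphrase shares no 5-gram with the eval item and scores zero, which is the kind of blind spot that motivates adding semantic stages after the n-gram check.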


Community

shanyangmie

Paper author Paper submitter 1 day ago

TL;DR: Physics-R1 is a recipe for RL post-training an 8B VLM for visual physics reasoning. Three-seed mean: +18.9 pp on a novel-source physics olympiad eval (PhysOlym-A) and +18.5 pp on PhysReason. The surprise: binary reward beats dense by 17 pp in our setup.

Three transferable findings:

(1) Reward shape > reward density. A clean binary correct/incorrect reward outperforms dense intermediate-reward shaping by +17.1 pp on PhysReason. We expected the opposite, since dense rewards are the standard advice for hard reasoning tasks.
(2) Audited training data beats scraped scale. 2,434 manually audited problems (PhysR1Corp) outperform unaudited 10K+ corpora. Full audit methodology in §3.
(3) Scoring protocol matters as much as the model. Per-subpart liberal scoring inflates results by 10+ pp without measuring actual problem-solving. We use problem-level AND scoring: a problem counts only if every subpart is correct. 3-seed σ stays under 3.3 pp across all 6 evals under this protocol.
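
The difference between the two scoring protocols can be sketched as follows. This is a minimal illustration under stated assumptions: each problem is represented as a list of per-subpart correctness flags, "liberal" is taken to mean the fraction of all subparts answered correctly (the exact definition used in the paper may differ), and the function names `liberal_score` and `strict_and_score` are hypothetical.

```python
def liberal_score(problems):
    """Per-subpart accuracy: correct subparts divided by all subparts (assumed definition)."""
    total = sum(len(p) for p in problems)
    correct = sum(sum(bool(x) for x in p) for p in problems)
    return correct / total if total else 0.0


def strict_and_score(problems):
    """Problem-level AND: a problem counts only if every one of its subparts is correct."""
    if not problems:
        return 0.0
    # bool(p) guards against an empty subpart list counting as solved.
    solved = sum(1 for p in problems if bool(p) and all(p))
    return solved / len(problems)


if __name__ == "__main__":
    # Toy grading results, invented purely for illustration.
    graded = [
        [True, True],          # fully solved
        [True, False, False],  # partial credit only under liberal scoring
        [False],               # wrong
        [True, True, True],    # fully solved
    ]
    print(f"liberal: {liberal_score(graded):.3f}")
    print(f"strict:  {strict_and_score(graded):.3f}")
```

On this toy input, liberal scoring credits 6 of 9 subparts while the AND protocol credits 2 of 4 problems, so partially solved problems raise the liberal number without any additional problem being fully solved.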

Why PhysOlym-A is the headline benchmark: its novel-source olympiad questions fall outside common training distributions, so the +18.9 pp delta holds on an eval where the contamination effects that may affect standard benchmarks are less likely.

Artifacts: