Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
Abstract
Three undocumented issues in multimodal-physics evaluation—train-eval contamination, translation drift, and MCQ saturation—are identified through comprehensive auditing, revealing significant distortions in vision-language reasoning measurement.
We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Contamination: public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals, yet a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Translation drift: Sonnet 4.5 shows a 17-pp delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) MCQ saturation: identical Sonnet weights show a 46-pp format-and-novelty gradient between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1, trained on the audited corpus, improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
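The first audit stage named in the abstract, 5-gram Jaccard overlap, can be sketched as follows. This is a minimal illustration and not the paper's audit code: the whitespace tokenization, the lowercasing, and both example strings are assumptions made here. It shows how a paraphrased problem can share no 5-grams with its source, which is the kind of case the embedding and LLM-judge stages are described as catching.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in `text` (assumed: lowercase, whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b, n=5):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # texts shorter than n tokens have no n-grams to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical strings, invented for illustration only.
eval_item = "a block of mass 2 kg slides down a frictionless incline of angle 30 degrees"
verbatim_copy = eval_item
paraphrase = "on a smooth 30 degree ramp a 2 kg block is sliding downward"

score_verbatim = jaccard(eval_item, verbatim_copy)   # exact copy: full overlap
score_paraphrase = jaccard(eval_item, paraphrase)    # reworded: no shared 5-gram
```

Under these assumptions the verbatim copy scores 1.0 and the paraphrase scores 0.0, so a threshold on this score alone would flag only the copy.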
Community
TL;DR: Physics-R1 applies RL post-training to an 8B VLM for visual physics reasoning. Three-seed mean +18.9 pp on a novel-source physics olympiad eval (PhysOlym-A), +18.5 pp on PhysReason. The surprise: binary reward beats dense by 17 pp in our setup.
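The binary-versus-dense contrast in the TL;DR can be illustrated with a small sketch. The post does not specify how either reward is computed, so everything below is a hypothetical stand-in: the relative-tolerance answer check, the per-step correctness flags, and the 0.5 step weight are assumptions chosen only to show how the two signals differ on the same rollout.

```python
def binary_reward(predicted, gold, rel_tol=1e-2):
    """1.0 if the final numeric answer is within a relative tolerance of gold, else 0.0."""
    return 1.0 if abs(predicted - gold) <= rel_tol * abs(gold) else 0.0


def dense_reward(step_flags, predicted, gold, step_weight=0.5):
    """Hypothetical shaped reward: partial credit for intermediate steps plus the final answer.

    `step_flags` is an assumed list of booleans, one per intermediate step.
    """
    step_credit = sum(step_flags) / len(step_flags) if step_flags else 0.0
    return step_weight * step_credit + (1 - step_weight) * binary_reward(predicted, gold)


# A rollout with three of four steps judged correct but a wrong final answer.
steps = [True, True, True, False]
r_binary = binary_reward(4.9, 9.8)        # wrong answer, so no reward
r_dense = dense_reward(steps, 4.9, 9.8)   # partial credit despite the wrong answer
```

Here the dense variant still pays 0.375 for a wrong final answer while the binary variant pays nothing. Whether that difference explains the reported gap is not something this sketch can show.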
Three transferable findings:
- Reward shape > reward density. A clean binary correct/incorrect reward outperforms dense intermediate-reward shaping by +17.1 pp on PhysReason. We expected the opposite: dense rewards are the standard advice for hard reasoning tasks.
- Audited training data beats scraped scale. 2,434 manually audited problems (PhysR1Corp) outperform unaudited 10K+ corpora. Full audit methodology in §3.
- Scoring protocol matters as much as the model. Per-subpart liberal scoring inflates results by 10+ pp without measuring actual problem-solving. We use problem-level AND (every subpart correct, or it doesn't count). 3-seed σ stays under 3.3 pp across all 6 evals under this protocol.
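The difference between the two scoring protocols in the last finding can be made concrete with a short sketch. It assumes one reading of "per-subpart liberal" (correct subparts pooled over all problems) and uses invented grading data; the released eval code may aggregate differently.

```python
def liberal_score(problems):
    """Assumed liberal protocol: fraction of all subparts graded correct."""
    total = sum(len(subparts) for subparts in problems)
    correct = sum(sum(subparts) for subparts in problems)
    return correct / total


def strict_and_score(problems):
    """Problem-level AND: a problem counts only if every subpart is correct."""
    return sum(all(subparts) for subparts in problems) / len(problems)


# Hypothetical grading results: one inner list of subpart verdicts per problem.
graded = [
    [True, True, True],    # fully solved
    [True, False, True],   # one subpart missed
    [False, True],         # one subpart missed
    [True],                # single-part problem, solved
]

liberal = liberal_score(graded)     # 7 of 9 subparts correct
strict = strict_and_score(graded)   # 2 of 4 problems fully correct
```

On this toy data the liberal score is about 0.78 and the problem-level AND score is 0.5, which illustrates how partial credit can raise a headline number without any additional problem being solved end to end.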
Why PhysOlym-A is the headline benchmark: its questions come from novel olympiad sources outside common training distributions, so the +18.9 pp delta is less exposed to the contamination effects that standard benchmarks may suffer.
Artifacts:
- Training corpus: shanyangmie/physr1corp
- Eval benchmark: shanyangmie/physolym-a
- Raw eval outputs (all models, all judges): shanyangmie/physics-r1-eval-outputs
- Code: github.com/shanyang-me/physics-r1-neurips2026
- Project page: shanyang.me/physics-r1-page
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone (2026)
- SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning (2026)
- Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages (2026)
- OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model (2026)
- Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling (2026)
- MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation (2026)
- SF20K Competition 2025: Summary and findings (2026)
Models citing this paper 8
shanyangmie/physics-r1-seed23-canonical-step60-fsdp
Datasets citing this paper 4
shanyangmie/physolym-a
shanyangmie/physcorp-a