arxiv:2605.14040

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Published on May 13 · Submitted by Shan Yang on May 18

Abstract

Three undocumented issues in multimodal-physics evaluation—train-eval contamination, translation drift, and MCQ saturation—are identified through comprehensive auditing, revealing significant distortions in vision-language reasoning measurement.

AI-generated summary

We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) Translation drift appears as a 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) MCQ saturation appears as a 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 trained on the audited corpus improves over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).
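To make the first audit stage concrete, here is a minimal sketch of a word-level 5-gram Jaccard check. It is an illustration under assumptions, not the paper's pipeline: the whitespace tokenization, the function names (`word_ngrams`, `ngram_jaccard`), and the two example problems are all hypothetical. It shows why a single-stage n-gram audit can report zero hits on a paraphrase even when the underlying problem is the same, which is the gap the later embedding and LLM-judge stages are meant to cover.

```python
def word_ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Lowercase, split on whitespace, and return the set of word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity |A & B| / |A | B| over word n-gram sets."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        # Both texts are shorter than n tokens: treat as no overlap.
        return 0.0
    return len(grams_a & grams_b) / len(union)


# Hypothetical examples (not taken from any of the audited corpora).
eval_item = "a block of mass 2 kg slides down a frictionless incline of angle 30 degrees"
verbatim = "a block of mass 2 kg slides down a frictionless incline of angle 30 degrees"
paraphrase = "a 2 kg block is released on a smooth ramp tilted at 30 degrees to the horizontal"

print(ngram_jaccard(eval_item, verbatim))    # exact copy: full overlap
print(ngram_jaccard(eval_item, paraphrase))  # paraphrase: no shared 5-gram
```

In this toy case the verbatim copy shares every 5-gram and the paraphrase shares none, so any threshold on this score alone would flag the first and pass the second.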

Community

Paper author Paper submitter

TL;DR: Physics-R1 applies RL post-training to an 8B VLM (Qwen3-VL-8B-Thinking) for visual physics reasoning. The three-seed mean gain is +18.9 pp on a novel-source physics olympiad eval (PhysOlym-A) and +18.5 pp on PhysReason. The surprise: binary reward beats dense by 17 pp in our setup.

Three transferable findings:

  1. Reward shape > reward density. A clean binary correct/incorrect reward outperforms dense intermediate-reward shaping by +17.1 pp on PhysReason. We expected the opposite — dense rewards are the standard
    advice for hard reasoning tasks.
  2. Audited training data beats scraped scale. 2,434 manually audited problems (PhysR1Corp) outperform unaudited 10K+ corpora. Full audit methodology in §3.
  3. Scoring protocol matters as much as the model. Per-subpart liberal scoring inflates results by 10+ pp without measuring actual problem-solving. We use problem-level AND (every subpart correct, or it
    doesn't count). 3-seed σ stays under 3.3 pp across all 6 evals under this protocol.
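The difference between the two scoring protocols in finding 3 can be sketched in a few lines. This is an assumption-laden illustration, not the paper's evaluation code: the function names and the graded results below are hypothetical, and each problem is represented simply as a list of per-subpart correctness flags.

```python
def liberal_score(subparts: list[bool]) -> float:
    """Per-subpart credit: the fraction of subparts answered correctly."""
    return sum(subparts) / len(subparts)


def strict_score(subparts: list[bool]) -> float:
    """Problem-level AND: 1.0 only if every subpart is correct, else 0.0."""
    return 1.0 if all(subparts) else 0.0


def mean_score(problems: list[list[bool]], scorer) -> float:
    """Average a per-problem scorer over a set of graded problems."""
    return sum(scorer(p) for p in problems) / len(problems)


# Hypothetical graded results: one inner list per problem, one flag per subpart.
graded = [
    [True, True, True],   # fully solved
    [True, False, True],  # two of three subparts correct
    [True, False],        # one of two subparts correct
    [False],              # single-part problem, wrong
]

print(mean_score(graded, liberal_score))  # partial credit counts
print(mean_score(graded, strict_score))   # only the fully solved problem counts
```

On this toy set, liberal scoring credits the partially solved problems while the problem-level AND counts only the one fully solved problem, which is the inflation effect described above. The strict scorer also has the same 0/1 shape as the binary reward in finding 1, though the actual reward function used in training is not shown on this page.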

Why PhysOlym-A is the headline benchmark: novel-source olympiad questions outside common training distributions. The +18.9 pp delta is robust where standard benchmarks may suffer contamination effects.



