arxiv:2603.07109

Vision Language Models Cannot Reason About Physical Transformation

Published on Mar 7

Authors:

Abstract

Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench evaluating conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with visual content. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.07109 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.07109 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.07109 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.