Token Warping Helps MLLMs Look from Nearby Viewpoints
Abstract
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
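A minimal sketch of the backward warping idea described above, not the paper's implementation: token features are laid out on their 2D grid, and every target-view grid point retrieves one source-view token through a correspondence map. In the paper's setting that map would come from depth and the relative pose between viewpoints; here it is a toy one-column shift, and the nearest-neighbor gather and clamping are illustrative assumptions.

```python
# Hedged sketch (assumptions: nearest-neighbor retrieval, border clamping,
# a toy shift standing in for a depth/pose-derived correspondence map).
import numpy as np

def backward_warp_tokens(src_tokens, tgt_to_src):
    """Fetch one source-view token for every target-view grid point.

    src_tokens: (H, W, C) token features laid out on their 2D grid.
    tgt_to_src: (H, W, 2) integer (row, col) source coordinates for each
        target grid point. Lookups are clamped to the border, so every
        target cell is filled -- no holes, unlike forward splatting.
    """
    H, W, _ = src_tokens.shape
    rows = np.clip(tgt_to_src[..., 0], 0, H - 1)
    cols = np.clip(tgt_to_src[..., 1], 0, W - 1)
    return src_tokens[rows, cols]

# Toy usage: a 4x4 grid of 3-dim tokens; each target cell looks one
# column to the right in the source view.
H, W, C = 4, 4, 3
tokens = np.arange(H * W * C, dtype=float).reshape(H, W, C)
rr, cc = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
shift = np.stack([rr, cc + 1], axis=-1)
warped = backward_warp_tokens(tokens, shift)
assert warped.shape == (H, W, C)
assert np.allclose(warped[:, 0], tokens[:, 1])  # interior columns shifted
```

Because the loop runs over the *target* grid, the output is dense by construction; this is the stability advantage the abstract attributes to backward over forward warping.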
Community
CVPR 2026
Paper: https://arxiv.org/abs/2604.02870
Project Page: https://token-warping-mllm.github.io/
Code: https://github.com/KAIST-Visual-AI-Group/Token-Warping-MLLM
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026)
- DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding (2026)
- SeGPruner: Semantic-Geometric Visual Token Pruner for 3D Question Answering (2026)
- Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding (2026)
- Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding (2026)
- Cog3DMap: Multi-View Vision-Language Reasoning with 3D Cognitive Maps (2026)
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning (2026)
The backward token warping idea is the core of this work: building a dense target-view grid and fetching source tokens via a lightweight 3D proxy mesh, which keeps semantics far more stable than pixel-level warping. I'd still like to see how this handles heavy occlusion or non-uniform lighting where depth error is high, because those failure modes could reveal whether token-level representations truly carry robust part-level structure. The arXivLens breakdown helped me parse the method steps and the retrieval loop without getting lost in patch noise (https://arxivlens.com/PaperView/Details/token-warping-helps-mllms-look-from-nearby-viewpoints-13-2d118584). Would you run an ablation on grid density versus depth noise to isolate which source of error hurts performance more?
Get this paper in your agent:

hf papers read 2604.02870

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0