arxiv:2601.16207

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Published on Jan 22

· Submitted by

Authors:

Abstract

IVRA enhances spatial understanding in vision-language-action models by injecting affinity signals into language-model layers without retraining or external encoders.

AI-generated summary

Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: jongwoopark7978.github.io/IVRA

View arXiv page View PDF Add to collection

Community

jongwoopark7978

Paper submitter about 7 hours ago

IVRA is a training-free, inference-time drop-in that restores spatial structure in VLA models by injecting encoder affinity signals into selected LLM layers (no retraining, no extra parameters, ~3% latency). It generalizes across LLaRA, OpenVLA, and FLOWER, improving both 2D and 3D manipulation: +4.2% on VIMA (Novel Object) over LLaRA in a low-data regime, and consistent LIBERO gains over OpenVLA (76.5%→77.6%) and FLOWER even near saturation (96.3%→97.1%).

Paper: https://arxiv.org/abs/2601.16207
Project: https://jongwoopark7978.github.io/IVRA

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.16207 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.16207 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.16207 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.