arxiv:2604.00913

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Published on Apr 1

Submitted by Zhuchenyang Liu on Apr 2


Authors:

Zhuchenyang Liu ,

Abstract

Vision-language models struggle to align assembly diagrams with video feeds because of a depiction gap; the findings identify visual encoding as the primary target for improving cross-depiction robustness.

AI-generated summary

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/


Community

Ryenhails

Paper author Paper submitter about 13 hours ago

We introduce IKEA-Bench, the first cross-depiction alignment benchmark for 2D-manual-based assembly guidance. Our benchmark comprises 1,623 questions across 6 task types on 29 IKEA furniture products, designed to systematically evaluate whether VLMs can align abstract schematic assembly diagrams with photorealistic assembly video — a fundamental capability for mixed-reality assembly assistants.

We evaluate 19 VLMs (2B–38B), including Gemini 3.1 Pro and Gemini 3 Flash, under three alignment strategies (Visual, Visual+Text, Text Only). Our key findings are: (1) text descriptions can recover assembly instruction understanding (+23.6 pp on diagram ordering), but counter-intuitively hurt diagram-to-video alignment (−3.1 pp on step recognition); (2) architecture family predicts alignment accuracy more reliably than parameter count — at ~8B scale, each Qwen generation yields larger gains than tripling parameters within a generation; (3) video understanding remains a hard bottleneck that no alignment strategy overcomes, with even the best proprietary model reaching only 71.1%.

To understand why models fail, we conduct a three-layer mechanistic analysis probing ViT representations, LLM hidden states, and attention routing. We find that diagrams and video occupy completely disjoint subspaces at the ViT level (CKA ≈ 0), and that adding text actively redirects both hidden-state reliance and attention away from visual modalities rather than supplementing them. Our results localize the bottleneck to the visual encoder and motivate cross-depiction contrastive pre-training as a direct path to closing this gap.
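For readers unfamiliar with the CKA score mentioned above, here is a minimal sketch of linear centered kernel alignment in NumPy. It is illustrative only: the page does not say which CKA variant, layer, or pooling the authors used, and the function name `linear_cka` and the random stand-in features are assumptions. It also assumes the two feature matrices are row-aligned, for example row i holds the diagram embedding and the video-frame embedding for the same assembly step.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with matching rows.

    x has shape (n, d1) and y has shape (n, d2); row i of each must
    describe the same underlying example (assumed pairing).
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y need the same number of rows")
    # Center every feature dimension so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for pooled ViT features, not real model outputs.
    diagram_feats = rng.normal(size=(2000, 8))
    video_feats = rng.normal(size=(2000, 8))
    print("same features:     ", linear_cka(diagram_feats, diagram_feats))
    print("unrelated features:", linear_cka(diagram_feats, video_feats))
```

A score near 1 means the two sets of representations share the same geometry up to rotation and uniform scaling; a score near 0, as reported for diagrams versus video, means they share almost no linear structure.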

Ryenhails

Paper author about 13 hours ago

Welcome to explore our project: https://ryenhails.github.io/IKEA-Bench/


