Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment
Abstract
Vision-Language Models struggle to align assembly diagrams with video feeds because of a depiction gap; the findings point to visual encoding as the primary target for improving cross-depiction robustness.
2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/
Community
We introduce IKEA-Bench, the first cross-depiction alignment benchmark for 2D-manual-based assembly guidance. Our benchmark comprises 1,623 questions across 6 task types on 29 IKEA furniture products, designed to systematically evaluate whether VLMs can align abstract schematic assembly diagrams with photorealistic assembly video — a fundamental capability for mixed-reality assembly assistants.
We evaluate 19 VLMs (2B–38B), including Gemini 3.1 Pro and Gemini 3 Flash, under three alignment strategies (Visual, Visual+Text, Text Only). Our key findings are: (1) text descriptions can recover assembly instruction understanding (+23.6 pp on diagram ordering), but counter-intuitively hurt diagram-to-video alignment (−3.1 pp on step recognition); (2) architecture family predicts alignment accuracy more reliably than parameter count — at ~8B scale, each Qwen generation yields larger gains than tripling parameters within a generation; (3) video understanding remains a hard bottleneck that no alignment strategy overcomes, with even the best proprietary model reaching only 71.1%.
To understand why models fail, we conduct a three-layer mechanistic analysis probing ViT representations, LLM hidden states, and attention routing. We find that diagrams and video occupy completely disjoint subspaces at the ViT level (CKA ≈ 0), and that adding text actively redirects both hidden-state reliance and attention away from visual modalities rather than supplementing them. Our results localize the bottleneck to the visual encoder and motivate cross-depiction contrastive pre-training as a direct path to closing this gap.
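For readers unfamiliar with the CKA score quoted above, the following is a minimal sketch of linear centered kernel alignment in plain Python. It assumes the linear variant of CKA and paired inputs (row i of each matrix is the feature vector for the same assembly step, once from the diagram and once from the video frame); this page does not say which CKA variant or pairing scheme the authors used, and the function names `center_columns`, `cross_frobenius_sq`, and `linear_cka` are illustrative rather than taken from the paper's code.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n, d = len(m), len(m[0])
    means = [sum(row[j] for row in m) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with equal row counts."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices (rows = paired samples).

    Assumption: row k of x and row k of y describe the same item,
    e.g. diagram features and video features for the same step.
    Returns a value in [0, 1]; 0 means no linear alignment.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(
        cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc)
    )
    return numerator / denominator if denominator > 0 else 0.0


# Toy features (hypothetical values, not data from the paper).
same = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
diagram_like = [[1.0], [-1.0], [1.0], [-1.0]]
video_like = [[1.0], [1.0], [-1.0], [-1.0]]

identical_score = linear_cka(same, same)
orthogonal_score = linear_cka(diagram_like, video_like)
```

In this toy setup, identical representations score 1 and representations that vary along orthogonal directions score 0, which is the regime the comment describes for diagram and video features at the ViT level.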