arxiv:2605.21625

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Published on May 20

· Submitted by

Aditya Chetan on Jun 1


Authors:

Aditya Chetan

Abstract

Large Vision-Language Models demonstrate significant limitations in fine-grained spatio-temporal reasoning and tracking abilities when evaluated on a new furniture assembly benchmark.

AI-generated summary

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, which limits their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts that highlight the relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limited ability to leverage temporal information from videos, to track parts, and to understand spatial interactions such as physical contact.


Community

justachetan

Paper author Paper submitter about 23 hours ago

We benchmark large vision-language models on complex, multi-step furniture assembly videos to evaluate their spatio-temporal reasoning skills. We found that models lag far behind human performance and struggle with object grounding and with tracking parts through the video. We also found that models do not use videos effectively to reason about the questions: on most tasks, a single image related to the question yields performance similar to passing the full video, even though human performance degrades on image-only prompts. Lastly, we observe that decomposing the task into tracking and contact reasoning does not help either, because specialized models for these simpler tasks also struggle in the assembly domain. For more details, check out our project site: https://flat-pack-bench.github.io/
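The video-versus-single-image comparison described above can be illustrated with a minimal sketch of per-task multiple-choice accuracy. This is not the authors' evaluation code: the record fields (`task`, `condition`, `predicted`, `answer`), the condition labels `"video"` and `"image"`, and the function names are all hypothetical, and the records are toy placeholders, not benchmark results.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {(task, condition): accuracy} for multiple-choice records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["task"], r["condition"])
        total[key] += 1
        # A prediction counts as correct only if it matches the answer key.
        correct[key] += int(r["predicted"] == r["answer"])
    return {key: correct[key] / total[key] for key in total}


def video_minus_image_gap(acc):
    """Per task: accuracy with the full video minus accuracy with one frame."""
    tasks = {task for task, _ in acc}
    return {
        task: acc[(task, "video")] - acc[(task, "image")]
        for task in tasks
        if (task, "video") in acc and (task, "image") in acc
    }


# Toy placeholder records (hypothetical; not data from the paper).
records = [
    {"task": "tracking", "condition": "video", "predicted": "A", "answer": "A"},
    {"task": "tracking", "condition": "video", "predicted": "B", "answer": "C"},
    {"task": "tracking", "condition": "image", "predicted": "A", "answer": "A"},
    {"task": "tracking", "condition": "image", "predicted": "D", "answer": "C"},
]

acc = accuracy_by_task(records)
gap = video_minus_image_gap(acc)
print(gap)
```

Under these assumptions, a gap near zero for a task would mean the model gains little from the full video on that task, which is the pattern the comment reports for most tasks.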



