arxiv:2603.21944

Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Published on Mar 23 · Submitted by youbin kim on Mar 24

Abstract

AI-generated summary: Group3D is a multi-view open-vocabulary 3D detection framework that integrates semantic constraints into instance construction through semantic compatibility groups, improving accuracy in both pose-known and pose-free settings.

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
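For intuition, here is a minimal sketch of the semantically gated merging the abstract describes. The paper does not publish this pseudocode; the fragment representation, the 3D-IoU geometric gate, the 0.25 threshold, the greedy one-pass merge order, and the union-extent fusion rule are all illustrative assumptions, not the paper's actual design.

```python
# Minimal sketch of semantically gated fragment merging (illustrative only).
# Assumptions: fragments are dicts with an axis-aligned 3D box and an
# open-vocabulary label; geometric consistency is approximated by 3D IoU.

def compatible(label_a, label_b, groups):
    """Labels are semantically compatible if some group contains both."""
    return any(label_a in g and label_b in g for g in groups)

def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:
            return 0.0
        inter *= hi - lo
    vol = lambda bx: (bx[3] - bx[0]) * (bx[4] - bx[1]) * (bx[5] - bx[2])
    return inter / (vol(a) + vol(b) - inter)

def merge_fragments(fragments, groups, iou_thresh=0.25):
    """Fuse a fragment into an instance only when BOTH the geometric gate
    and the semantic gate pass; otherwise start a new instance."""
    instances = []
    for frag in fragments:
        for inst in instances:
            if (box_iou_3d(frag["box"], inst["box"]) >= iou_thresh
                    and compatible(frag["label"], inst["label"], groups)):
                # Union-extent box fusion (placeholder fusion rule).
                a, b = inst["box"], frag["box"]
                inst["box"] = tuple(min(a[i], b[i]) for i in range(3)) \
                            + tuple(max(a[i + 3], b[i + 3]) for i in range(3))
                break
        else:
            instances.append(dict(frag))
    return instances

groups = [{"sofa", "couch"}, {"tv", "monitor"}]
fragments = [
    {"box": (0.0, 0.0, 0.0, 1.0, 1.0, 1.0), "label": "sofa"},
    {"box": (0.1, 0.0, 0.0, 1.1, 1.0, 1.0), "label": "couch"},  # both gates pass: merged
    {"box": (0.1, 0.0, 0.0, 1.1, 1.0, 1.0), "label": "tv"},     # high IoU, blocked by semantics
]
print(len(merge_fragments(fragments, groups)))  # -> 2
```

Without the `compatible` gate, the third fragment above would be absorbed into the sofa instance on geometric overlap alone, which is exactly the over-merging failure the paper targets; the groups also absorb cross-view label variability, since "sofa" and "couch" fragments still merge.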

Community

Paper author · Paper submitter

We are excited to share our recent work "Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection".

TL;DR: We present Group3D, which enables robust open-vocabulary 3D object detection by integrating semantic constraints into multi-view 3D instance construction.

We leverage multimodal large language models (MLLMs) to build scene-adaptive vocabularies and semantic compatibility groups, which guide 3D fragment merging alongside geometric consistency. This semantically aware merging mitigates over-merging and fragmentation, improving performance in multi-view RGB settings, including pose-free and zero-shot scenarios.
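For concreteness, here is one plausible, hypothetical realization of the grouping step. The paper derives the groups from an MLLM over a scene-adaptive vocabulary; this sketch substitutes a trivial token-overlap similarity so the example runs standalone, and every name and threshold in it is illustrative.

```python
# Sketch: organize a scene-adaptive vocabulary into semantic compatibility
# groups with union-find over pairwise label similarity. In Group3D the
# grouping judgment comes from an MLLM; the token-overlap `similarity`
# below is only a runnable stand-in, and the 0.3 threshold is arbitrary.
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word tokens; swap in an MLLM or text-encoder score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def build_groups(vocab, sim=similarity, thresh=0.3):
    """Union labels whose pairwise similarity reaches `thresh`."""
    parent = {w: w for w in vocab}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for a, b in combinations(vocab, 2):
        if sim(a, b) >= thresh:
            parent[find(a)] = find(b)

    by_root = {}
    for w in vocab:
        by_root.setdefault(find(w), set()).add(w)
    return list(by_root.values())

vocab = ["office chair", "desk chair", "coffee table", "table", "lamp"]
print(build_groups(vocab))
# e.g. [{'office chair', 'desk chair'}, {'coffee table', 'table'}, {'lamp'}]
```

In practice the pairwise compatibility signal would come from the MLLM (or a text encoder) rather than string overlap; the union-find structure is simply one way to turn pairwise judgments into the groups that act as merge-time gates.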

