arxiv:2605.17531

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

Published on May 24

· Submitted by

Yang yuting on May 28

iSEE-Laboratory

Upvote

Authors:

Yuting Yang ,

Abstract

IC-Seg is an agentic framework that clarifies user intent through multi-turn conversations before segmentation, using a hierarchical optimization strategy to improve ambiguous query resolution in referring video object segmentation.

AI-generated summary

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.

View arXiv page View PDF GitHub 9 Add to collection

Community

bbyyj

Paper author Paper submitter about 3 hours ago

IC-Seg resolves ambiguities via multi-turn dialogues with an MLLM-based User Simulator. Our Hi-GRPO algorithm empowers the agent through a hierarchical reward chain: supervising final localization accuracy at the trajectory level, inquiry quality at the turn level, and fine-grained reasoning steps using expert-diagnosed signals.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.17531 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.17531 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.17531 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.