Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search
Abstract
Seg-ReSearch introduces a novel segmentation approach that combines interleaved reasoning with external search to overcome the limitations of frozen MLLM knowledge, using a hierarchical reward design for training and demonstrating superior performance on video object segmentation benchmarks.
Language-guided segmentation has long been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined to the frozen internal knowledge of MLLMs, which limits their applicability to real-world scenarios involving up-to-date information or domain-specific concepts. In this work, we propose Seg-ReSearch, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that Seg-ReSearch outperforms state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.
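The abstract does not include implementation details, so the sketch below only illustrates one plausible shape of the interleaved reason-then-search loop it describes. Every interface here (`mllm.generate`, `web_search`, `segment`, and their fields) is a hypothetical placeholder, not the paper's actual API.

```python
# A minimal sketch of interleaved reasoning and external search for
# segmentation. All interfaces (mllm.generate, web_search, segment) are
# hypothetical placeholders assumed for illustration, not the paper's API.

def seg_research(mllm, web_search, segment, frames, query, max_turns=4):
    """Alternate between reasoning and search until the model grounds the
    query in a concrete target expression, then segment that target."""
    context = [f"Query: {query}"]
    for _ in range(max_turns):
        step = mllm.generate(frames, "\n".join(context))
        context.append(f"Reasoning: {step.thought}")
        if step.action == "search":
            # The model decides it needs knowledge beyond its frozen weights.
            results = web_search(step.search_query)
            context.append(f"Search results: {results}")
        elif step.action == "segment":
            # The model has resolved the query to a groundable expression.
            return segment(frames, step.target_expression)
    # Budget exhausted: fall back to segmenting with the raw query.
    return segment(frames, query)
```

Likewise, the hierarchical reward is only described at a high level. One common way to "harmonize initial guidance with progressive incentives", shown below purely as an assumed reading rather than the paper's formulation, is to anneal a dense step-wise reward into the sparse outcome reward as training progresses.

```python
# Illustrative only: an annealed mix of dense step-wise guidance and a
# sparse outcome signal (final mask IoU). Not the paper's exact formulation.

def hierarchical_reward(step_rewards, mask_iou, progress):
    """step_rewards: assumed per-step rewards in [0, 1] (e.g., format, valid search);
    mask_iou: final segmentation quality in [0, 1];
    progress: training progress in [0, 1]."""
    guidance_weight = max(0.0, 1.0 - progress)  # strong early, fades later
    dense = sum(step_rewards) / max(len(step_rewards), 1)
    return guidance_weight * dense + (1.0 - guidance_weight) * mask_iou
```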
Community
This work breaks the knowledge bottlenecks of MLLM-based segmentation models by integrating external search.
arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/seg-research-segmentation-with-interleaved-reasoning-and-external-search-1562-3c67e347
- Executive Summary
- Detailed Breakdown
- Practical Applications
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval (2026)
- CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction (2026)
- InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search (2025)
- DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models (2026)
- RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations (2025)
- ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying (2026)
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation (2026)