arxiv:2605.26102

InstructSAM: Segment Any Instance with Any Instructions

Published on May 25 · Submitted by YuqianYuan on May 26

Abstract

InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3 through learnable instance queries and hybrid attention mechanisms.

AI-generated summary

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
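The abstract does not spell out how the hybrid-attention mechanism is wired, so the following is only an illustrative sketch of one plausible reading: visual and instruction tokens keep ordinary causal attention, while the instance queries appended after them may attend to all context tokens and to each other. The token layout, the function name `hybrid_attention_mask`, and the masking rule itself are assumptions made for illustration, not details taken from the paper.

```python
from typing import List


def hybrid_attention_mask(num_context: int, num_queries: int) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout (hypothetical, not taken from the paper):
    positions [0, num_context) hold visual and instruction tokens,
    positions [num_context, num_context + num_queries) hold instance queries.
    """
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context tokens: ordinary causal attention (self and earlier).
                mask[i][j] = j <= i
            else:
                # Instance queries: attend to all context and to every query,
                # including later ones, so slots can coordinate with each other.
                mask[i][j] = True
    return mask


mask = hybrid_attention_mask(num_context=3, num_queries=2)
for row in mask:
    # One row per attending token; "1" marks an allowed attention edge.
    print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed rule, letting every query see every other query is what would allow the slots to divide instances among themselves rather than predicting the same one twice.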

Community

InstructSAM is an instruction-driven multi-instance segmentation framework designed to segment arbitrary target instances from natural-language instructions.


that bank of learnable instance queries embedded in the llm is a clever bridge, but i wonder how it handles crowds of instances that exceed the static query bank. is there a hard cap on the number of slots, and how does performance behave when real targets outnumber the slots or when scenes are extremely dense? an ablation varying the slot count or trying dynamic per-image slot allocation would help isolate how much of the gains come from the hybrid-attention versus the slot design. the arxivlens breakdown helped me parse the method details, e.g., https://arxivlens.com/PaperView/Details/instructsam-segment-any-instance-with-any-instructions-3375-ec8fa2c3
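The slot-cap concern in the comment above can be made concrete with a little arithmetic. This is a rough sketch that assumes each query slot emits at most one instance and that the bank size is fixed per image; neither assumption is confirmed by the abstract, and `recall_upper_bound` is a hypothetical helper, not part of any released code.

```python
def recall_upper_bound(num_slots: int, num_targets: int) -> float:
    """Best achievable recall if each slot predicts at most one instance."""
    if num_targets == 0:
        # Nothing to find, so nothing can be missed.
        return 1.0
    return min(num_slots, num_targets) / num_targets


# Hypothetical bank of 100 slots against scenes of increasing density.
for targets in (10, 50, 100, 200, 400):
    print(targets, recall_upper_bound(100, targets))
```

Under these assumptions recall is capped at the ratio of slots to targets once a scene holds more instances than slots, which is why an ablation over the slot count would be informative.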


