arxiv:2605.26102

InstructSAM: Segment Any Instance with Any Instructions

Published on May 25

Abstract

InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3 through learnable instance queries and hybrid attention mechanisms.

AI-generated summary

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
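The abstract does not spell out how the hybrid attention is wired, so the following is only a minimal sketch of one plausible reading. It assumes the visual and instruction tokens keep the usual causal attention of the language model, while the appended instance-query slots attend to every context token and to each other. The function name `build_hybrid_mask` and the layout (context tokens first, query slots last) are hypothetical and not taken from the paper.

```python
from typing import List


def build_hybrid_mask(num_context: int, num_queries: int) -> List[List[bool]]:
    """Return a square mask where mask[i][j] is True if position i may attend to j.

    Assumed layout (hypothetical): positions [0, num_context) hold visual and
    instruction tokens, positions [num_context, num_context + num_queries)
    hold the learnable instance-query slots.
    """
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context tokens: standard causal attention (see self and earlier tokens).
                mask[i][j] = j <= i
            else:
                # Query slots: see all context tokens and all query slots,
                # so slots can coordinate and avoid claiming the same instance.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Render a small mask: 3 context tokens followed by 2 query slots.
    for row in build_hybrid_mask(num_context=3, num_queries=2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under this assumption, the bidirectional block among the slots is what would let each query see the others' state, which matches the abstract's stated goal of reducing duplicate predictions. The actual mask used by InstructSAM may differ.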

Community

CircleRadon

Paper submitter about 15 hours ago

InstructSAM is an instruction-driven multi-instance segmentation framework designed to segment arbitrary target instances from natural-language instructions.

avahal

about 1 hour ago

that bank of learnable instance queries embedded in the llm is a clever bridge, but i wonder how it handles crowds of instances that exceed the static query bank. is there a hard cap on the number of slots, and how does performance behave when real targets outnumber the slots or when scenes are extremely dense? an ablation varying the slot count or trying dynamic per-image slot allocation would help isolate how much of the gains come from the hybrid-attention versus the slot design. the arxivlens breakdown helped me parse the method details, e.g., https://arxivlens.com/PaperView/Details/instructsam-segment-any-instance-with-any-instructions-3375-ec8fa2c3
