---
language:
- en
license: apache-2.0
tags:
- MLLM
pipeline_tag: image-segmentation
library_name: transformers
---
✨ X-SAM
From Segment Anything to Any Segmentation
Hao Wang1,2, Limeng Qiao3, Zequn Jie3, Zhijian Huang1, Chengjian Feng3,
Qingfang Zheng1, Lin Ma3, Xiangyuan Lan2 📧, Xiaodan Liang1 📧
1 Sun Yat-sen University, 2 Peng Cheng Laboratory, 3 Meituan Inc.
📧 Corresponding author
:boom: Updates
- 2025-08-06: Released the Technical Report.
- 2025-08-05: Released the Model Weights.
- 2025-07-26: Released the Online Demo.
:rocket: Introduction
- X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from segment anything to any segmentation and thereby enhancing pixel-level perceptual understanding.
- X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
- X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
:sparkles: HIGHLIGHT: This repository provides unified and effective code for training, evaluation, and visualization of segmentation MLLMs, including LLaVA-based MLLMs. We hope this repository will promote further research on MLLMs.
If you have any questions, please feel free to open an issue or contact me.
:bookmark: Abstract
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from segment anything to any segmentation. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
💻 Usage
This model can be used with the Hugging Face transformers library.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the model and processor. Ensure you have `bfloat16` support or adjust `torch_dtype`.
model_id = "hao9610/X-SAM"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Move the model to GPU if available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Example image and text prompt for Visual GrounDed (VGD) segmentation.
# Replace "path/to/your/image.jpg" with an actual image file path.
# For a sample image, you can download one from the project's GitHub repo, e.g.,
# https://github.com/wanghao9610/X-SAM/blob/main/docs/images/xsam_framework.png
# and save it as "example_image.png".
image = Image.open("path/to/your/image.jpg").convert("RGB")
prompt = "Segment all instances in this image and provide their bounding box coordinates."

# Prepare messages for the model's chat template.
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply the chat template and process the inputs.
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text_input], images=[image], return_tensors="pt")

# Move the inputs to the same device as the model.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate the output.
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Decode the generated text. The output includes special tokens for bounding
# boxes, e.g. <box>(x1,y1,x2,y2)</box>.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)[0]
print(generated_text)
# Expected output might look like: "object1 <box>(x1,y1,x2,y2)</box> object2 <box>(x1,y1,x2,y2)</box>"
```
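Since the decoded text interleaves object names with box tokens, a small regex pass can recover the name/box pairs. The snippet below is a minimal sketch assuming the exact `<box>(x1,y1,x2,y2)</box>` format shown in the expected output above; adjust the pattern if the model's special tokens differ.

```python
import re

# A minimal post-processing sketch, assuming the <box>(x1,y1,x2,y2)</box>
# token format shown in the expected output above.
box_pattern = re.compile(r"([^<>]+?)\s*<box>\((\d+),(\d+),(\d+),(\d+)\)</box>")
for name, x1, y1, x2, y2 in box_pattern.findall(generated_text):
    print(name.strip(), (int(x1), int(y1), int(x2), int(y2)))
```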
:mag: Overview
:bar_chart: Benchmarks
Please refer to the Benchmark Results for more details.
:checkered_flag: Getting Started
1. Structure
We provide a detailed project structure for X-SAM. Please follow this structure to organize the project.
📁 Structure (Click to expand)
```
X-SAM
├── datas
│   ├── gcg_seg_data
│   ├── gen_seg_data
│   ├── img_conv_data
│   ├── inter_seg_data
│   ├── LMUData
│   ├── ov_seg_data
│   ├── rea_seg_data
│   ├── ref_seg_data
│   └── vgd_seg_data
├── inits
│   ├── huggingface
│   ├── mask2former-swin-large-coco-panoptic
│   ├── Phi-3-mini-4k-instruct
│   ├── sam-vit-large
│   └── xsam
├── xsam
│   ├── docs
│   ├── requirements
│   └── xsam
│       ├── configs
│       ├── dataset
│       ├── demo
│       ├── engine
│       ├── evaluation
│       ├── model
│       ├── structures
│       ├── tools
│       └── utils
└── wkdrs
    ├── s1_seg_finetune
    │   └── ...
    ├── s2_align_pretrain
    │   └── ...
    ├── s3_mixed_finetune
    │   └── ...
    └── ...
...
```
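If you are organizing the project from scratch, a short script can scaffold the data and checkpoint directories. The sketch below is one way to do it; the directory names are taken verbatim from the tree above, and the `X-SAM` root path is an assumption you should adjust to your checkout.

```python
from pathlib import Path

# Hypothetical scaffold for the layout above; directory names are taken
# verbatim from the tree. Adjust ROOT to your checkout.
ROOT = Path("X-SAM")
DATA_DIRS = [
    "gcg_seg_data", "gen_seg_data", "img_conv_data", "inter_seg_data",
    "LMUData", "ov_seg_data", "rea_seg_data", "ref_seg_data", "vgd_seg_data",
]
INIT_DIRS = [
    "huggingface", "mask2former-swin-large-coco-panoptic",
    "Phi-3-mini-4k-instruct", "sam-vit-large", "xsam",
]
for d in DATA_DIRS:
    (ROOT / "datas" / d).mkdir(parents=True, exist_ok=True)
for d in INIT_DIRS:
    (ROOT / "inits" / d).mkdir(parents=True, exist_ok=True)
```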
2. Installation
We provide a detailed installation guide to create an environment for X-SAM; please refer to the following steps.
⚙️ Guide (Click to expand)
```bash
cd X-SAM
export root_dir=$(realpath ./)
cd $root_dir/xsam

# Optional: set CUDA_HOME for CUDA 12.4.
# X-SAM uses CUDA 12.4 by default; if your default CUDA is not 12.4, first
# export the CUDA_HOME environment variable manually.
export CUDA_HOME="your_cuda12.4_path"
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
echo -e "cuda version:\n$(nvcc -V)"

# Create the conda env for X-SAM.
conda create -n xsam python=3.10 -y
conda activate xsam
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia

# Install gcc 11 (optional).
conda install gcc=11 gxx=11 -c conda-forge -y

# Install xtuner v0.2.0 from source.
git clone -b v0.2.0 https://github.com/InternLM/xtuner.git
cd xtuner
pip install '.[all]'
cd ..

# Install deepspeed.
pip install -r requirements/deepspeed.txt

# Install the xsam requirements.
pip install -r requirements/xsam.txt

# Install flash-attention.
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Install VLMEvalKit for evaluation on VLM benchmarks (optional).
cd $root_dir
git clone -b v0.3rc1 https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

# Install aria2 for downloading datasets and models (optional).
pip install aria2
```
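After installation, a quick sanity check like the one below can catch PyTorch/CUDA mismatches early. This is a sketch; the expected versions in the comments mirror the pins above.

```python
import torch

# Quick sanity check that the pinned versions above were picked up correctly.
print("torch:", torch.__version__)            # expect 2.5.1
print("cuda:", torch.version.cuda)            # expect 12.4
print("cuda available:", torch.cuda.is_available())

import flash_attn
print("flash-attn:", flash_attn.__version__)  # expect 2.7.3
```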
3. Preparing
There are many datasets and models to prepare; please refer to Data Preparing and Model Preparing for more details.
4. Training & Evaluation
:sparkles: One Script for All!
🔥 Training (Click to expand)
Prepare the datasets and models, then refer to the following command to start training.

```bash
cd $root_dir
bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix WORK_DIR_SUFFIX
```

Stage 1: Segmentor Fine-tuning

```bash
cd $root_dir
bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_seg_finetune/xsam_sam_large_m2f_e36_gpu16_seg_finetune.py
```

Stage 2: Alignment Pre-training

```bash
cd $root_dir
bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_align_pretrain/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_e1_gpu16_align_pretrain.py
```

Stage 3: Mixed Fine-tuning

```bash
# NOTE: Training for Mixed Fine-tuning will be made available once the repo reaches 500 ⭐.
bash runs/run.sh --modes train,segeval,vlmeval,visualize --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py
```
🧪 Evaluation (Click to expand)
Download the pre-trained model from HuggingFace 🤗 (details in Model Preparing), and put it in the $root_dir/inits directory.
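If you prefer to fetch the weights programmatically, `huggingface_hub` works as well. This is a sketch; the `local_dir` target is an assumption based on the project structure above, so adjust it to your layout.

```python
from huggingface_hub import snapshot_download

# Fetch the released X-SAM weights. The local_dir below is an assumption
# based on the project structure above; adjust it to your layout.
snapshot_download(repo_id="hao9610/X-SAM", local_dir="inits/xsam")
```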
```bash
cd $root_dir
bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix SUFFIX
```

Evaluate on all segmentation benchmarks

```bash
cd $root_dir
# Evaluate on all segmentation benchmarks.
# NOTE: ONLY generic segmentation and VGD segmentation are supported NOW.
bash runs/run.sh --modes segeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
```

Evaluate on all VLM benchmarks

```bash
cd $root_dir
# Evaluate on all VLM benchmarks.
bash runs/run.sh --modes vlmeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
```
:computer: Demo
Coming soon...
:white_check_mark: TODO
- Release the Online Demo.
- Release the Model Weights.
- Release the Technical Report.
- Release the code for training LLaVA-based MLLMs.
- Release the code for evaluation on all VLM Benchmarks.
- Release the code and instructions for demo deployment.
- Release the code for evaluation on all segmentation benchmarks.
- Release the code for training X-SAM (more than 500 π).
:blush: Acknowledgement
This project references several excellent open-source repos (xtuner, VLMEvalKit, Sa2VA). Thanks for their wonderful work and contributions to the community.
:pushpin: Citation
If you find X-SAM helpful for your research or applications, please consider giving us a star ⭐ and citing it with the following BibTeX entry.
```bibtex
@article{wang2025xsam,
  title={X-SAM: From Segment Anything to Any Segmentation},
  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2508.04655},
  year={2025}
}
```