---
language:
- en
license: apache-2.0
tags:
- MLLM
pipeline_tag: image-segmentation
library_name: transformers
---
<div align="center">
<h1>✨X-SAM </h1>
<h3>From Segment Anything to Any Segmentation</h3>
[Hao Wang](https://github.com/wanghao9610)<sup>1,2</sup>, [Limeng Qiao](https://scholar.google.com/citations?user=3PFZAg0AAAAJ&hl=en)<sup>3</sup>, [Zequn Jie](https://scholar.google.com/citations?user=4sKGNB0AAAAJ&hl)<sup>3</sup>, [Zhijian Huang](https://zhijian11.github.io/)<sup>1</sup>, [Chengjian Feng](https://fcjian.github.io/)<sup>3</sup>,
[Qingfang Zheng](https://openreview.net/profile?id=%7EZheng_Qingfang1)<sup>1</sup>, [Lin Ma](https://forestlinma.com/)<sup>3</sup>, [Xiangyuan Lan](https://scholar.google.com/citations?user=c3iwWRcAAAAJ&hl)<sup>2 📧</sup>, [Xiaodan Liang](https://scholar.google.com/citations?user=voxznZAAAAAJ&hl)<sup>1 📧</sup>
<sup>1</sup> Sun Yat-sen University, <sup>2</sup> Peng Cheng Laboratory, <sup>3</sup> Meituan Inc.
<sup>📧</sup> Corresponding author
</div>
<div align="center" style="display: flex; justify-content: center; align-items: center;">
<a href="https://arxiv.org/abs/2508.04655" style="margin: 0 2px;">
<img src='https://img.shields.io/badge/arXiv-2508.04655-red?style=flat&logo=arXiv&logoColor=red' alt='arxiv'>
</a>
<a href='https://huggingface.co/hao9610/X-SAM' style="margin: 0 2px;">
<img src='https://img.shields.io/badge/Hugging Face-ckpts-orange?style=flat&logo=HuggingFace&logoColor=orange' alt='huggingface'>
</a>
<a href="https://github.com/wanghao9610/X-SAM" style="margin: 0 2px;">
<img src='https://img.shields.io/badge/GitHub-Repo-blue?style=flat&logo=GitHub' alt='GitHub'>
</a>
<a href="http://47.115.200.157:7861" style="margin: 0 2px;">
<img src='https://img.shields.io/badge/Demo-Gradio-gold?style=flat&logo=Gradio&logoColor=red' alt='Demo'>
</a>
<a href='https://wanghao9610.github.io/X-SAM/' style="margin: 0 2px;">
<img src='https://img.shields.io/badge/🌐_Project-Webpage-green?style=flat&logoColor=white' alt='webpage'>
</a>
</div>
## :boom: Updates
- **`2025-08-06`**: Released the [Technical Report](https://arxiv.org/pdf/2508.04655).
- **`2025-08-05`**: Released the [Model Weights](https://huggingface.co/hao9610/X-SAM).
- **`2025-07-26`**: Released the [Online Demo](http://47.115.200.157:7861).
## :rocket: Introduction
* X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.
* X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
* X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
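As a purely illustrative sketch of how an interactive visual prompt for VGD segmentation might be serialized into a text query (the `<point>`/`<box>` tags and the helper below are hypothetical, not X-SAM's actual interface):

```python
def build_vgd_prompt(visual_prompts):
    """Serialize (kind, coords) visual prompts into a text instruction.

    The <point>/<box> tag format here is hypothetical, for illustration only.
    """
    regions = ", ".join(
        f"<{kind}>({', '.join(str(c) for c in coords)})</{kind}>"
        for kind, coords in visual_prompts
    )
    return f"Segment all instances indicated by the visual prompts: {regions}."


# A point click and a user-drawn box as interactive prompts
print(build_vgd_prompt([("point", (120, 200)), ("box", (10, 20, 300, 400))]))
```

The point here is only that VGD queries pair ordinary text instructions with structured visual-prompt coordinates; the real tokenization is handled by the model's processor.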
:sparkles: **HIGHLIGHT**: This repository provides unified and effective code for training, evaluation, and visualization of segmentation MLLMs, including LLaVA-based MLLMs. We hope this repository will promote further research on MLLMs.
*If you have any questions, please feel free to open an issue or [contact me](mailto:wanghao9610@gmail.com).*
## :bookmark: Abstract
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
## 💻 Usage
This model can be used with the Hugging Face `transformers` library.
```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch
# Load model and processor. Ensure you have `bfloat16` support or adjust `torch_dtype`.
model_id = "hao9610/X-SAM"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)
# Move model to GPU if available
if torch.cuda.is_available():
    model = model.to("cuda")
# Example image and text prompt for Visual Grounded Segmentation
# Replace "path/to/your/image.jpg" with an actual image file path
# For a sample image, you can download one from the project's GitHub repo, e.g.,
# https://github.com/wanghao9610/X-SAM/blob/main/docs/images/xsam_framework.png
# and save it as "example_image.png"
image = Image.open("path/to/your/image.jpg").convert("RGB")
prompt = "Segment all instances in this image and provide their bounding box coordinates."
# Prepare messages for the model's chat template
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]
# Apply chat template and process inputs
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text_input], images=[image], return_tensors="pt")
# Move inputs to the same device as the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Generate output
# Generate the output, then drop the prompt tokens so only the response is decoded
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
# The output may include special tokens for bounding boxes (e.g., <box>(x1,y1,x2,y2)</box>)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)[0]
print(generated_text)
# Expected output might look like: "object1 <box>(x1,y1,x2,y2)</box> object2 <box>(x1,y1,x2,y2)</box>"
```
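If the decoded text contains box tokens in the `<box>(x1,y1,x2,y2)</box>` form shown above, they can be pulled out with a small helper. This is a hedged sketch: the exact token format may vary between checkpoints, so treat the regex as illustrative.

```python
import re

# Matches "label <box>(x1,y1,x2,y2)</box>" spans; the tag format is assumed
# from the example output above, not guaranteed by the model.
_BOX_RE = re.compile(r"(\w[\w ]*?)\s*<box>\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)</box>")


def parse_boxes(text):
    """Return a list of (label, (x1, y1, x2, y2)) tuples found in text."""
    results = []
    for m in _BOX_RE.finditer(text):
        label = m.group(1).strip()
        box = tuple(int(m.group(i)) for i in range(2, 6))
        results.append((label, box))
    return results


print(parse_boxes("object1 <box>(1,2,3,4)</box> object2 <box>(5,6,7,8)</box>"))
```

The parsed coordinates can then be drawn on the input image with any plotting library to visually check the segmentation grounding.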
## :mag: Overview
<img src="docs/images/xsam_framework.png" width="800">
## :bar_chart: Benchmarks
Please refer to the [Benchmark Results](docs/benchmark_results.md) for more details.
## :checkered_flag: Getting Started
### 1. Structure
We provide a detailed project structure for X-SAM. Please follow this structure to organize the project.
<details>
<summary>📁 Structure (Click to expand)</summary>

```bash
X-SAM
├── datas
│   ├── gcg_seg_data
│   ├── gen_seg_data
│   ├── img_conv_data
│   ├── inter_seg_data
│   ├── LMUData
│   ├── ov_seg_data
│   ├── rea_seg_data
│   ├── ref_seg_data
│   └── vgd_seg_data
├── inits
│   ├── huggingface
│   ├── mask2former-swin-large-coco-panoptic
│   ├── Phi-3-mini-4k-instruct
│   ├── sam-vit-large
│   └── xsam
├── xsam
│   ├── docs
│   ├── requirements
│   ├── xsam
│   │   ├── configs
│   │   ├── dataset
│   │   ├── demo
│   │   ├── engine
│   │   ├── evaluation
│   │   ├── model
│   │   ├── structures
│   │   ├── tools
│   │   └── utils
├── wkdrs
│   ├── s1_seg_finetune
│   │   ├── ...
│   ├── s2_align_pretrain
│   │   ├── ...
│   ├── s3_mixed_finetune
│   │   ├── ...
│   ├── ...
...
```
</details>
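The top-level skeleton above can be bootstrapped with a short shell snippet (directory names are taken from the tree; create only the subsets you actually need):

```shell
# Create the top-level X-SAM directory skeleton (names from the structure above).
for d in gcg_seg_data gen_seg_data img_conv_data inter_seg_data LMUData \
         ov_seg_data rea_seg_data ref_seg_data vgd_seg_data; do
    mkdir -p "datas/$d"
done
mkdir -p inits wkdrs xsam
```

The dataset and checkpoint contents of these directories are described in the preparing docs below.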
### 2. Installation
We provide a detailed installation guide to create an environment for X-SAM; please follow the steps below.
<details>
<summary>⚙️ Guide (Click to expand)</summary>

```bash
cd X-SAM
export root_dir=$(realpath ./)
cd $root_dir/xsam
# Optional: set CUDA_HOME for CUDA 12.4.
# X-SAM uses CUDA 12.4 by default; if your CUDA version differs, export the CUDA_HOME environment variable manually first.
export CUDA_HOME="your_cuda12.4_path"
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
echo -e "cuda version:\n$(nvcc -V)"
# create conda env for X-SAM
conda create -n xsam python=3.10 -y
conda activate xsam
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
# install gcc 11 (optional)
conda install gcc=11 gxx=11 -c conda-forge -y
# install xtuner v0.2.0 (cloned so the editable-style install with extras works)
git clone -b v0.2.0 https://github.com/InternLM/xtuner.git
cd xtuner
pip install '.[all]'
cd $root_dir/xsam
# install deepspeed
pip install -r requirements/deepspeed.txt
# install xsam requirements
pip install -r requirements/xsam.txt
# install flash-attention
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# install VLMEvalKit for evaluation on VLM benchmarks (optional)
cd $root_dir
git clone -b v0.3rc1 https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .
# install aria2 for downloading datasets and models (optional; aria2 is a CLI tool, not a pip package)
conda install aria2 -c conda-forge -y
```
</details>
### 3. Preparing
Several datasets and models need to be prepared; please refer to [Data Preparing](docs/data_preparing.md) and [Model Preparing](docs/model_preparing.md) for details.
### 4. Training & Evaluation
:sparkles: **One Script for All!**
<details>
<summary>🔥 Training (Click to expand)</summary>

Prepare the [Datasets](docs/data_preparing.md) and [Models](docs/model_preparing.md), and then refer to the following command to start training.
```bash
cd $root_dir
bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix WORK_DIR_SUFFIX
```
##### Stage 1: Segmentor Fine-tuning
```bash
cd $root_dir
bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_seg_finetune/xsam_sam_large_m2f_e36_gpu16_seg_finetune.py
```
##### Stage 2: Alignment Pre-training
```bash
cd $root_dir
bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_align_pretrain/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_e1_gpu16_align_pretrain.py
```
##### Stage 3: Mixed Fine-tuning
```bash
# NOTE: Training code for mixed fine-tuning will be released once the repository reaches 500 stars 🌟.
bash runs/run.sh --modes train,segeval,vlmeval,visualize --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py
```
</details>
<details>
<summary>🧪 Evaluation (Click to expand)</summary>

Download the pre-trained model from [HuggingFace🤗](https://huggingface.co/hao9610/X-SAM) (details in [Model Preparing](docs/model_preparing.md)), and place it in the `$root_dir/inits` directory.
```bash
cd $root_dir
bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix SUFFIX
```
##### Evaluate on all segmentation benchmarks
```bash
cd $root_dir
# Evaluate on all segmentation benchmarks.
# NOTE: ONLY generic segmentation and VGD segmentation are supported NOW.
bash runs/run.sh --modes segeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
```
##### Evaluate on all VLM benchmarks
```bash
cd $root_dir
# Evaluate on all VLM benchmarks.
bash runs/run.sh --modes vlmeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
```
</details>
## :computer: Demo
Coming soon...
## :white_check_mark: TODO
- [x] Release the [Online Demo](http://47.115.200.157:7861).
- [x] Release the [Model Weights](https://huggingface.co/hao9610/X-SAM).
- [x] Release the [Technical Report](https://arxiv.org/abs/2508.04655).
- [ ] Release the code for training LLaVA-based MLLMs.
- [ ] Release the code for evaluation on all VLM Benchmarks.
- [ ] Release the code and instructions for demo deployment.
- [ ] Release the code for evaluation on all segmentation benchmarks.
- [ ] Release the code for training X-SAM (once the repository reaches 500 stars 🌟).
## :blush: Acknowledgement
This project builds on several excellent open-source repositories ([xtuner](https://github.com/InternLM/xtuner), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [Sa2VA](https://github.com/magic-research/Sa2VA)). Thanks for their wonderful work and contributions to the community.
## :pushpin: Citation
If you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
```bibtex
@article{wang2025xsam,
  title={X-SAM: From Segment Anything to Any Segmentation},
  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2508.04655},
  year={2025}
}
```