hao9610/X-SAM

Improve model card: Add pipeline tag, library name, and usage example

by nielsr HF Staff - opened Aug 7, 2025

base: refs/heads/main ← from: refs/pr/1

README.md +253 -9


@@ -1,9 +1,11 @@
 ---
-license: apache-2.0
 language:
 - en
 tags:
 - MLLM
 ---
 <div align="center">
@@ -37,22 +39,265 @@ tags:
   </a>
 </div>
-## 🚀 Introduction
 * X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.
 * X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
-* X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on various image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
-## 🔖 Abstract
-Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.
-👉 **More details can be found in [GitHub](https://github.com/wanghao9610/X-SAM).**
-## 📌 Citation
-If you find X-SAM is helpful for your research or applications, please consider giving us a like 💖 and citing it by the following BibTex entry.
 ```bibtex
 @article{wang2025xsam,
@@ -61,5 +306,4 @@ If you find X-SAM is helpful for your research or applications, please consider
   journal={arXiv preprint arXiv:2508.04655},
   year={2025}
 }
 ```

 ---
 language:
 - en
+license: apache-2.0
 tags:
 - MLLM
+pipeline_tag: image-segmentation
+library_name: transformers
 ---
 <div align="center">
   </a>
 </div>
+## :boom: Updates
+- **`2025-08-06`**: Released the [Technical Report](https://arxiv.org/pdf/2508.04655).
+- **`2025-08-05`**: Released the [Model Weights](https://huggingface.co/hao9610/X-SAM).
+- **`2025-07-26`**: Released the [Online Demo](http://47.115.200.157:7861).
+## :rocket: Introduction
 * X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.
 * X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
+* X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
+:sparkles: **HIGHLIGHT**: This repository provides unified and effective code for training, evaluation, and visualization of segmentation MLLMs, including LLaVA-based MLLMs. We hope this repository will promote further research on MLLMs.
+*If you have any questions, please feel free to open an issue or [contact me](mailto:wanghao9610@gmail.com).*
+## :bookmark: Abstract
+Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
+## 💻 Usage
+This model can be used with the Hugging Face `transformers` library.
+```python
+from transformers import AutoProcessor, AutoModelForCausalLM
+from PIL import Image
+import torch
+# Load model and processor. Ensure you have `bfloat16` support or adjust `torch_dtype`.
+model_id = "hao9610/X-SAM"
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)
+# Move model to GPU if available
+if torch.cuda.is_available():
+    model = model.to("cuda")
+# Example image and text prompt for Visual Grounded Segmentation
+# Replace "example_image.png" with the path to your own image file.
+# For a sample image, you can download one from the project's GitHub repo, e.g.,
+# https://github.com/wanghao9610/X-SAM/blob/main/docs/images/xsam_framework.png
+# and save it as "example_image.png"
+image = Image.open("example_image.png").convert("RGB")
+prompt = "Segment all instances in this image and provide their bounding box coordinates."
+# Prepare messages for the model's chat template
+messages = [
+    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
+]
+# Apply chat template and process inputs
+text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = processor(text=[text_input], images=[image], return_tensors="pt")
+# Move inputs to the same device as the model
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+# Generate output
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+# Decode the generated text
+# The output will include special tokens for bounding boxes (e.g., <box>(x1,y1,x2,y2)</box>)
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)[0]
+print(generated_text)
+# Expected output might look like: "object1 <box>(x1,y1,x2,y2)</box> object2 <box>(x1,y1,x2,y2)</box>"
+```
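The comments in the usage example describe the generated text as labels followed by `<box>(x1,y1,x2,y2)</box>` tags. The following is a minimal sketch of how such text could be turned into structured data, assuming the output really follows that pattern. The helper name `parse_boxes` and the regular expression are illustrative assumptions, not part of X-SAM or `transformers`.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]

# Assumed format, taken from the comments in the usage example:
#   "label <box>(x1,y1,x2,y2)</box> label <box>(x1,y1,x2,y2)</box>"
BOX_PATTERN = re.compile(
    r"(?P<label>[^<>]*?)\s*<box>\(\s*(?P<coords>[^()]*?)\s*\)</box>"
)


def parse_boxes(generated_text: str) -> List[Tuple[str, Box]]:
    """Return (label, (x1, y1, x2, y2)) pairs found in the generated text."""
    results: List[Tuple[str, Box]] = []
    for match in BOX_PATTERN.finditer(generated_text):
        parts = [p.strip() for p in match.group("coords").split(",")]
        if len(parts) != 4:
            continue  # skip boxes that do not have exactly four coordinates
        try:
            x1, y1, x2, y2 = (float(p) for p in parts)
        except ValueError:
            continue  # skip boxes with non-numeric coordinates
        results.append((match.group("label").strip(), (x1, y1, x2, y2)))
    return results
```

If the real output uses a different tag or coordinate convention, only `BOX_PATTERN` should need to change.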
+## :mag: Overview
+<img src="docs/images/xsam_framework.png" width="800">
+## :bar_chart: Benchmarks
+Please refer to the [Benchmark Results](docs/benchmark_results.md) for more details.
+## :checkered_flag: Getting Started
+### 1. Structure
+We provide a detailed project structure for X-SAM. Please follow this structure to organize the project.
+<details>
+<summary>📁 Structure (Click to expand)</summary>
+```bash
+X-SAM
+├── datas
+│   ├── gcg_seg_data
+│   ├── gen_seg_data
+│   ├── img_conv_data
+│   ├── inter_seg_data
+│   ├── LMUData
+│   ├── ov_seg_data
+│   ├── rea_seg_data
+│   ├── ref_seg_data
+│   └── vgd_seg_data
+├── inits
+│   ├── huggingface
+│   ├── mask2former-swin-large-coco-panoptic
+│   ├── Phi-3-mini-4k-instruct
+│   ├── sam-vit-large
+│   └── xsam
+├── xsam
+│   ├── docs
+│   ├── requirements
+│   ├── xsam
+│   │   ├── configs
+│   │   ├── dataset
+│   │   ├── demo
+│   │   ├── engine
+│   │   ├── evaluation
+│   │   ├── model
+│   │   ├── structures
+│   │   ├── tools
+│   │   └── utils
+├── wkdrs
+│   ├── s1_seg_finetune
+│   │   ├── ...
+│   ├── s2_align_pretrain
+│   │   ├── ...
+│   ├── s3_mixed_finetune
+│   │   ├── ...
+│   ├── ...
+...
+```
+</details>
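Before training or evaluation, it may help to confirm that the layout above is in place. The following is a minimal sketch that checks only the directories named in the listing. `EXPECTED_DIRS` and `missing_dirs` are hypothetical names introduced for illustration and are not part of the X-SAM codebase.

```python
from pathlib import Path
from typing import Dict, List

# Directory names copied from the structure listing above.
# Elided entries ("...") are intentionally not filled in.
EXPECTED_DIRS: Dict[str, List[str]] = {
    "datas": [
        "gcg_seg_data", "gen_seg_data", "img_conv_data", "inter_seg_data",
        "LMUData", "ov_seg_data", "rea_seg_data", "ref_seg_data", "vgd_seg_data",
    ],
    "inits": [
        "huggingface", "mask2former-swin-large-coco-panoptic",
        "Phi-3-mini-4k-instruct", "sam-vit-large", "xsam",
    ],
    "xsam": ["docs", "requirements", "xsam"],
    "wkdrs": [],
}


def missing_dirs(root: Path) -> List[str]:
    """Return the expected directories (relative to root) that do not exist."""
    missing: List[str] = []
    for top, children in EXPECTED_DIRS.items():
        for rel in [top] + [f"{top}/{child}" for child in children]:
            if not (root / rel).is_dir():
                missing.append(rel)
    return missing
```

For example, `missing_dirs(Path("X-SAM"))` returns an empty list once every listed directory exists.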
+### 2. Installation
+We provide a detailed installation guide to create an environment for X-SAM. Please follow the steps below.
+<details>
+<summary>⚙️ Guide (Click to expand)</summary>
+```bash
+cd X-SAM
+export root_dir=$(realpath ./)
+cd $root_dir/xsam
+# Optional: set CUDA_HOME for cuda12.4.
+# X-SAM uses CUDA 12.4 by default. If your default CUDA is not 12.4, export CUDA_HOME manually first.
+export CUDA_HOME="your_cuda12.4_path"
+export PATH=$CUDA_HOME/bin:$PATH
+export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
+echo -e "cuda version:\n$(nvcc -V)"
+# create conda env for X-SAM
+conda create -n xsam python=3.10 -y
+conda activate xsam
+conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
+# install gcc11(optional)
+conda install gcc=11 gxx=11 -c conda-forge -y
+# install xtuner0.2.0
+pip install git+https://github.com/InternLM/xtuner.git@v0.2.0
+cd xtuner
+pip install '.[all]'
+# install deepspeed
+pip install -r requirements/deepspeed.txt
+# install xsam requirements
+pip install -r requirements/xsam.txt
+# install flash-attention
+pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+# install VLMEvalKit for evaluation on VLM benchmarks(optional)
+cd $root_dir
+git clone -b v0.3rc1 https://github.com/open-compass/VLMEvalKit.git
+cd VLMEvalKit
+pip install -e .
+# install aria2 for downloading datasets and models(optional)
+pip install aria2
+```
+</details>
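After installation, a quick check can confirm that the pinned versions were actually installed. The following is a minimal sketch using only the standard library. The version pins are copied from the commands above, while the distribution names (`torch` for the conda package `pytorch`, `flash_attn` for the wheel) and all function names are assumptions made for illustration.

```python
from importlib import metadata
from typing import Dict, Optional

# Pins copied from the installation guide above.
# Distribution names are assumptions: conda's "pytorch" is assumed to
# register as "torch", and the flash-attention wheel as "flash_attn".
PINNED: Dict[str, str] = {
    "torch": "2.5.1",
    "torchvision": "0.20.1",
    "torchaudio": "2.5.1",
    "flash_attn": "2.7.3",
}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """Compare versions, ignoring a local suffix such as '+cu124'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == pinned


def check_environment(pins: Dict[str, str] = PINNED) -> Dict[str, bool]:
    """Map each pinned distribution to whether its installed version matches."""
    return {name: matches_pin(installed_version(name), pin) for name, pin in pins.items()}
```

Calling `check_environment()` inside the `xsam` conda environment returns a dictionary whose `False` entries point to packages worth reinstalling.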
+### 3. Preparing
+There are many datasets and models to prepare. Please refer to [Data Preparing](docs/data_preparing.md) and [Model Preparing](docs/model_preparing.md) for more details.
+### 4. Training & Evaluation
+:sparkles: **One Script for All !**
+<details>
+<summary>🔥 Training (Click to expand)</summary>
+Prepare the [Datasets](docs/data_preparing.md) and [Models](docs/model_preparing.md), and then refer to the following command to start training.
+```bash
+cd $root_dir
+bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix WORK_DIR_SUFFIX
+```
+##### Stage 1: Segmentor Fine-tuning
+```bash
+cd $root_dir
+bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_seg_finetune/xsam_sam_large_m2f_e36_gpu16_seg_finetune.py
+```
+##### Stage 2: Alignment Pre-training
+```bash
+cd $root_dir
+bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_align_pretrain/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_e1_gpu16_align_pretrain.py
+```
+##### Stage 3: Mixed Fine-tuning
+```bash
+# NOTE: Training code for Mixed Fine-tuning will be released once the repository has more than 500 🌟.
+bash runs/run.sh --modes train,segeval,vlmeval,visualize --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py
+```
+</details>
+<details>
+<summary>🧪 Evaluation (Click to expand)</summary>
+Download the pre-trained model from [HuggingFace🤗](https://huggingface.co/hao9610/X-SAM) (details in [Model Preparing](docs/model_preparing.md)), and place it in the `$root_dir/inits` directory.
+```bash
+cd $root_dir
+bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix SUFFIX
+```
+##### Evaluate on all segmentation benchmarks
+```bash
+cd $root_dir
+# Evaluate on all segmentation benchmarks.
+# NOTE: ONLY generic segmentation and VGD segmentation are supported NOW.
+bash runs/run.sh --modes segeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
+```
+##### Evaluate on all VLM benchmarks
+```bash
+cd $root_dir
+# Evaluate on all VLM benchmarks.
+bash runs/run.sh --modes vlmeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
+```
+</details>
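The training and evaluation commands above all share one shape: `bash runs/run.sh` with `--modes`, `--config`, and the optional `--work-dir` and `--suffix` flags. The following is a minimal sketch of assembling that command from Python, for example when launching several runs from a script. The function name `build_run_command` is hypothetical, and the sketch only uses the flags shown above.

```python
import shlex
from typing import List, Optional, Sequence


def build_run_command(
    modes: Sequence[str],
    config: str,
    work_dir: Optional[str] = None,
    suffix: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for runs/run.sh using the flags shown above."""
    if not modes:
        raise ValueError("at least one mode is required, e.g. 'train' or 'segeval'")
    # Modes are passed as one comma-separated value, as in "train,segeval,vlmeval,visualize".
    cmd = ["bash", "runs/run.sh", "--modes", ",".join(modes), "--config", config]
    if work_dir:
        cmd += ["--work-dir", work_dir]
    if suffix:
        cmd += ["--suffix", suffix]
    return cmd


if __name__ == "__main__":
    # "path/to/config.py" is a placeholder, not a real X-SAM config path.
    command = build_run_command(["segeval"], "path/to/config.py", work_dir="path/to/work_dir")
    print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, cwd=root_dir, check=True)`, which avoids shell quoting issues with long config paths.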
+## :computer: Demo
+Coming soon...
+## :white_check_mark: TODO
+- [x] Release the [Online Demo](http://47.115.200.157:7861).
+- [x] Release the [Model Weights](https://huggingface.co/hao9610/X-SAM).
+- [x] Release the [Technical Report](https://arxiv.org/abs/2508.04655).
+- [ ] Release the code for training LLaVA-based MLLMs.
+- [ ] Release the code for evaluation on all VLM Benchmarks.
+- [ ] Release the code and instructions for demo deployment.
+- [ ] Release the code for evaluation on all segmentation benchmarks.
+- [ ] Release the code for training X-SAM (more than 500 🌟).
+## :blush: Acknowledgements
+This project references several excellent open-source repositories ([xtuner](https://github.com/InternLM/xtuner), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [Sa2VA](https://github.com/magic-research/Sa2VA)). Thanks to their authors for their wonderful work and contributions to the community.
+## :pushpin: Citation
+If you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
 ```bibtex
 @article{wang2025xsam,
   journal={arXiv preprint arXiv:2508.04655},
   year={2025}
 }
 ```