Improve model card: Add pipeline tag, library name, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +253 -9
README.md CHANGED
@@ -1,9 +1,11 @@
1
  ---
2
- license: apache-2.0
3
  language:
4
  - en
5
  tags:
6
  - MLLM
7
  ---
8
 
9
  <div align="center">
@@ -37,22 +39,265 @@ tags:
37
  </a>
38
  </div>
39
 
40
- ## 🚀 Introduction
41
 
42
  * X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.
43
 
44
  * X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
45
 
46
- * X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on various image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
47
 
48
- ## 🔖 Abstract
49
 
50
- Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding.
51
 
52
- 👉 **More details can be found in [GitHub](https://github.com/wanghao9610/X-SAM).**
53
 
54
- ## 📌 Citation
55
- If you find X-SAM is helpful for your research or applications, please consider giving us a like 💖 and citing it by the following BibTex entry.
56
 
57
  ```bibtex
58
  @article{wang2025xsam,
@@ -61,5 +306,4 @@ If you find X-SAM is helpful for your research or applications, please consider
61
  journal={arXiv preprint arXiv:2508.04655},
62
  year={2025}
63
  }
64
-
65
  ```
 
1
  ---
2
  language:
3
  - en
4
+ license: apache-2.0
5
  tags:
6
  - MLLM
7
+ pipeline_tag: image-segmentation
8
+ library_name: transformers
9
  ---
10
 
11
  <div align="center">
 
39
  </a>
40
  </div>
41
 
42
+ ## :boom: Updates
43
+
44
+ - **`2025-08-06`**: Released the [Technical Report](https://arxiv.org/pdf/2508.04655).
45
+ - **`2025-08-05`**: Released the [Model Weights](https://huggingface.co/hao9610/X-SAM).
46
+ - **`2025-07-26`**: Released the [Online Demo](http://47.115.200.157:7861).
47
+
48
+ ## :rocket: Introduction
49
 
50
  * X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.
51
 
52
  * X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
53
 
54
+ * X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.
55
+
56
+ :sparkles: **HIGHLIGHT**: This repository provides unified and effective code for training, evaluation, and visualization of segmentation MLLMs, including LLaVA-based MLLMs. We hope this repository will promote further research on MLLMs.
57
+
58
+ *If you have any questions, please feel free to open an issue or [contact me](mailto:wanghao9610@gmail.com).*
59
+
60
+ ## :bookmark: Abstract
61
+
62
+ Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
63
+
64
+ ## 💻 Usage
65
+
66
+ This model can be used with the Hugging Face `transformers` library.
67
+
68
+ ```python
69
+ from transformers import AutoProcessor, AutoModelForCausalLM
70
+ from PIL import Image
71
+ import torch
72
+
73
+ # Load model and processor. Ensure you have `bfloat16` support or adjust `torch_dtype`.
74
+ model_id = "hao9610/X-SAM"
75
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
76
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)
77
+
78
+ # Move model to GPU if available
79
+ if torch.cuda.is_available():
80
+     model = model.to("cuda")
81
+
82
+ # Example image and text prompt for Visual Grounded Segmentation
83
+ # Replace "path/to/your/image.jpg" with an actual image file path
84
+ # For a sample image, you can download one from the project's GitHub repo, e.g.,
85
+ # https://github.com/wanghao9610/X-SAM/blob/main/docs/images/xsam_framework.png
86
+ # and save it as "example_image.png"
87
+ image = Image.open("path/to/your/image.jpg").convert("RGB")
88
+ prompt = "Segment all instances in this image and provide their bounding box coordinates."
89
+
90
+ # Prepare messages for the model's chat template
91
+ messages = [
92
+ {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
93
+ ]
94
+
95
+ # Apply chat template and process inputs
96
+ text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
97
+ inputs = processor(text=[text_input], images=[image], return_tensors="pt")
98
+
99
+ # Move inputs to the same device as the model
100
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
101
+
102
+ # Generate output
103
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
104
+
105
+ # Decode the generated text
106
+ # The output will include special tokens for bounding boxes (e.g., <box>(x1,y1,x2,y2)</box>)
107
+ generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)[0]
108
+
109
+ print(generated_text)
110
+ # Expected output might look like: "object1 <box>(x1,y1,x2,y2)</box> object2 <box>(x1,y1,x2,y2)</box>"
111
+ ```
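The comment in the usage example suggests that generated text may contain entries of the form `label <box>(x1,y1,x2,y2)</box>`. The sketch below shows one way to turn such text into structured data. It is a minimal illustration that assumes exactly that format; the helper name `parse_boxes` and the regular expression are hypothetical and are not part of X-SAM or `transformers`, and the real output format should be checked against the model before relying on it.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]

# Assumed format (taken from the usage example's comment, not verified):
#   "label <box>(x1,y1,x2,y2)</box>"
BOX_PATTERN = re.compile(
    r"(?P<label>[^<>]*?)\s*<box>\(\s*(?P<coords>[^()]*?)\s*\)</box>"
)


def parse_boxes(generated_text: str) -> List[Tuple[str, Box]]:
    """Extract (label, (x1, y1, x2, y2)) pairs from generated text.

    Entries that do not contain exactly four numeric coordinates are skipped.
    """
    results: List[Tuple[str, Box]] = []
    for match in BOX_PATTERN.finditer(generated_text):
        parts = [p.strip() for p in match.group("coords").split(",")]
        if len(parts) != 4:
            continue  # malformed entry: wrong number of coordinates
        try:
            x1, y1, x2, y2 = (float(p) for p in parts)
        except ValueError:
            continue  # malformed entry: non-numeric coordinate
        results.append((match.group("label").strip(), (x1, y1, x2, y2)))
    return results


if __name__ == "__main__":
    # Hypothetical sample string in the assumed format.
    sample = "cat <box>(10,20,110,220)</box> dog <box>(5.5,6,70,80)</box>"
    for label, box in parse_boxes(sample):
        print(label, box)
```

Skipping malformed entries instead of raising keeps the helper usable on partially truncated generations, for example when `max_new_tokens` cuts off the last box.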
112
+
113
+ ## :mag: Overview
114
+
115
+ <img src="docs/images/xsam_framework.png" width="800">
116
+
117
+ ## :bar_chart: Benchmarks
118
+
119
+ Please refer to the [Benchmark Results](docs/benchmark_results.md) for more details.
120
+
121
+ ## :checkered_flag: Getting Started
122
+ ### 1. Structure
123
+ We provide a detailed project structure for X-SAM. Please follow this structure to organize the project.
124
+
125
+ <details>
126
+ <summary>📁 Structure (Click to expand)</summary>
127
 
128
+ ```bash
129
+ X-SAM
130
+ ├── datas
131
+ │   ├── gcg_seg_data
132
+ │   ├── gen_seg_data
133
+ │   ├── img_conv_data
134
+ │   ├── inter_seg_data
135
+ │   ├── LMUData
136
+ │   ├── ov_seg_data
137
+ │   ├── rea_seg_data
138
+ │   ├── ref_seg_data
139
+ │   └── vgd_seg_data
140
+ ├── inits
141
+ │   ├── huggingface
142
+ │   ├── mask2former-swin-large-coco-panoptic
143
+ │   ├── Phi-3-mini-4k-instruct
144
+ │   ├── sam-vit-large
145
+ │   └── xsam
146
+ ├── xsam
147
+ │   ├── docs
148
+ │   ├── requirements
149
+ │   ├── xsam
150
+ │   │   ├── configs
151
+ │   │   ├── dataset
152
+ │   │   ├── demo
153
+ │   │   ├── engine
154
+ │   │   ├── evaluation
155
+ │   │   ├── model
156
+ │   │   ├── structures
157
+ │   │   ├── tools
158
+ │   │   └── utils
159
+ ├── wkdrs
160
+ │   ├── s1_seg_finetune
161
+ │   │   ├── ...
162
+ │   ├── s2_align_pretrain
163
+ │   │   ├── ...
164
+ │   ├── s2_mixed_finetune
165
+ │   │   ├── ...
166
+ │   ├── ...
167
+ ...
168
+ ```
169
+ </details>
170
 
171
+ ### 2. Installation
172
+ We provide a detailed installation guide for creating an environment for X-SAM. Please follow the steps below.
173
 
174
+ <details>
175
+ <summary>⚙️ Guide (Click to expand)</summary>
176
 
177
+ ```bash
178
+ cd X-SAM
179
+ export root_dir=$(realpath ./)
180
+ cd $root_dir/xsam
181
+
182
+ # Optional: set CUDA_HOME for cuda12.4.
183
+ # X-SAM uses CUDA 12.4 by default. If your default CUDA is not 12.4, first export CUDA_HOME manually.
184
+ export CUDA_HOME="your_cuda12.4_path"
185
+ export PATH=$CUDA_HOME/bin:$PATH
186
+ export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
187
+ echo -e "cuda version:
188
+ $(nvcc -V)"
189
+
190
+ # create conda env for X-SAM
191
+ conda create -n xsam python=3.10 -y
192
+ conda activate xsam
193
+ conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
194
+ # install gcc11(optional)
195
+ conda install gcc=11 gxx=11 -c conda-forge -y
196
+ # install xtuner0.2.0
197
+ pip install git+https://github.com/InternLM/xtuner.git@v0.2.0
198
+ cd xtuner
199
+ pip install '.[all]'
200
+ # install deepspeed
201
+ pip install -r requirements/deepspeed.txt
202
+ # install xsam requirements
203
+ pip install -r requirements/xsam.txt
204
+ # install flash-attention
205
+ pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
206
+
207
+ # install VLMEvalKit for evaluation on VLM benchmarks(optional)
208
+ cd $root_dir
209
+ git clone -b v0.3rc1 https://github.com/open-compass/VLMEvalKit.git
210
+ cd VLMEvalKit
211
+ pip install -e .
212
+
213
+ # install aria2 for downloading datasets and models(optional)
214
+ pip install aria2
215
+ ```
216
+
217
+ </details>
218
+
219
+ ### 3. Preparing
220
+ Many datasets and models need to be prepared. Please refer to [Data Preparing](docs/data_preparing.md) and [Model Preparing](docs/model_preparing.md) for details.
221
+
222
+ ### 4. Training & Evaluation
223
+ :sparkles: **One Script for All!**
224
+
225
+ <details>
226
+ <summary>🔥 Training (Click to expand)</summary>
227
+
228
+ Prepare the [Datasets](docs/data_preparing.md) and [Models](docs/model_preparing.md), and then refer to the following command to start training.
229
+
230
+ ```bash
231
+ cd $root_dir
232
+ bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix WORK_DIR_SUFFIX
233
+ ```
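When launching several runs from a script, it can help to assemble the `runs/run.sh` invocation programmatically. The sketch below only builds and prints the command. It uses the four flags shown above (`--modes`, `--config`, `--work-dir`, `--suffix`) and the mode names that appear in this README. The helper `build_run_command` is hypothetical, and treating `KNOWN_MODES` as the complete set of modes is an assumption.

```python
import shlex
from typing import List, Optional, Sequence

# Mode names that appear in this README; assumed, not confirmed, to be exhaustive.
KNOWN_MODES = ("train", "segeval", "vlmeval", "visualize")


def build_run_command(
    modes: Sequence[str],
    config: str,
    work_dir: Optional[str] = None,
    suffix: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `bash runs/run.sh` without executing it."""
    unknown = [m for m in modes if m not in KNOWN_MODES]
    if not modes or unknown:
        raise ValueError(f"expected modes from {KNOWN_MODES}, got {list(modes)}")
    # Multiple modes are passed as one comma-separated value, as in the README.
    cmd = ["bash", "runs/run.sh", "--modes", ",".join(modes), "--config", config]
    if work_dir:
        cmd += ["--work-dir", work_dir]
    if suffix:
        cmd += ["--suffix", suffix]
    return cmd


if __name__ == "__main__":
    # "path/to/config.py" is a placeholder, not a real config in the repository.
    cmd = build_run_command(["train", "segeval"], "path/to/config.py", suffix="debug")
    print(shlex.join(cmd))
```

Returning a list instead of a single string means the result can be passed directly to `subprocess.run(cmd, cwd=root_dir)` without shell quoting issues.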
234
+
235
+ ##### Stage 1: Segmentor Fine-tuning
236
+ ```bash
237
+ cd $root_dir
238
+ bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_seg_finetune/xsam_sam_large_m2f_e36_gpu16_seg_finetune.py
239
+ ```
240
+
241
+ ##### Stage 2: Alignment Pre-training
242
+ ```bash
243
+ cd $root_dir
244
+ bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_align_pretrain/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_e1_gpu16_align_pretrain.py
245
+ ```
246
+
247
+ ##### Stage 3: Mixed Fine-tuning
248
+ ```bash
249
+ # NOTE: Training code for Mixed Fine-tuning will be released once the repository has more than 500 🌟.
250
+ bash runs/run.sh --modes train,segeval,vlmeval,visualize --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py
251
+ ```
252
+
253
+ </details>
254
+
255
+
256
+ <details>
257
+ <summary>🧪 Evaluation (Click to expand)</summary>
258
+
259
+ Download the pre-trained model from [HuggingFace🤗](https://huggingface.co/hao9610/X-SAM) (details in [Model Preparing](docs/model_preparing.md)), and place it in the `$root_dir/inits` directory.
260
+
261
+ ```bash
262
+ cd $root_dir
263
+ bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix SUFFIX
264
+ ```
265
+
266
+ ##### Evaluate on all segmentation benchmarks
267
+ ```bash
268
+ cd $root_dir
269
+ # Evaluate on all segmentation benchmarks.
270
+ # NOTE: ONLY generic segmentation and VGD segmentation are supported NOW.
271
+ bash runs/run.sh --modes segeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
272
+ ```
273
+
274
+ ##### Evaluate on all VLM benchmarks
275
+ ```bash
276
+ cd $root_dir
277
+ # Evaluate on all VLM benchmarks.
278
+ bash runs/run.sh --modes vlmeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
279
+ ```
280
+
281
+ </details>
282
+
283
+ ## :computer: Demo
284
+ Coming soon...
285
+
286
+ ## :white_check_mark: TODO
287
+ - [x] Release the [Online Demo](http://47.115.200.157:7861).
288
+ - [x] Release the [Model Weights](https://huggingface.co/hao9610/X-SAM).
289
+ - [x] Release the [Technical Report](https://arxiv.org/abs/2508.04655).
290
+ - [ ] Release the code for training LLaVA-based MLLMs.
291
+ - [ ] Release the code for evaluation on all VLM Benchmarks.
292
+ - [ ] Release the code and instructions for demo deployment.
293
+ - [ ] Release the code for evaluation on all segmentation benchmarks.
294
+ - [ ] Release the code for training X-SAM (more than 500 🌟).
295
+
296
+ ## :blush: Acknowledgements
297
+ This project builds on several excellent open-source repositories ([xtuner](https://github.com/InternLM/xtuner), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [Sa2VA](https://github.com/magic-research/Sa2VA)). We thank their authors for their work and contributions to the community.
298
+
299
+ ## :pushpin: Citation
300
+ If you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
301
 
302
  ```bibtex
303
  @article{wang2025xsam,
 
306
  journal={arXiv preprint arXiv:2508.04655},
307
  year={2025}
308
  }
309
  ```