---
language:
- en
license: apache-2.0
tags:
- MLLM
pipeline_tag: image-segmentation
library_name: transformers
---

# ✨ X-SAM

**From Segment Anything to Any Segmentation**

[Hao Wang](https://github.com/wanghao9610)<sup>1,2</sup>, [Limeng Qiao](https://scholar.google.com/citations?user=3PFZAg0AAAAJ&hl=en)<sup>3</sup>, [Zequn Jie](https://scholar.google.com/citations?user=4sKGNB0AAAAJ&hl)<sup>3</sup>, [Zhijian Huang](https://zhijian11.github.io/)<sup>1</sup>, [Chengjian Feng](https://fcjian.github.io/)<sup>3</sup>, [Qingfang Zheng](https://openreview.net/profile?id=%7EZheng_Qingfang1)<sup>1</sup>, [Lin Ma](https://forestlinma.com/)<sup>3</sup>, [Xiangyuan Lan](https://scholar.google.com/citations?user=c3iwWRcAAAAJ&hl)<sup>2</sup> 📧, [Xiaodan Liang](https://scholar.google.com/citations?user=voxznZAAAAAJ&hl)<sup>1</sup> 📧

<sup>1</sup> Sun Yat-sen University, <sup>2</sup> Peng Cheng Laboratory, <sup>3</sup> Meituan Inc.

📧 Corresponding author
## :boom: Updates

- **`2025-08-06`**: Released the [Technical Report](https://arxiv.org/pdf/2508.04655).
- **`2025-08-05`**: Released the [Model Weights](https://huggingface.co/hao9610/X-SAM).
- **`2025-07-26`**: Released the [Online Demo](http://47.115.200.157:7861).

## :rocket: Introduction

* X-SAM introduces a unified multimodal large language model (MLLM) framework, extending the segmentation paradigm from *segment anything* to *any segmentation*, thereby enhancing pixel-level perceptual understanding.
* X-SAM proposes a novel Visual GrounDed (VGD) segmentation task, which segments all instance objects using interactive visual prompts, empowering the model with visually grounded, pixel-wise interpretative capabilities.
* X-SAM presents a unified training strategy that enables co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency in multimodal, pixel-level visual understanding.

:sparkles: **HIGHLIGHT**: This repository provides unified and effective code for training, evaluation, and visualization of segmentation MLLMs, including LLaVA-based MLLMs. We hope this repository will promote further research on MLLMs.

*If you have any questions, please feel free to open an issue or [contact me](mailto:wanghao9610@gmail.com).*

## :bookmark: Abstract

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture.
To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from *segment anything* to *any segmentation*. Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

## 💻 Usage

This model can be used with the Hugging Face `transformers` library.

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Load the model and processor. Ensure you have `bfloat16` support or adjust `torch_dtype`.
model_id = "hao9610/X-SAM"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Move the model to GPU if available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Example image and text prompt for Visual GrounDed (VGD) segmentation.
# Replace "path/to/your/image.jpg" with an actual image file path.
# For a sample image, you can download one from the project's GitHub repo, e.g.,
# https://github.com/wanghao9610/X-SAM/blob/main/docs/images/xsam_framework.png
# and save it as "example_image.png".
image = Image.open("path/to/your/image.jpg").convert("RGB")
prompt = "Segment all instances in this image and provide their bounding box coordinates."

# Prepare messages for the model's chat template.
messages = [
    {"role": "user", "content": [{"type": "image", "image": image}, {"type": "text", "text": prompt}]}
]

# Apply the chat template and process the inputs.
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text_input], images=[image], return_tensors="pt")

# Move the inputs to the same device as the model.
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate the output.
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Decode the generated text.
# The output will include special tokens for bounding boxes, e.g., (x1,y1,x2,y2).
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False)[0]
print(generated_text)
# Expected output might look like: "object1 (x1,y1,x2,y2) object2 (x1,y1,x2,y2)"
```

## :mag: Overview

## :bar_chart: Benchmarks

Please refer to the [Benchmark Results](docs/benchmark_results.md) for more details.

## :checkered_flag: Getting Started

### 1. Structure

We provide a detailed project structure for X-SAM. Please follow this structure to organize the project.
πŸ“ Structure (Click to expand) ```bash X-SAM β”œβ”€β”€ datas β”‚Β Β  β”œβ”€β”€ gcg_seg_data β”‚Β Β  β”œβ”€β”€ gen_seg_data β”‚Β Β  β”œβ”€β”€ img_conv_data β”‚Β Β  β”œβ”€β”€ inter_seg_data β”‚Β Β  β”œβ”€β”€ LMUData β”‚Β Β  β”œβ”€β”€ ov_seg_data β”‚Β Β  β”œβ”€β”€ rea_seg_data β”‚Β Β  β”œβ”€β”€ ref_seg_data β”‚Β Β  └── vgd_seg_data β”œβ”€β”€ inits β”‚Β Β  β”œβ”€β”€ huggingface β”‚Β Β  β”œβ”€β”€ mask2former-swin-large-coco-panoptic β”‚Β Β  β”œβ”€β”€ Phi-3-mini-4k-instruct β”‚Β Β  β”œβ”€β”€ sam-vit-large β”‚Β Β  └── xsam β”œβ”€β”€ xsam β”‚Β Β  β”œβ”€β”€ docs β”‚Β Β  β”œβ”€β”€ requirements β”‚Β Β  β”œβ”€β”€ xsam β”‚Β Β  β”‚Β Β  β”œβ”€β”€ configs β”‚Β Β  β”‚Β Β  β”œβ”€β”€ dataset β”‚Β Β  β”‚Β Β  β”œβ”€β”€ demo β”‚Β Β  β”‚Β Β  β”œβ”€β”€ engine β”‚Β Β  β”‚Β Β  β”œβ”€β”€ evaluation β”‚Β Β  β”‚Β Β  β”œβ”€β”€ model β”‚Β Β  β”‚Β Β  β”œβ”€β”€ structures β”‚Β Β  β”‚Β Β  β”œβ”€β”€ tools β”‚Β Β  β”‚ └── utils β”œβ”€β”€ wkdrs β”‚Β Β  β”œβ”€β”€ s1_seg_finetune β”‚ β”‚ β”œβ”€β”€ ... β”‚Β Β  β”œβ”€β”€ s2_align_pretrain β”‚ β”‚ β”œβ”€β”€ ... β”‚Β Β  β”œβ”€β”€ s2_mixed_finetune β”‚ β”‚ β”œβ”€β”€ ... β”‚ β”œβ”€β”€ ... ... ```
### 2. Installation

We provide a detailed installation guide for creating an environment for X-SAM; please follow the steps below.
βš™οΈ Guide (Click to expand) ```bash cd X-SAM export root_dir=$(realpath ./) cd $root_dir/xsam # Optional: set CUDA_HOME for cuda12.4. # X-SAM utilizes the cuda12.4 default, if your cuda is not cuda12.4, you need first export CUDA_HOME env manually. export CUDA_HOME="your_cuda12.4_path" export PATH=$CUDA_HOME/bin:$PATH export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH echo -e "cuda version: $(nvcc -V)" # create conda env for X-SAM conda create -n xsam python=3.10 -y conda activate xsam conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia # install gcc11(optional) conda install gcc=11 gxx=11 -c conda-forge -y # install xtuner0.2.0 pip install git+https://github.com/InternLM/xtuner.git@v0.2.0 cd xtuner pip install '.[all]' # install deepspeed pip install -r requirements/deepspeed.txt # install xsam requirements pip install -r requirements/xsam.txt # install flash-attention pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl # install VLMEvalKit for evaluation on VLM benchmarks(optional) cd $root_dir git clone -b v0.3rc1 https://github.com/open-compass/VLMEvalKit.git cd VLMEvalKit pip install -e . # install aria2 for downloading datasets and models(optional) pip install aria2 ```
### 3. Preparing

There are many datasets and models to prepare; please refer to [Data Preparing](docs/data_preparing.md) and [Model Preparing](docs/model_preparing.md) for more details.

### 4. Training & Evaluation

:sparkles: **One Script for All!**
🔥 Training (Click to expand)

Prepare the [Datasets](docs/data_preparing.md) and [Models](docs/model_preparing.md), then refer to the following command to start training.

```bash
cd $root_dir
bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix WORK_DIR_SUFFIX
```

##### Stage 1: Segmentor Fine-tuning

```bash
cd $root_dir
bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s1_seg_finetune/xsam_sam_large_m2f_e36_gpu16_seg_finetune.py
```

##### Stage 2: Alignment Pre-training

```bash
cd $root_dir
bash runs/run.sh --modes train --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s2_align_pretrain/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_e1_gpu16_align_pretrain.py
```

##### Stage 3: Mixed Fine-tuning

```bash
# NOTE: Training for Mixed Fine-tuning will be available with more than 500 🌟.
bash runs/run.sh --modes train,segeval,vlmeval,visualize --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py
```
🧪 Evaluation (Click to expand)

Download the pre-trained model from [Hugging Face 🤗](https://huggingface.co/hao9610/X-SAM) (details in [Model Preparing](docs/model_preparing.md)) and put it in the $root_dir/inits directory.

```bash
cd $root_dir
bash runs/run.sh --modes MODES --config CONFIG_FILE --work-dir WORK_DIR --suffix SUFFIX
```

##### Evaluate on all segmentation benchmarks

```bash
cd $root_dir
# Evaluate on all segmentation benchmarks.
# NOTE: ONLY generic segmentation and VGD segmentation are supported NOW.
bash runs/run.sh --modes segeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
```

##### Evaluate on all VLM benchmarks

```bash
cd $root_dir
# Evaluate on all VLM benchmarks.
bash runs/run.sh --modes vlmeval --config xsam/configs/xsam/phi3_mini_4k_instruct_siglip2_so400m_p14_384/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune.py --work-dir $root_dir/inits/X-SAM/s3_mixed_finetune/xsam_phi3_mini_4k_instruct_siglip2_so400m_p14_384_sam_large_m2f_gpu16_mixed_finetune
```
## :computer: Demo

Coming soon...

## :white_check_mark: TODO

- [x] Release the [Online Demo](http://47.115.200.157:7861).
- [x] Release the [Model Weights](https://huggingface.co/hao9610/X-SAM).
- [x] Release the [Technical Report](https://arxiv.org/abs/2508.04655).
- [ ] Release the code for training LLaVA-based MLLMs.
- [ ] Release the code for evaluation on all VLM benchmarks.
- [ ] Release the code and instructions for demo deployment.
- [ ] Release the code for evaluation on all segmentation benchmarks.
- [ ] Release the code for training X-SAM (more than 500 🌟).

## :blush: Acknowledgements

This project references several excellent open-source repositories ([xtuner](https://github.com/InternLM/xtuner), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), [Sa2VA](https://github.com/magic-research/Sa2VA)). Thanks for their wonderful work and contributions to the community.

## :pushpin: Citation

If you find X-SAM helpful for your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@article{wang2025xsam,
  title={X-SAM: From Segment Anything to Any Segmentation},
  author={Wang, Hao and Qiao, Limeng and Jie, Zequn and Huang, Zhijian and Feng, Chengjian and Zheng, Qingfang and Ma, Lin and Lan, Xiangyuan and Liang, Xiaodan},
  journal={arXiv preprint arXiv:2508.04655},
  year={2025}
}
```