UltraDoughnut/SELF1E

Add model card and metadata for SELF1E

by nielsr HF Staff - opened Mar 20

base: refs/heads/main ← from: refs/pr/1


README.md +42 -3


@@ -1,3 +1,42 @@
----
-license: mit
----

+---
+license: mit
+library_name: transformers
+pipeline_tag: image-segmentation
+---
+# SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token
+This repository contains the weights for **SELF1E** (**S**egmentation **E**mbedding from MLLM it**SELF** with **1** token), an approach that enables Multi-modal Large Language Models (MLLMs) to perform high-quality segmentation without external specialist decoders.
+- **Paper:** [Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token](https://huggingface.co/papers/2603.19026)
+- **GitHub Repository:** [https://github.com/ANDYZAQ/SELF1E](https://github.com/ANDYZAQ/SELF1E)
+## Highlights
+- ✅ **No external expert decoder** for text-guided referring segmentation.
+- ✅ **Only 1 `[SEG]` token** for segmentation.
+- ✅ **Competitive results** while eliminating the need for external decoders (like SAM).
+- 🚀 A step forward for integrating segmentation ability directly inside MLLMs.
+## Introduction
+SELF1E investigates whether and how segmentation ability can be unlocked from the MLLM it**SELF** with **1** segmentation **E**mbedding while achieving competitive results. The approach addresses a fundamental limitation, the reduced resolution of pixel-shuffled image features in MLLMs, by:
+1. Retaining image features at their original uncompressed resolution and refilling them with residual features.
+2. Integrating pixel-unshuffle operations to recover fine detail.
+3. Redesigning the attention mask with dual perception pathways (image-to-image and image-to-segmentation).
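The three points above can be illustrated with a small, dependency-free sketch. It is not taken from the SELF1E codebase: all function names and shapes are hypothetical. It assumes that the resolution-reducing "pixel shuffle" folds each r×r spatial block into channels, that the inverse operation restores the original grid, and that the "dual perception pathways" let image tokens attend to every image token and to the `[SEG]` token on top of an otherwise causal mask. See the GitHub repository for the actual implementation.

```python
# Illustrative sketch only: names, shapes, and mask rules are assumptions,
# not the SELF1E implementation.

def space_to_depth(feat, r):
    """Fold each r x r spatial block into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    out = []
    for ch in range(c):
        for dy in range(r):
            for dx in range(r):
                out.append([[feat[ch][y * r + dy][x * r + dx]
                             for x in range(w // r)]
                            for y in range(h // r)])
    return out


def depth_to_space(feat, r):
    """Inverse of space_to_depth: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_in // (r * r)
    out = [[[None] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c):
        for dy in range(r):
            for dx in range(r):
                src = feat[ch * r * r + dy * r + dx]
                for y in range(h):
                    for x in range(w):
                        out[ch][y * r + dy][x * r + dx] = src[y][x]
    return out


def build_attention_mask(token_types):
    """token_types: list of 'img', 'txt', or 'seg' in sequence order.

    Returns mask[q][k] == True when query token q may attend to key token k.
    Assumed rule: causal by default; image queries additionally see all
    image tokens (image-to-image) and the [SEG] token (image-to-segmentation).
    """
    n = len(token_types)
    mask = [[k <= q for k in range(n)] for q in range(n)]  # causal baseline
    for q in range(n):
        if token_types[q] != "img":
            continue
        for k in range(n):
            if token_types[k] in ("img", "seg"):
                mask[q][k] = True
    return mask


# One channel, 4x4 grid with values 0..15.
feat = [[[y * 4 + x for x in range(4)] for y in range(4)]]
small = space_to_depth(feat, 2)      # 4 channels, 2x2 grid: fewer positions
restored = depth_to_space(small, 2)  # back to 1 channel, 4x4 grid

types = ["img", "img", "txt", "seg"]
mask = build_attention_mask(types)
```

Here `small` holds the same values on a 2×2 grid, which is the resolution loss the list refers to, and `restored` equals `feat` because the rearrangement itself is lossless. In `mask`, the first image token can see the later image token and the `[SEG]` token, while the text token remains causal.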
+## Citation
+If you find this project useful in your research, please consider citing:
+```bibtex
+@inproceedings{zhang2026self1e,
+  author = {Zhang, Anqi and Ji, Xiaokang and Gao, Guangyu and Jiao, Jianbo and Liu, Chi Harold and Wei, Yunchao},
+  title = {SELF1E: Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token},
+  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
+  year = {2026},
+}
+```
+## Acknowledgement
+This work builds on the [LISA](https://github.com/JIA-Lab-research/LISA) framework, and some of the training settings are borrowed from [PSALM](https://github.com/zamling/PSALM).