Add model card and metadata for Dr. Seg
Hi! I'm Niels from the Hugging Face community science team. I've opened this PR to improve the model card for Dr. Seg.
Changes include:
- Added metadata for `pipeline_tag: image-text-to-text` and `library_name: transformers`.
- Added relevant tags for better discoverability.
- Linked the model to the official paper and GitHub repository.
- Included a descriptive summary and the official BibTeX citation.
These updates help researchers find and attribute your work correctly on the Hub!
README.md CHANGED

@@ -1,3 +1,38 @@

Removed (previous contents):

---
license: apache-2.0
---

Added (new contents):

---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vllm
- grpo
- segmentation
- detection
- visual-reasoning
---

# Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

This repository contains the weights for **Dr. Seg-7B**, as presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152).

Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a **Look-to-Confirm** mechanism and a **Distribution-Ranked Reward** module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs.

## Links

- **Paper:** [arXiv:2603.00152](https://arxiv.org/abs/2603.00152)
- **Code:** [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg)

## Model Description

Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
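
Since Dr. Seg-7B is fine-tuned from Qwen2.5-VL-7B-Instruct with no architectural modifications, inference should follow the usual `transformers` pattern for Qwen2.5-VL models. The sketch below is an assumption-laden example, not the official usage: the repo id `eVI-group-SCU/Dr-Seg-7B`, the image path, and the query text are placeholders, and the exact prompt format Dr. Seg expects for segmentation/detection queries should be checked in the GitHub repository.

```python
# A minimal inference sketch, assuming Dr. Seg-7B keeps the standard
# Qwen2.5-VL chat interface (the paper states no architectural changes).
# NOTE: "eVI-group-SCU/Dr-Seg-7B" is a *hypothetical* repo id; check the
# Hub page for the actual identifier.

def build_messages(image_path: str, query: str) -> list:
    """Build a single-turn chat message with one image and one text prompt."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": query},
        ],
    }]

def run_inference(image_path: str, query: str,
                  model_id: str = "eVI-group-SCU/Dr-Seg-7B") -> str:
    # Imported lazily so build_messages works without transformers installed.
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # The processor's chat template tokenizes the text and processes the image.
    inputs = processor.apply_chat_template(
        build_messages(image_path, query),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding the answer.
    answer_ids = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(answer_ids, skip_special_tokens=True)[0]
```

Depending on your `transformers` version, you may instead prefer the `qwen_vl_utils.process_vision_info` flow from the official Qwen2.5-VL examples; both approaches consume the same chat-format messages.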

## Citation

If you find this work useful, please cite:

```bibtex
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
```

## Acknowledgements

This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also utilize pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).