nielsr HF Staff committed on
Commit
723c215
·
verified ·
1 Parent(s): 7ecdc66

Add model card and metadata for Dr. Seg


Hi! I'm Niels from the Hugging Face community science team. I've opened this PR to improve the model card for Dr. Seg.

Changes include:
- Added metadata for `pipeline_tag: image-text-to-text` and `library_name: transformers`.
- Added relevant tags for better discoverability.
- Linked the model to the official paper and GitHub repository.
- Included a descriptive summary and the official BibTeX citation.

These updates help researchers find and attribute your work correctly on the Hub!
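With the `image-text-to-text` pipeline tag in place, researchers can also try the model through the standard `transformers` pipeline API. A minimal sketch, with the caveat that the repo id below is a placeholder (not the actual Hub id) and that the call assumes the weights follow the usual Qwen2.5-VL chat format:

```python
# Sketch: querying the model via the transformers image-text-to-text pipeline.
# NOTE: "your-org/Dr-Seg-7B" is a placeholder repo id, not the real Hub id.

def build_messages(image_url: str, prompt: str) -> list:
    """Build a single-turn chat message in the Qwen2.5-VL content-list format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]

def run(image_url: str, prompt: str) -> str:
    # Lazy import so build_messages stays usable without transformers installed.
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model="your-org/Dr-Seg-7B")
    out = pipe(text=build_messages(image_url, prompt), max_new_tokens=128)
    return out[0]["generated_text"]

# Example call (downloads the weights on first use):
# run("https://example.com/street.jpg", "Segment every pedestrian.")
```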

Files changed (1)
  1. README.md +38 -3
README.md CHANGED
@@ -1,3 +1,38 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - vllm
+ - grpo
+ - segmentation
+ - detection
+ - visual-reasoning
+ ---
+
+ # Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
+
+ This repository contains the weights for **Dr. Seg-7B**, as presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152).
+
+ Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a **Look-to-Confirm** mechanism and a **Distribution-Ranked Reward** module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs.
+
+ ## Links
+ - **Paper:** [arXiv:2603.00152](https://arxiv.org/abs/2603.00152)
+ - **Code:** [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg)
+
+ ## Model Description
+ Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
+
+ ## Citation
+ If you find this work useful, please cite:
+ ```bibtex
+ @article{sun2026dr,
+   title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
+   author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
+   journal={arXiv preprint arXiv:2603.00152},
+   year={2026}
+ }
+ ```
+
+ ## Acknowledgements
+ This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also utilize pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).