---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vllm
- grpo
- segmentation
- detection
- visual-reasoning
---
# Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
This repository contains the weights for **Dr. Seg-7B**, as presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152).
Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a **Look-to-Confirm** mechanism and a **Distribution-Ranked Reward** module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs.
## Links
- **Paper:** [arXiv:2603.00152](https://arxiv.org/abs/2603.00152)
- **Dataset:** [COCONut](https://huggingface.co/datasets/hao05/coconut)
- **Code:** [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg)
## Model Description
Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
## Citation
If you find this work useful, please cite:
```bibtex
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
```
## Acknowledgements
This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also utilize pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).