| | --- |
| | license: apache-2.0 |
| | library_name: transformers |
| | pipeline_tag: image-text-to-text |
| | tags: |
| | - vllm |
| | - grpo |
| | - segmentation |
| | - detection |
| | - visual-reasoning |
| | --- |
| | |
| | # Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design |
| |
|
| | This repository contains the weights for **Dr. Seg-7B**, as presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152). |
| |
|
| | Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a **Look-to-Confirm** mechanism and a **Distribution-Ranked Reward** module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. |
| |
|
| | ## Links |
| | - **Paper:** [arXiv:2603.00152](https://arxiv.org/abs/2603.00152) |
| | - **Code:** [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg) |
| |
|
| | ## Model Description |
| | Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization. |
| |
|
| | ## Citation |
| | If you find this work useful, please cite: |
| | ```bibtex |
| | @article{sun2026dr, |
| | title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design}, |
| | author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng}, |
| | journal={arXiv preprint arXiv:2603.00152}, |
| | year={2026} |
| | } |
| | ``` |
| |
|
| | ## Acknowledgements |
| | This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also utilize pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large). |