Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
This repository contains the weights for Dr. Seg-7B, as presented in the paper Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design.
Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a Look-to-Confirm mechanism and a Distribution-Ranked Reward module. Neither requires architectural modifications, and both integrate with existing GRPO-based VLLMs.
Links
- Paper: arXiv:2603.00152
- Dataset: COCONut
- Code: GitHub Repository
Model Description
Dr. Seg-7B is fine-tuned from Qwen2.5-VL-7B-Instruct using perception-oriented designs. Standard GRPO is often tailored for language reasoning; Dr. Seg instead targets the needs of visual perception by providing a broader output space and fine-grained, stable reward signals. According to the paper's experiments, Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
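For context on why reward granularity matters, the following is a minimal sketch of the group-relative advantage used in standard GRPO. It is not Dr. Seg's Distribution-Ranked Reward module. The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example reward values are all assumptions made for illustration; real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other responses sampled for the same prompt.

    Illustrative sketch of the standard GRPO advantage, A_i = (r_i - mean) / (std + eps).
    `eps` (an assumed value) guards against division by zero when all rewards tie.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
graded = group_relative_advantages([0.2, 0.5, 0.5, 0.8])

# Hypothetical group where every response earns the same coarse reward.
tied = group_relative_advantages([1.0, 1.0, 1.0])  # all zeros
```

When every response in a group receives the same reward, as can happen with coarse binary rewards, all advantages are zero and that group contributes no learning signal. Finer-grained rewards make such ties less likely.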
Citation
If you find this work useful, please cite:
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
Acknowledgements
This project builds upon several open-source efforts, including VisionReasoner, Seg-Zero, EasyR1, veRL, and COCONut-PanCap. We also utilize pretrained models from Qwen2.5-VL and SAM2.