---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vllm
- grpo
- segmentation
- detection
- visual-reasoning
---

# Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

This repository contains the weights for **Dr. Seg-7B**, presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152).

Dr. Seg is a plug-and-play GRPO-based framework that adapts Visual Large Language Models (VLLMs) to visual perception tasks such as reasoning segmentation and object detection. It introduces two key components, a **Look-to-Confirm** mechanism and a **Distribution-Ranked Reward** module, requires no architectural modifications, and integrates seamlessly with existing GRPO-based VLLMs.

## Links

- **Paper:** [arXiv:2603.00152](https://arxiv.org/abs/2603.00152)
- **Dataset:** [COCONut](https://huggingface.co/datasets/hao05/coconut)
- **Code:** [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg)

## Model Description

Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is typically tailored to language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments show that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.

## Citation

If you find this work useful, please cite:

```bibtex
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
```

## Acknowledgements

This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also use pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).
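## Usage (sketch)

Because Dr. Seg-7B is fine-tuned from Qwen2.5-VL-7B-Instruct with no architectural changes, inference should follow the standard Qwen2.5-VL chat workflow in `transformers`. The snippet below is a minimal sketch under that assumption: the model id `MODEL_ID` is a placeholder, the prompt wording is illustrative, and any post-processing of the predicted regions (e.g. passing them to SAM2 to obtain masks) is omitted. Consult the official GitHub repository for the exact prompt format and pipeline.

```python
from typing import Any, Dict, List

# Placeholder Hub id (assumption): substitute the actual repository id of this model.
MODEL_ID = "Dr-Seg-7B"

def build_messages(image_path: str, query: str) -> List[Dict[str, Any]]:
    """Assemble a single-turn chat message in the Qwen2.5-VL format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": query},
            ],
        }
    ]

def run(image_path: str, query: str, max_new_tokens: int = 512) -> str:
    """Run one round of generation. Heavy imports are deferred so the
    message helper above stays usable without these dependencies."""
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    messages = build_messages(image_path, query)
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    generated = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

For example, `run("street.jpg", "Segment the red car closest to the camera.")` would return the model's textual reasoning and region prediction; the region would then be handed to SAM2 for the final mask, as in the paper's pipeline.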