---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vllm
- grpo
- segmentation
- detection
- visual-reasoning
---

# Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

This repository contains the weights for **Dr. Seg-7B**, as presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152).

Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a **Look-to-Confirm** mechanism and a **Distribution-Ranked Reward** module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs.

## Links
- **Paper:** [arXiv:2603.00152](https://arxiv.org/abs/2603.00152)
- **Dataset:** [COCONut](https://huggingface.co/datasets/hao05/coconut)
- **Code:** [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg)

## Model Description
Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
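Since Dr. Seg-7B is fine-tuned from Qwen2.5-VL-7B-Instruct with no architectural modifications, it should load with the standard Qwen2.5-VL classes in `transformers`. The sketch below is a hypothetical usage example, not an official recipe: `MODEL_ID` is a placeholder to replace with this repository's id, and the query wording is illustrative.

```python
# Hypothetical inference sketch for Dr. Seg-7B (assumes it loads like its
# base model, Qwen2.5-VL-7B-Instruct). MODEL_ID is a placeholder.
MODEL_ID = "path/to/Dr-Seg-7B"  # replace with this repository's id


def build_messages(image_path, query):
    """Build a Qwen2.5-VL-style chat message pairing an image with a
    perception query (e.g. a reasoning-segmentation instruction)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": query},
            ],
        }
    ]


def run_inference(image_path, query, max_new_tokens=512):
    """Download the weights and generate a response. Heavy: requires the
    transformers and qwen-vl-utils packages plus a GPU for 7B inference."""
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )

    messages = build_messages(image_path, query)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

For segmentation, the model's textual output (e.g. predicted boxes or points) would then be passed to a mask decoder such as SAM2, following the pipeline in the GitHub repository; see the code there for the exact prompt format and post-processing.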

## Citation
If you find this work useful, please cite:
```bibtex
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
```

## Acknowledgements
This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also utilize pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).