hao05
/

Dr_Seg

Image-Text-to-Text

visual-reasoning

text-generation-inference

Model card Files Files and versions

Dr_Seg / README.md

nielsr's picture

nielsr HF Staff

Add model card and metadata for Dr. Seg

723c215 verified 3 days ago

|

2.4 kB

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- vllm
	- grpo
	- segmentation
	- detection
	- visual-reasoning
	---

	# Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

	This repository contains the weights for Dr. Seg-7B, as presented in the paper [Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design](https://arxiv.org/abs/2603.00152).

	Dr. Seg is a plug-and-play GRPO-based framework designed to adapt Visual Large Language Models (VLLMs) for visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs.

	## Links
	- Paper: [arXiv:2603.00152](https://arxiv.org/abs/2603.00152)
	- Code: [GitHub Repository](https://github.com/eVI-group-SCU/Dr-Seg)

	## Model Description
	Dr. Seg-7B is fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.

	## Citation
	If you find this work useful, please cite:
	```bibtex
	@article{sun2026dr,
	title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
	author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
	journal={arXiv preprint arXiv:2603.00152},
	year={2026}
	}
	```

	## Acknowledgements
	This project builds upon several open-source efforts, including [VisionReasoner](https://github.com/JIA-Lab-research/VisionReasoner), [Seg-Zero](https://github.com/JIA-Lab-research/Seg-Zero), [EasyR1](https://github.com/hiyouga/EasyR1), [veRL](https://github.com/volcengine/verl), and [COCONut-PanCap](https://github.com/bytedance/coconut_cvpr2024). We also utilize pretrained models from [Qwen2.5-VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) and [SAM2](https://huggingface.co/facebook/sam2-hiera-large).