# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("hao05/Dr_Seg")
model = AutoModelForImageTextToText.from_pretrained("hao05/Dr_Seg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
This repository contains the weights for Dr. Seg-7B, as presented in the paper Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design.
Dr. Seg is a plug-and-play GRPO-based framework for adapting Visual Large Language Models (VLLMs) to visual perception tasks such as reasoning segmentation and object detection. It introduces two key components: a Look-to-Confirm mechanism and a Distribution-Ranked Reward module. Neither requires architectural modifications, so the framework integrates with existing GRPO-based VLLMs.
Links
- Paper: arXiv:2603.00152
- Dataset: COCONut
- Code: GitHub Repository
Model Description
Dr. Seg-7B is fine-tuned from Qwen2.5-VL-7B-Instruct using perception-oriented designs. While standard GRPO is often tailored for language reasoning, Dr. Seg addresses the specific needs of visual perception by providing a broader output space and fine-grained, stable reward signals. Experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization.
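For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage normalization that GRPO-style training generally relies on: each sampled response's reward is standardized against the other responses drawn for the same prompt. This is a generic illustration, not Dr. Seg's Distribution-Ranked Reward module, whose details are not given in this card. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Hypothetical helper for illustration only; it is not taken from
    the Dr. Seg codebase.
    """
    mu = mean(rewards)
    # Assumption: population standard deviation; implementations differ.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed binary rewards yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward yields all zeros,
# so that group contributes no learning signal.
flat = group_relative_advantages([0.3, 0.3, 0.3])
```

The second case hints at why coarse rewards can be a problem for this family of methods: when all responses in a group score identically, the advantages vanish. Finer-grained reward signals make such ties less likely, which is consistent with the motivation stated above.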
Citation
If you find this work useful, please cite:
@article{sun2026dr,
  title={Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design},
  author={Sun, Haoxiang and Wang, Tao and Tang, Chenwei and Yuan, Li and Lv, Jiancheng},
  journal={arXiv preprint arXiv:2603.00152},
  year={2026}
}
Acknowledgements
This project builds upon several open-source efforts, including VisionReasoner, Seg-Zero, EasyR1, veRL, and COCONut-PanCap. We also utilize pretrained models from Qwen2.5-VL and SAM2.
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="hao05/Dr_Seg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
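The `pipe(text=messages)` call returns a result object rather than printing the answer. The sketch below shows one way to pull the reply text out of it. It assumes the result is a list of dicts with a `generated_text` key whose value is either a plain string or a chat-style list of messages ending with the assistant turn. The exact shape can vary across transformers versions, so treat `extract_reply` as a hypothetical helper and inspect the raw output first.

```python
def extract_reply(result):
    """Return the assistant's text from a pipeline result.

    Hypothetical helper; the result layout is an assumption and
    should be checked against the installed transformers version.
    """
    generated = result[0]["generated_text"]
    if isinstance(generated, str):
        # Plain-string output: nothing to unpack.
        return generated
    # Chat-style output: assume the last message is the assistant's reply.
    content = generated[-1]["content"]
    if isinstance(content, str):
        return content
    # Content may also be a list of typed parts; keep only the text parts.
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )


# Mocked results in the two assumed shapes (no model download needed).
mock_chat = [{"generated_text": [
    {"role": "user", "content": [{"type": "text", "text": "What animal is on the candy?"}]},
    {"role": "assistant", "content": "A turtle."},
]}]
mock_plain = [{"generated_text": "A turtle."}]

reply_chat = extract_reply(mock_chat)
reply_plain = extract_reply(mock_plain)
```

With a real pipeline, the equivalent call would be `extract_reply(pipe(text=messages))`.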