P2R-4B

This repository contains the P2R-4B, introduced in Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning.

Model Description

P2R-4B is a fine-grained visual reasoning model built upon Qwen3-VL-4B-Instruct. It performs inference under the P2R framework, a two-stage visual reasoning framework that decouples perception from reasoning. Training is powered by PRA-GRPO, a role-aware alternating RL strategy.

Model Performance

Model	V-Star	HR-Bench-4K	HR-Bench-8K	MME-RealWorld-Lite
Qwen3-VL-Instruct-4B	81.7	73.8	67.0	47.7
P2R-4B	93.2	81.9	80.5	54.8
Δ	+11.5	+8.1	+13.5	+7.1

Usage

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained("hongxingli/P2R-4B")
processor = AutoProcessor.from_pretrained("hongxingli/P2R-4B")

For the full two-stage P2R inference pipeline, please refer to our code repository.

Citation

@misc{li2026perceivetoreasondecouplingperceptionreasoning,
      title={Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning}, 
      author={Hongxing Li and Xiufeng Huang and Dingming Li and Wenjing Jiang and Zixuan Wang and Haolei Xu and Hanrong Zhang and Haiwen Hong and Longtao Huang and Hui Xue and Weiming Lu and Jun Xiao and Yueting Zhuang and Yongliang Shen},
      year={2026},
      eprint={2607.01191},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2607.01191}, 
}