metadata
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
Vision-R1-7B
Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models.
The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:
- Cold-start Initialization: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
- Reinforcement Learning: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.
Links
- Paper: https://huggingface.co/papers/2503.06749
- Code: https://github.com/Osilly/Vision-R1
- Cold-start Dataset: Osilly/Vision-R1-cold
- RL Dataset: Osilly/Vision-R1-rl
Performance
| Model | MathVista | MathVerse | MM-Math | DynaMath (Overall; Avg) | AVG. |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 68.1 | 46.7 | 34.1 | 50.7 | 47.9 |
| Vision-R1-7B (Ours) | 73.5 (+5.4) | 52.4 (+5.7) | 40.2 (+6.1) | 56.3 (+5.6) | 53.8 (+5.9) |
Usage
Using 🤗 Transformers
To run inference using the script provided in the official repository:
# Inference script for Vision-R1-7B model using transformers
MODEL_PATH="Osilly/Vision-R1-7B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"
python3 inference.py \
--model_path ${MODEL_PATH} \
--enable_flash_attn True \
--image_path ${IMAGE_PATH} \
--prompt "${PROMPT}" \
--max_tokens ${MAX_TOKENS} \
--temperature ${TEMP} \
--top_p ${TOP_P}
Citation
@article{huang2025visionr1,
title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
journal={arXiv preprint arXiv:2503.06749},
year={2025}
}