Vision-R1-7B / README.md
nielsr's picture
nielsr HF Staff
Improve model card: add metadata, paper/code links, performance and usage
adb94c1 verified
|
raw
history blame
2.87 kB
metadata
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text

Vision-R1-7B

Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models.

The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:

  1. Cold-start Initialization: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
  2. Reinforcement Learning: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.

Links

Performance

Model MathVista MathVerse MM-Math DynaMath (Overall; Avg) AVG.
Qwen2.5-VL-7B 68.1 46.7 34.1 50.7 47.9
Vision-R1-7B (Ours) 73.5 (+5.4) 52.4 (+5.7) 40.2 (+6.1) 56.3 (+5.6) 53.8 (+5.9)

Usage

Using 🤗 Transformers

To run inference using the script provided in the official repository:

# Inference script for Vision-R1-7B model using transformers
MODEL_PATH="Osilly/Vision-R1-7B" 
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH}  \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}

Citation

@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}