Vision-R1-7B / README.md

nielsr HF Staff

Improve model card: add metadata, paper/code links, performance and usage

adb94c1 verified 7 days ago

2.87 kB

license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text

Vision-R1-7B

Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models.

The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:

Cold-start Initialization: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
Reinforcement Learning: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.

Performance

Model	MathVista	MathVerse	MM-Math	DynaMath (Overall; Avg)	AVG.
Qwen2.5-VL-7B	68.1	46.7	34.1	50.7	47.9
Vision-R1-7B (Ours)	73.5 (+5.4)	52.4 (+5.7)	40.2 (+6.1)	56.3 (+5.6)	53.8 (+5.9)

Usage

Using 🤗 Transformers

To run inference using the script provided in the official repository:

# Inference script for Vision-R1-7B model using transformers
MODEL_PATH="Osilly/Vision-R1-7B" 
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH}  \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}

Citation

@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}

Osilly
/

Vision-R1-7B

Vision-R1-7B

Links

Performance

Usage

Using 🤗 Transformers

Citation