Vision-R1-72B / README.md

nielsr HF Staff

Improve model card and add metadata

f384e06 verified 7 days ago

2.71 kB

license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - reasoning
  - math
  - qwen2.5-vl
  - reinforcement-learning

Vision-R1-72B

Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B parameter version.

Paper: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
GitHub: Osilly/Vision-R1

Performance

Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:

Model	MathVista	MathVerse	MathVerse (mini Vision_Only)	MM-Math	DynaMath (Overall; Avg)	AVG.
Qwen2.5-VL-72B	73.5	51.3	47.3	45.6	61.2	55.8
*Vision-R1-72B (Ours)**	78.2 (+4.7)	63.2 (+11.9)	57.9 (+10.6)	59.3 (+13.7)	66.4 (+5.2)	65 (+9.2)

*: Vision-R1-72B used additional data in RL training.

Quickstart

Using 🤗 Transformers for Inference

You can run inference using the scripts provided in the official repository. First, install the requirements:

pip install -r requirements.txt
# Optional: install Flash Attention 2
pip install -U flash-attn --no-build-isolation

Then, run the inference script:

MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH}  \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}

Citation

If you find our work helpful, please consider citing it:

@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}