Vision-R1-72B / README.md
nielsr's picture
nielsr HF Staff
Improve model card and add metadata
f384e06 verified
|
raw
history blame
2.71 kB
metadata
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - reasoning
  - math
  - qwen2.5-vl
  - reinforcement-learning

Vision-R1-72B

Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B parameter version.

Performance

Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:

Model MathVista MathVerse MathVerse (mini Vision_Only) MM-Math DynaMath (Overall; Avg) AVG.
Qwen2.5-VL-72B 73.5 51.3 47.3 45.6 61.2 55.8
Vision-R1-72B* (Ours) 78.2 (+4.7) 63.2 (+11.9) 57.9 (+10.6) 59.3 (+13.7) 66.4 (+5.2) 65 (+9.2)

*: Vision-R1-72B used additional data in RL training.

Quickstart

Using 🤗 Transformers for Inference

You can run inference using the scripts provided in the official repository. First, install the requirements:

pip install -r requirements.txt
# Optional: install Flash Attention 2
pip install -U flash-attn --no-build-isolation

Then, run the inference script:

MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH}  \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}

Citation

If you find our work helpful, please consider citing it:

@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}