---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- reasoning
- math
- qwen2.5-vl
- reinforcement-learning
---
# Vision-R1-72B
Vision-R1 is a reasoning multimodal large language model (MLLM) trained with Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy to strengthen its reasoning capabilities. This repository contains the 72B-parameter version.
- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
- **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
## Performance
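Vision-R1-72B outperforms Qwen2.5-VL-72B on every multimodal math reasoning benchmark listed below, with an average gain of 9.2 points (gains over Qwen2.5-VL-72B are shown in parentheses):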
| Model | MathVista | MathVerse | MathVerse (mini Vision_Only) | MM-Math | DynaMath (Overall; Avg) | AVG. |
| -------------------------- | ----------- | ------------ | ---------------------------- | ------------ | ----------------------- | ------------ |
| Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 |
| **Vision-R1-72B\* (Ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65.0 (+9.2) |
\*: Vision-R1-72B used additional data in RL training.
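
The AVG. column and the parenthesized gains can be recomputed from the per-benchmark scores. The short Python check below is an illustrative sketch, not part of the official repository. It assumes AVG. is the unweighted mean of the five benchmark columns, and the variable names are made up for this example.

```python
# Per-benchmark scores copied from the table above, in column order:
# MathVista, MathVerse, MathVerse (mini Vision_Only), MM-Math, DynaMath.
baseline = [73.5, 51.3, 47.3, 45.6, 61.2]   # Qwen2.5-VL-72B
vision_r1 = [78.2, 63.2, 57.9, 59.3, 66.4]  # Vision-R1-72B

# Assumption: AVG. is the unweighted mean of the five columns.
baseline_avg = round(sum(baseline) / len(baseline), 1)
vision_r1_avg = round(sum(vision_r1) / len(vision_r1), 1)

# Gains shown in parentheses: Vision-R1-72B minus Qwen2.5-VL-72B.
gains = [round(v - b, 1) for v, b in zip(vision_r1, baseline)]

print(baseline_avg, vision_r1_avg, round(vision_r1_avg - baseline_avg, 1))
print(gains)
```

Under that assumption the check reproduces the reported averages (55.8 and 65.0) and the per-benchmark gains listed in the table.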
## Quickstart
### Using 🤗 Transformers for Inference
You can run inference using the scripts provided in the [official repository](https://github.com/Osilly/Vision-R1). From a local clone of that repository, first install the requirements:
```bash
pip install -r requirements.txt
# Optional: install Flash Attention 2
pip install -U flash-attn --no-build-isolation
```
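
Because Flash Attention 2 is optional, it can help to confirm that it is importable before passing `--enable_flash_attn True`. The following is a minimal Python sketch, assuming the `flash-attn` package is importable as `flash_attn`. The helper name `flash_attn_available` is hypothetical and not part of the repository.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the flash_attn module can be found in this environment."""
    # find_spec only locates the module; it does not import it or touch the GPU.
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    print("flash-attn installed:", flash_attn_available())
```

If this reports `False`, one would presumably pass `--enable_flash_attn False` instead, though how the script handles that value is not documented here.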
Then, run the inference script:
```bash
MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"
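# Quote every expansion so paths with spaces and the multi-line prompt are passed intact.
python3 inference.py \
  --model_path "${MODEL_PATH}" \
  --enable_flash_attn True \
  --image_path "${IMAGE_PATH}" \
  --prompt "${PROMPT}" \
  --max_tokens "${MAX_TOKENS}" \
  --temperature "${TEMP}" \
  --top_p "${TOP_P}"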
```
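
To sanity-check the model's response to the sample prompt, the expected answer can be worked out independently: the lateral surface area of a cone is π·r·s, so r = 1 and s = 3 give 3π (choice B). The Python sketch below is only an illustration. `lateral_area_in_pi` is a hypothetical helper, not part of the repository.

```python
def lateral_area_in_pi(r: float, s: float) -> float:
    """Lateral surface area of a cone as a multiple of pi (A = pi * r * s)."""
    return r * s


# Answer choices from the sample prompt, as coefficients of pi.
choices = {"A": 2, "B": 3, "C": 6, "D": 8}

coefficient = lateral_area_in_pi(r=1, s=3)
answer = next(label for label, value in choices.items() if value == coefficient)
print(f"{coefficient}π -> choice {answer}")
```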
## Citation
If you find our work helpful, please consider citing it:
```bibtex
@article{huang2025visionr1,
title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
journal={arXiv preprint arXiv:2503.06749},
year={2025}
}
```