---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- reasoning
- math
- qwen2.5-vl
- reinforcement-learning
---

# Vision-R1-72B

Vision-R1 is a multimodal large language model (MLLM) built for reasoning. Its reasoning capability is strengthened through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B-parameter version.

- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
- **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)

## Performance

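Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks, improving on Qwen2.5-VL-72B on every benchmark listed and raising the average score from 55.8 to 65.0: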

| Model                      | MathVista   | MathVerse    | MathVerse (mini Vision_Only) | MM-Math      | DynaMath (Overall; Avg) | AVG.         |
| -------------------------- | ----------- | ------------ | ---------------------------- | ------------ | ----------------------- | ------------ |
| Qwen2.5-VL-72B             | 73.5        | 51.3         | 47.3                         | 45.6         | 61.2                    | 55.8         |
| **Vision-R1-72B\* (Ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6)                 | 59.3 (+13.7) | 66.4 (+5.2)             | 65.0 (+9.2)  |

\*: Vision-R1-72B used additional data in RL training.
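
The gains in parentheses and the AVG. column can be re-derived from the per-benchmark scores. The sketch below is only a sanity check: it assumes AVG. is the unweighted mean of the five benchmark columns (which matches the reported values), and the names `BASELINE`, `VISION_R1`, `mean`, and `gains` are illustrative, not part of the official repository.

```python
# Scores copied from the table above, in the order of the table header.
BENCHMARKS = [
    "MathVista",
    "MathVerse",
    "MathVerse (mini Vision_Only)",
    "MM-Math",
    "DynaMath (Overall; Avg)",
]
BASELINE = [73.5, 51.3, 47.3, 45.6, 61.2]   # Qwen2.5-VL-72B
VISION_R1 = [78.2, 63.2, 57.9, 59.3, 66.4]  # Vision-R1-72B


def mean(scores):
    """Unweighted arithmetic mean, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)


def gains(model, baseline):
    """Per-benchmark improvement of `model` over `baseline`, in points."""
    return [round(m - b, 1) for m, b in zip(model, baseline)]


if __name__ == "__main__":
    for name, gain in zip(BENCHMARKS, gains(VISION_R1, BASELINE)):
        print(f"{name}: +{gain}")
    print(f"AVG: {mean(BASELINE)} -> {mean(VISION_R1)}")
```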

## Quickstart

### Using 🤗 Transformers for Inference

You can run inference using the scripts provided in the [official repository](https://github.com/Osilly/Vision-R1). From a local clone of that repository, first install the requirements:

```bash
pip install -r requirements.txt
# Optional: install Flash Attention 2
pip install -U flash-attn --no-build-isolation
```
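
Because Flash Attention 2 is optional, it can help to check whether it is actually importable before passing `--enable_flash_attn True` to the inference script. The following is a minimal sketch, assuming the `flash-attn` package exposes the import name `flash_attn`; the helper `flash_attn_available` is a hypothetical name and is not provided by the repository.

```python
import importlib.util


def flash_attn_available(module_name: str = "flash_attn") -> bool:
    """Return True if `module_name` can be found in the current environment."""
    # find_spec returns None when a top-level module is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True or False; use the result to choose the --enable_flash_attn value.
    print(flash_attn_available())
```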

Then, run the inference script:

```bash
MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

# Quote every expansion so paths and prompts containing spaces stay intact.
python3 inference.py \
    --model_path "${MODEL_PATH}" \
    --enable_flash_attn True \
    --image_path "${IMAGE_PATH}" \
    --prompt "${PROMPT}" \
    --max_tokens "${MAX_TOKENS}" \
    --temperature "${TEMP}" \
    --top_p "${TOP_P}"
```
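
When running many images or prompts, the same command can be assembled programmatically. The sketch below is a hedged example rather than part of the official tooling: it uses only the flags and default values shown in the command above, and the helper name `build_inference_command` and the placeholder prompt are assumptions made for illustration.

```python
import shlex


def build_inference_command(
    image_path,
    prompt,
    model_path="Osilly/Vision-R1-72B",
    max_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    enable_flash_attn=True,
):
    """Build the argument list for inference.py from the flags shown above."""
    return [
        "python3", "inference.py",
        "--model_path", model_path,
        "--enable_flash_attn", str(enable_flash_attn),
        "--image_path", image_path,
        "--prompt", prompt,
        "--max_tokens", str(max_tokens),
        "--temperature", str(temperature),
        "--top_p", str(top_p),
    ]


if __name__ == "__main__":
    # Placeholder inputs; replace with your own image and question.
    cmd = build_inference_command("./path/to/your/image.png", "Your question here")
    # shlex.join quotes each argument so the printed line is safe to paste into a shell.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository directory would execute it without going through a shell, which avoids quoting problems with multi-line prompts.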

## Citation

If you find our work helpful, please consider citing it:

```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```