| | --- |
| | license: apache-2.0 |
| | library_name: transformers |
| | pipeline_tag: image-text-to-text |
| | tags: |
| | - multimodal |
| | - reasoning |
| | - math |
| | - qwen2.5-vl |
| | - reinforcement-learning |
| | --- |
| | |
| | # Vision-R1-72B |
| |
|
| | Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B parameter version. |
| |
|
| | - **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) |
| | - **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1) |
| |
|
| | ## Performance |
| |
|
| | Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks: |
| |
|
| | | Model | MathVista | MathVerse | MathVerse (mini Vision_Only) | MM-Math | DynaMath (Overall; Avg) | AVG. | |
| | | -------------------------- | ----------- | ------------ | ---------------------------- | ------------ | ----------------------- | ------------ | |
| | | Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 | |
| | | **Vision-R1-72B\* (Ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65 (+9.2) | |
| | |
| | \*: Vision-R1-72B used additional data in RL training. |
| | |
| | ## Quickstart |
| | |
| | ### Using 🤗 Transformers for Inference |
| | |
| | You can run inference using the scripts provided in the official repository. First, install the requirements: |
| | |
| | ```bash |
| | pip install -r requirements.txt |
| | # Optional: install Flash Attention 2 |
| | pip install -U flash-attn --no-build-isolation |
| | ``` |
| | |
| | Then, run the inference script: |
| | |
| | ```bash |
| | MODEL_PATH="Osilly/Vision-R1-72B" |
| | TEMP=0.6 |
| | TOP_P=0.95 |
| | MAX_TOKENS=4096 |
| | IMAGE_PATH="./path/to/your/image.png" |
| | PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables. |
| | Choices: |
| | A: 2π |
| | B: 3π |
| | C: 6π |
| | D: 8π" |
| | |
| | python3 inference.py \ |
| | --model_path ${MODEL_PATH} \ |
| | --enable_flash_attn True \ |
| | --image_path ${IMAGE_PATH} \ |
| | --prompt "${PROMPT}" \ |
| | --max_tokens ${MAX_TOKENS} \ |
| | --temperature ${TEMP} \ |
| | --top_p ${TOP_P} |
| | ``` |
| | |
| | ## Citation |
| | If you find our work helpful, please consider citing it: |
| | ```bibtex |
| | @article{huang2025visionr1, |
| | title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models}, |
| | author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin}, |
| | journal={arXiv preprint arXiv:2503.06749}, |
| | year={2025} |
| | } |
| | ``` |