File size: 2,867 Bytes
adb94c1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Vision-R1-7B

Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).

The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:
1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
2. **Reinforcement Learning**: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.

## Links
- **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749)
- **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
- **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold)
- **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl)

## Performance

| Model                      | MathVista   | MathVerse    | MM-Math      | DynaMath (Overall; Avg) | AVG.         |
| -------------------------- | ----------- | ------------ | ------------ | ----------------------- | ------------ |
| Qwen2.5-VL-7B              | 68.1        | 46.7         | 34.1         | 50.7                    | 47.9         |
| **Vision-R1-7B (Ours)**    | **73.5 (+5.4)** | **52.4 (+5.7)**  | **40.2 (+6.1)**  | **56.3 (+5.6)**             | **53.8 (+5.9)**  |

## Usage

### Using 🤗 Transformers

To run inference using the script provided in the [official repository](https://github.com/Osilly/Vision-R1):

```bash
# Inference script for Vision-R1-7B model using transformers
MODEL_PATH="Osilly/Vision-R1-7B" 
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH}  \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}
```

## Citation
```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```