Improve model card and add metadata

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +77 -3
README.md CHANGED
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- reasoning
- math
- qwen2.5-vl
- reinforcement-learning
---

# Vision-R1-72B

Vision-R1 is a reasoning multimodal large language model (MLLM) that strengthens reasoning capability through reinforcement learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B-parameter version.

- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
- **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)

## Performance

Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:

| Model | MathVista | MathVerse | MathVerse (mini Vision_Only) | MM-Math | DynaMath (Overall; Avg) | Avg. |
| -------------------------- | ----------- | ------------ | ---------------------------- | ------------ | ----------------------- | ------------ |
| Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 |
| **Vision-R1-72B\* (Ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65.0 (+9.2) |

\*: Vision-R1-72B used additional data during RL training.

## Quickstart

### Using 🤗 Transformers for Inference

You can run inference with the scripts provided in the official repository. First, clone the repository and install its requirements:

```bash
git clone https://github.com/Osilly/Vision-R1.git
cd Vision-R1
pip install -r requirements.txt
# Optional: install Flash Attention 2 for faster attention kernels
pip install -U flash-attn --no-build-isolation
```

Then run the inference script:

```bash
MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
  --model_path ${MODEL_PATH} \
  --enable_flash_attn True \
  --image_path ${IMAGE_PATH} \
  --prompt "${PROMPT}" \
  --max_tokens ${MAX_TOKENS} \
  --temperature ${TEMP} \
  --top_p ${TOP_P}
```
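
If you prefer to call 🤗 Transformers directly instead of going through the repository's `inference.py`, the script above can be approximated as follows. This is a minimal sketch, assuming the model follows the standard Qwen2.5-VL chat format (`Qwen2_5_VLForConditionalGeneration` plus `AutoProcessor`); `build_messages` and `run_inference` are hypothetical helper names introduced here, and the exact prompt template used in training may differ, so defer to the official repository.

```python
from typing import Dict, List

def build_messages(image_path: str, prompt: str) -> List[Dict]:
    """Build a single-turn chat message in the Qwen2.5-VL style:
    one image followed by the text question. (Hypothetical helper.)"""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    }]

def run_inference(image_path: str, prompt: str,
                  model_id: str = "Osilly/Vision-R1-72B") -> str:
    # Heavy: downloads ~72B weights and requires multiple high-memory GPUs.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # shard the model across available GPUs
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # Render the chat template to a prompt string, then tokenize it with the image.
    messages = build_messages(image_path, prompt)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    # Sampling settings mirror the shell script: temperature 0.6, top_p 0.95.
    output = model.generate(
        **inputs, max_new_tokens=4096, do_sample=True,
        temperature=0.6, top_p=0.95,
    )
    # Strip the prompt tokens before decoding the generated answer.
    trimmed = output[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
```

The generation arguments simply mirror the shell script above; adjust `max_new_tokens` if the model's long chain-of-thought gets truncated.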

## Citation

If you find our work helpful, please consider citing it:

```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```