nielsr (HF Staff) committed
Commit f384e06 · verified · 1 Parent(s): 52300e6

Improve model card and add metadata


Hi! I'm Niels from the community science team at Hugging Face.

I've opened this PR to improve the model card for Vision-R1-72B. The updates include:
- Adding YAML metadata for `library_name` (`transformers`), `pipeline_tag` (`image-text-to-text`), and relevant tags.
- Linking the model to the original paper: [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).
- Adding the official GitHub repository link.
- Including a performance summary and sample usage instructions found in the documentation.

These additions will make the model more discoverable and provide better context for users on the Hub.

Files changed (1)
  1. README.md +77 -3
README.md CHANGED
@@ -1,3 +1,77 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - reasoning
+ - math
+ - qwen2.5-vl
+ - reinforcement-learning
+ ---
+
+ # Vision-R1-72B
+
+ Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B-parameter version.
+
+ - **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
+ - **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
+
+ ## Performance
+
+ Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:
+
+ | Model | MathVista | MathVerse | MathVerse (mini Vision_Only) | MM-Math | DynaMath (Overall; Avg) | AVG. |
+ | -------------------------- | ----------- | ------------ | ---------------------------- | ------------ | ----------------------- | ------------ |
+ | Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 |
+ | **Vision-R1-72B\* (Ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65.0 (+9.2) |
+
+ \*: Vision-R1-72B used additional data in RL training.
+
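As a quick check on the table, the AVG. column is the unweighted mean of the five benchmark columns (scores copied from the rows above):

```python
# Scores from the table above; AVG. is the unweighted mean of the five benchmarks.
qwen_scores = [73.5, 51.3, 47.3, 45.6, 61.2]       # Qwen2.5-VL-72B
vision_r1_scores = [78.2, 63.2, 57.9, 59.3, 66.4]  # Vision-R1-72B

qwen_avg = round(sum(qwen_scores) / len(qwen_scores), 1)                 # 55.8
vision_r1_avg = round(sum(vision_r1_scores) / len(vision_r1_scores), 1)  # 65.0
delta = round(vision_r1_avg - qwen_avg, 1)                               # 9.2
print(qwen_avg, vision_r1_avg, delta)
```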
+ ## Quickstart
+
+ ### Using 🤗 Transformers for Inference
+
+ You can run inference using the scripts provided in the official repository. First, install the requirements:
+
+ ```bash
+ pip install -r requirements.txt
+ # Optional: install Flash Attention 2
+ pip install -U flash-attn --no-build-isolation
+ ```
+
+ Then, run the inference script:
+
+ ```bash
+ MODEL_PATH="Osilly/Vision-R1-72B"
+ TEMP=0.6
+ TOP_P=0.95
+ MAX_TOKENS=4096
+ IMAGE_PATH="./path/to/your/image.png"
+ PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
+ Choices:
+ A: 2π
+ B: 3π
+ C: 6π
+ D: 8π"
+
+ python3 inference.py \
+     --model_path ${MODEL_PATH} \
+     --enable_flash_attn True \
+     --image_path ${IMAGE_PATH} \
+     --prompt "${PROMPT}" \
+     --max_tokens ${MAX_TOKENS} \
+     --temperature ${TEMP} \
+     --top_p ${TOP_P}
+ ```
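The `TEMP=0.6` / `TOP_P=0.95` settings above correspond to standard temperature scaling followed by nucleus (top-p) filtering. A toy, self-contained illustration of what those two knobs do to a token distribution (not the repository's implementation):

```python
import math

def top_p_filter(logits, temperature=0.6, top_p=0.95):
    """Temperature-scale logits, softmax, then keep the smallest set of
    tokens whose cumulative probability reaches top_p; renormalize."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:                            # accumulate the "nucleus"
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]

dist = top_p_filter([2.0, 1.0, 0.1, -1.0])
print(dist)  # low-probability tail tokens are zeroed out
```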
+
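For readers who prefer calling 🤗 Transformers directly rather than the repository's `inference.py`, the image and prompt are typically packed into the Qwen2.5-VL chat-message layout — a reasonable assumption here given the model's `qwen2.5-vl` tag. A minimal sketch; the model-loading step in the comment is illustrative only and requires hardware that can hold a 72B checkpoint:

```python
def build_messages(image_path, prompt):
    """Pack one image path and one text prompt into the Qwen2.5-VL-style
    chat-message layout (an assumption based on the model's qwen2.5-vl tag)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_messages("./path/to/your/image.png",
                          "Determine the lateral surface area of the cone.")

# With sufficient GPU memory, these messages would then go through the usual
# Transformers multimodal pipeline, roughly:
#
#   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
#   processor = AutoProcessor.from_pretrained("Osilly/Vision-R1-72B")
#   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#       "Osilly/Vision-R1-72B", torch_dtype="auto", device_map="auto")
#   text = processor.apply_chat_template(
#       messages, tokenize=False, add_generation_prompt=True)
```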
+ ## Citation
+
+ If you find our work helpful, please consider citing it:
+
+ ```bibtex
+ @article{huang2025visionr1,
+   title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
+   author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
+   journal={arXiv preprint arXiv:2503.06749},
+   year={2025}
+ }
+ ```