Improve model card: add metadata, paper/code links, performance and usage

#1
by nielsr (HF Staff) - opened

Files changed (1): README.md (+66 -3)
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Vision-R1-7B

Vision-R1 is a multimodal reasoning model designed to improve the reasoning capabilities of MLLMs. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).

The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:
1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
2. **Reinforcement Learning**: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes such as questioning and self-reflection.
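
The group-relative advantage at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not the actual training code, and the reward values are made up: each sampled response's reward is normalized by the mean and standard deviation of its own group.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each response's reward by the
    mean and standard deviation of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for one group of sampled responses to a single prompt (illustrative).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advs])
# Responses above the group mean get a positive advantage, those below a negative one.
```

Because the baseline is the group mean rather than a learned value function, the advantages of a group always sum to (approximately) zero.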

## Links
- **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749)
- **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
- **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold)
- **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl)

## Performance

| Model | MathVista | MathVerse | MM-Math | DynaMath (Overall; Avg) | Avg. |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B | 68.1 | 46.7 | 34.1 | 50.7 | 47.9 |
| **Vision-R1-7B (ours)** | **73.5 (+5.4)** | **52.4 (+5.7)** | **40.2 (+6.1)** | **56.3 (+5.6)** | **53.8 (+5.9)** |

## Usage

### Using 🤗 Transformers

To run inference using the script provided in the [official repository](https://github.com/Osilly/Vision-R1):

```bash
# Inference script for Vision-R1-7B using transformers
MODEL_PATH="Osilly/Vision-R1-7B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH} \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}
```
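
As a quick sanity check on the example prompt above, the expected answer can be computed directly: the lateral surface area of a cone is π·r·s, so with r = 1 and s = 3 the correct choice is B (3π).

```python
import math

# Lateral surface area of a cone: pi * r * s (r = base radius, s = slant height)
r, s = 1, 3
lateral_area = math.pi * r * s

choices = {"A": 2 * math.pi, "B": 3 * math.pi, "C": 6 * math.pi, "D": 8 * math.pi}
answer = min(choices, key=lambda k: abs(choices[k] - lateral_area))
print(answer)  # B
```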

## Citation

```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```