Improve model card: add metadata, paper/code links, performance and usage
#1
by
nielsr HF Staff - opened
README.md
CHANGED
|
@@ -1,3 +1,66 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
library_name: transformers
|
| 4 |
+
pipeline_tag: image-text-to-text
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
+
# Vision-R1-7B
|
| 8 |
+
|
| 9 |
+
Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).
|
| 10 |
+
|
| 11 |
+
The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:
|
| 12 |
+
1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
|
| 13 |
+
2. **Reinforcement Learning**: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.
|
| 14 |
+
|
| 15 |
+
## Links
|
| 16 |
+
- **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749)
|
| 17 |
+
- **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
|
| 18 |
+
- **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold)
|
| 19 |
+
- **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl)
|
| 20 |
+
|
| 21 |
+
## Performance
|
| 22 |
+
|
| 23 |
+
| Model | MathVista | MathVerse | MM-Math | DynaMath (Overall; Avg) | AVG. |
|
| 24 |
+
| -------------------------- | ----------- | ------------ | ------------ | ----------------------- | ------------ |
|
| 25 |
+
| Qwen2.5-VL-7B | 68.1 | 46.7 | 34.1 | 50.7 | 47.9 |
|
| 26 |
+
| **Vision-R1-7B (Ours)** | **73.5 (+5.4)** | **52.4 (+5.7)** | **40.2 (+6.1)** | **56.3 (+5.6)** | **53.8 (+5.9)** |
|
| 27 |
+
|
| 28 |
+
## Usage
|
| 29 |
+
|
| 30 |
+
### Using 🤗 Transformers
|
| 31 |
+
|
| 32 |
+
To run inference using the script provided in the [official repository](https://github.com/Osilly/Vision-R1):
|
| 33 |
+
|
| 34 |
+
```bash
|
| 35 |
+
# Inference script for Vision-R1-7B model using transformers
|
| 36 |
+
MODEL_PATH="Osilly/Vision-R1-7B"
|
| 37 |
+
TEMP=0.6
|
| 38 |
+
TOP_P=0.95
|
| 39 |
+
MAX_TOKENS=4096
|
| 40 |
+
IMAGE_PATH="./figs/example1.png"
|
| 41 |
+
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
|
| 42 |
+
Choices:
|
| 43 |
+
A: 2π
|
| 44 |
+
B: 3π
|
| 45 |
+
C: 6π
|
| 46 |
+
D: 8π"
|
| 47 |
+
|
| 48 |
+
python3 inference.py \
|
| 49 |
+
--model_path ${MODEL_PATH} \
|
| 50 |
+
--enable_flash_attn True \
|
| 51 |
+
--image_path ${IMAGE_PATH} \
|
| 52 |
+
--prompt "${PROMPT}" \
|
| 53 |
+
--max_tokens ${MAX_TOKENS} \
|
| 54 |
+
--temperature ${TEMP} \
|
| 55 |
+
--top_p ${TOP_P}
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
## Citation
|
| 59 |
+
```bibtex
|
| 60 |
+
@article{huang2025visionr1,
|
| 61 |
+
title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
|
| 62 |
+
author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
|
| 63 |
+
journal={arXiv preprint arXiv:2503.06749},
|
| 64 |
+
year={2025}
|
| 65 |
+
}
|
| 66 |
+
```
|