Improve model card and add metadata (#1)
Opened by nielsr (HF Staff)

README.md (changed):
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- reasoning
- math
- qwen2.5-vl
- reinforcement-learning
---

# Vision-R1-72B

Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through reinforcement learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B-parameter version.

- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
- **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
## Performance

Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:

| Model | MathVista | MathVerse | MathVerse (mini, Vision-Only) | MM-Math | DynaMath (Overall Avg.) | Avg. |
| -------------------------- | ----------- | ------------ | ----------------------------- | ------------ | ----------------------- | ----------- |
| Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 |
| **Vision-R1-72B\* (ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65.0 (+9.2) |

\* Vision-R1-72B used additional data in RL training.
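The Avg. column above is the unweighted mean of the five benchmark scores. As a quick sanity check of the table's arithmetic (plain Python, not part of the official evaluation code):

```python
# Recompute the Avg. column as the unweighted mean of the five benchmark scores.
qwen25_vl_72b = [73.5, 51.3, 47.3, 45.6, 61.2]
vision_r1_72b = [78.2, 63.2, 57.9, 59.3, 66.4]

avg_qwen = round(sum(qwen25_vl_72b) / len(qwen25_vl_72b), 1)
avg_vr1 = round(sum(vision_r1_72b) / len(vision_r1_72b), 1)

print(avg_qwen)                       # 55.8
print(avg_vr1)                        # 65.0
print(round(avg_vr1 - avg_qwen, 1))   # 9.2
```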
## Quickstart

### Using 🤗 Transformers for Inference

You can run inference using the scripts provided in the official repository. First, install the requirements:

```bash
pip install -r requirements.txt
# Optional: install FlashAttention 2
pip install -U flash-attn --no-build-isolation
```
Then, run the inference script:

```bash
MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH} \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}
```
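For reference, the example prompt's correct answer can be checked with standard geometry (this check is an illustration, not part of the repository): the lateral surface area of a cone is π·r·s, so with r = 1 and s = 3 it equals 3π, i.e. choice B.

```python
import math

# Values from the example prompt above: base radius r = 1, slant height s = 3.
r, s = 1, 3
lateral_area = math.pi * r * s   # lateral surface area of a cone: pi * r * s
print(lateral_area / math.pi)    # 3.0, i.e. 3*pi -> choice B
```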
## Citation

If you find our work helpful, please consider citing it:

```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```
|