Improve model card and add metadata (#1)
Opened by nielsr (HF Staff)

README.md (changed):
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- reasoning
- math
- qwen2.5-vl
- reinforcement-learning
---

# Vision-R1-72B

Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through reinforcement learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B-parameter version.

- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
- **GitHub:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
## Performance

Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:

| Model | MathVista | MathVerse | MathVerse (mini, Vision-Only) | MM-Math | DynaMath (Overall Avg.) | Avg. |
| -------------------------- | ----------- | ------------ | ----------------------------- | ------------ | ----------------------- | ----------- |
| Qwen2.5-VL-72B | 73.5 | 51.3 | 47.3 | 45.6 | 61.2 | 55.8 |
| **Vision-R1-72B\* (ours)** | 78.2 (+4.7) | 63.2 (+11.9) | 57.9 (+10.6) | 59.3 (+13.7) | 66.4 (+5.2) | 65.0 (+9.2) |

\* Vision-R1-72B used additional data in RL training.
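The Avg. column above is the unweighted mean of the five benchmark scores. As a quick sanity check of the table's arithmetic (plain Python, not part of the official evaluation code):

```python
# Recompute the Avg. column as the unweighted mean of the five benchmark scores.
qwen25_vl_72b = [73.5, 51.3, 47.3, 45.6, 61.2]
vision_r1_72b = [78.2, 63.2, 57.9, 59.3, 66.4]

avg_qwen = round(sum(qwen25_vl_72b) / len(qwen25_vl_72b), 1)
avg_vr1 = round(sum(vision_r1_72b) / len(vision_r1_72b), 1)

print(avg_qwen)                       # 55.8
print(avg_vr1)                        # 65.0
print(round(avg_vr1 - avg_qwen, 1))   # 9.2
```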
## Quickstart

### Using 🤗 Transformers for Inference

You can run inference using the scripts provided in the official repository. First, install the requirements:

```bash
pip install -r requirements.txt
# Optional: install FlashAttention 2
pip install -U flash-attn --no-build-isolation
```
Then, run the inference script:

```bash
MODEL_PATH="Osilly/Vision-R1-72B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./path/to/your/image.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"

python3 inference.py \
    --model_path ${MODEL_PATH} \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens ${MAX_TOKENS} \
    --temperature ${TEMP} \
    --top_p ${TOP_P}
```
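For reference, the example prompt's correct answer can be checked with standard geometry (this check is an illustration, not part of the repository): the lateral surface area of a cone is π·r·s, so with r = 1 and s = 3 it equals 3π, i.e. choice B.

```python
import math

# Values from the example prompt above: base radius r = 1, slant height s = 3.
r, s = 1, 3
lateral_area = math.pi * r * s   # lateral surface area of a cone: pi * r * s
print(lateral_area / math.pi)    # 3.0, i.e. 3*pi -> choice B
```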
## Citation

If you find our work helpful, please consider citing it:

```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```
|