Osilly/Vision-R1-7B

Improve model card: add metadata, paper/code links, performance and usage

by nielsr HF Staff - opened Mar 7

base: refs/heads/main ← from: refs/pr/1

README.md +66 -3


@@ -1,3 +1,66 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+# Vision-R1-7B
+Vision-R1 is a reasoning-focused multimodal large language model (MLLM). It was introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).
+The model is built on Qwen2.5-VL-7B and fine-tuned with a two-stage pipeline:
+1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
+2. **Reinforcement Learning**: Training with Group Relative Policy Optimization (GRPO) under a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning behaviors such as questioning and self-reflection.
+## Links
+- **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749)
+- **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
+- **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold)
+- **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl)
+## Performance
+| Model                      | MathVista   | MathVerse    | MM-Math      | DynaMath (Overall; Avg) | AVG.         |
+| -------------------------- | ----------- | ------------ | ------------ | ----------------------- | ------------ |
+| Qwen2.5-VL-7B              | 68.1        | 46.7         | 34.1         | 50.7                    | 47.9         |
+| **Vision-R1-7B (Ours)**    | **73.5 (+5.4)** | **52.4 (+5.7)**  | **40.2 (+6.1)**  | **56.3 (+5.6)**             | **53.8 (+5.9)**  |
+## Usage
+### Using 🤗 Transformers
+Run inference with the `inference.py` script from the [official repository](https://github.com/Osilly/Vision-R1):
+```bash
+# Inference script for Vision-R1-7B model using transformers
+MODEL_PATH="Osilly/Vision-R1-7B"
+TEMP=0.6
+TOP_P=0.95
+MAX_TOKENS=4096
+IMAGE_PATH="./figs/example1.png"
+PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
+Choices:
+A: 2π
+B: 3π
+C: 6π
+D: 8π"
+python3 inference.py \
+    --model_path "${MODEL_PATH}" \
+    --enable_flash_attn True \
+    --image_path "${IMAGE_PATH}" \
+    --prompt "${PROMPT}" \
+    --max_tokens "${MAX_TOKENS}" \
+    --temperature "${TEMP}" \
+    --top_p "${TOP_P}"
+```
+## Citation
+```bibtex
+@article{huang2025visionr1,
+  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
+  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
+  journal={arXiv preprint arXiv:2503.06749},
+  year={2025}
+}
+```
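
Note on the GRPO step named in the README above: the core idea of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that normalization step. It is not taken from the Vision-R1 code base; the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rewards in its sampling group.

    Hypothetical helper: `eps` and the population standard deviation are
    illustrative choices, not settings confirmed by the Vision-R1 authors.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses sampled from one prompt
# (1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.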