Vision-R1-7B / README.md
nielsr's picture
nielsr HF Staff
Improve model card: add metadata, paper/code links, performance and usage
adb94c1 verified
|
raw
history blame
2.87 kB
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---
# Vision-R1-7B
Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).
The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:
1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
2. **Reinforcement Learning**: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.
## Links
- **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749)
- **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
- **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold)
- **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl)
## Performance
| Model | MathVista | MathVerse | MM-Math | DynaMath (Overall; Avg) | AVG. |
| -------------------------- | ----------- | ------------ | ------------ | ----------------------- | ------------ |
| Qwen2.5-VL-7B | 68.1 | 46.7 | 34.1 | 50.7 | 47.9 |
| **Vision-R1-7B (Ours)** | **73.5 (+5.4)** | **52.4 (+5.7)** | **40.2 (+6.1)** | **56.3 (+5.6)** | **53.8 (+5.9)** |
## Usage
### Using 🤗 Transformers
To run inference using the script provided in the [official repository](https://github.com/Osilly/Vision-R1):
```bash
# Inference script for Vision-R1-7B model using transformers
MODEL_PATH="Osilly/Vision-R1-7B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
Choices:
A: 2π
B: 3π
C: 6π
D: 8π"
python3 inference.py \
--model_path ${MODEL_PATH} \
--enable_flash_attn True \
--image_path ${IMAGE_PATH} \
--prompt "${PROMPT}" \
--max_tokens ${MAX_TOKENS} \
--temperature ${TEMP} \
--top_p ${TOP_P}
```
## Citation
```bibtex
@article{huang2025visionr1,
title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
journal={arXiv preprint arXiv:2503.06749},
year={2025}
}
```