---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# Vision-R1-7B

Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749).

The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline:

1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging.
2. **Reinforcement Learning**: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection.

## Links

- **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749)
- **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
- **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold)
- **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl)

## Performance

| Model                   | MathVista       | MathVerse       | MM-Math         | DynaMath (Overall; Avg) | AVG.            |
| ----------------------- | --------------- | --------------- | --------------- | ----------------------- | --------------- |
| Qwen2.5-VL-7B           | 68.1            | 46.7            | 34.1            | 50.7                    | 47.9            |
| **Vision-R1-7B (Ours)** | **73.5 (+5.4)** | **52.4 (+5.7)** | **40.2 (+6.1)** | **56.3 (+5.6)**         | **53.8 (+5.9)** |

## Usage

### Using 🤗 Transformers

To run inference using the script provided in the [official repository](https://github.com/Osilly/Vision-R1):

```bash
# Inference script for Vision-R1-7B model using transformers
MODEL_PATH="Osilly/Vision-R1-7B"
TEMP=0.6
TOP_P=0.95
MAX_TOKENS=4096
IMAGE_PATH="./figs/example1.png"
PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables. Choices: A: 2π B: 3π C: 6π D: 8π"

python3 inference.py \
  --model_path ${MODEL_PATH} \
  --enable_flash_attn True \
  --image_path ${IMAGE_PATH} \
  --prompt "${PROMPT}" \
  --max_tokens ${MAX_TOKENS} \
  --temperature ${TEMP} \
  --top_p ${TOP_P}
```

## Citation

```bibtex
@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}
```
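
## Note: GRPO advantage sketch

The reinforcement learning stage described at the top of this card uses Group Relative Policy Optimization (GRPO). As a rough illustration of the core idea only, the sketch below scores each sampled response relative to the other responses drawn for the same prompt by standardizing rewards within the group. It is not the Vision-R1 training code: the function name `group_relative_advantages`, the use of the population standard deviation, the `eps` stabilizer, and the example rewards are all assumptions made for illustration, and PTST is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Hypothetical helper for illustration: each advantage is
    (reward - group mean) / (group std + eps). Population std and
    the eps term are assumptions, not taken from the Vision-R1 code.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses to the same prompt
# (for example, 1.0 for a correct final answer, 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones. When every response in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.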