| | --- |
| | license: apache-2.0 |
| | library_name: transformers |
| | pipeline_tag: image-text-to-text |
| | --- |
| | |
| | # Vision-R1-7B |
| |
|
| | Vision-R1 is a multimodal reasoning model designed to improve reasoning capabilities in MLLMs. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749). |
| |
|
| | The model is built upon the Qwen2.5-VL-7B architecture and fine-tuned using a specialized two-stage pipeline: |
| | 1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold) constructed via modality bridging. |
| | 2. **Reinforcement Learning**: Using Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy to incentivize complex reasoning processes like questioning and self-reflection. |
| |
|
| | ## Links |
| | - **Paper**: [https://huggingface.co/papers/2503.06749](https://huggingface.co/papers/2503.06749) |
| | - **Code**: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1) |
| | - **Cold-start Dataset**: [Osilly/Vision-R1-cold](https://huggingface.co/datasets/Osilly/Vision-R1-cold) |
| | - **RL Dataset**: [Osilly/Vision-R1-rl](https://huggingface.co/datasets/Osilly/Vision-R1-rl) |
| |
|
| | ## Performance |
| |
|
| | | Model | MathVista | MathVerse | MM-Math | DynaMath (Overall; Avg) | AVG. | |
| | | -------------------------- | ----------- | ------------ | ------------ | ----------------------- | ------------ | |
| | | Qwen2.5-VL-7B | 68.1 | 46.7 | 34.1 | 50.7 | 47.9 | |
| | | **Vision-R1-7B (Ours)** | **73.5 (+5.4)** | **52.4 (+5.7)** | **40.2 (+6.1)** | **56.3 (+5.6)** | **53.8 (+5.9)** | |
| |
|
| | ## Usage |
| |
|
| | ### Using 🤗 Transformers |
| |
|
| | To run inference using the script provided in the [official repository](https://github.com/Osilly/Vision-R1): |
| |
|
| | ```bash |
| | # Inference script for Vision-R1-7B model using transformers |
| | MODEL_PATH="Osilly/Vision-R1-7B" |
| | TEMP=0.6 |
| | TOP_P=0.95 |
| | MAX_TOKENS=4096 |
| | IMAGE_PATH="./figs/example1.png" |
| | PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables. |
| | Choices: |
| | A: 2π |
| | B: 3π |
| | C: 6π |
| | D: 8π" |
| | |
| | python3 inference.py \ |
| | --model_path ${MODEL_PATH} \ |
| | --enable_flash_attn True \ |
| | --image_path ${IMAGE_PATH} \ |
| | --prompt "${PROMPT}" \ |
| | --max_tokens ${MAX_TOKENS} \ |
| | --temperature ${TEMP} \ |
| | --top_p ${TOP_P} |
| | ``` |
| |
|
| | ## Citation |
| | ```bibtex |
| | @article{huang2025visionr1, |
| | title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models}, |
| | author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui}, |
| | journal={arXiv preprint arXiv:2503.06749}, |
| | year={2025} |
| | } |
| | ``` |