Osilly
/

Vision-R1-32B

Model card Files Files and versions

Vision-R1-32B / README.md

nielsr's picture

nielsr HF Staff

Add model card for Vision-R1-32B

930e07b verified 26 days ago

|

3.05 kB

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- multimodal
	- reasoning
	- math
	- r1
	---

	# Vision-R1-32B

	Vision-R1-32B is a multimodal reasoning model introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749). It is based on the Qwen2.5-VL-32B architecture and is specifically optimized to enhance reasoning capabilities (such as self-reflection and questioning) in multimodal tasks.

	- Paper: [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
	- Repository: [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)

	## Model Description

	Vision-R1 addresses the difficulty of activating complex reasoning in MLLMs without human-annotated reasoning data. The model was developed using a two-stage pipeline:
	1. Cold-start Initialization: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold).
	2. Reinforcement Learning (RL): Utilizing Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy. This strategy gradually increases the reasoning length (4K -> 8K -> 16K) to refine the model's ability to learn complex reasoning processes.

	## Performance

	Vision-R1-32B demonstrates strong performance across various multimodal math reasoning benchmarks, significantly outperforming its base model:

	\| Model \| MathVista \| MathVerse \| MathVerse (mini) \| MM-Math \| DynaMath (Avg) \| AVG. \|
	\| -------------------------- \| ----------- \| ------------ \| ---------------- \| ------------ \| -------------- \| ------------ \|
	\| Qwen2.5-VL-32B \| 72.9 \| 52.3 \| 47.6 \| 34.9 \| 55.5 \| 52.6 \|
	\| Vision-R1-32B (Ours) \| 76.4 \| 62.1 \| 59.0 \| 55.3 \| 65.6 \| 63.7 \|

	## Quickstart

	### Inference via Transformers

	You can use the inference script provided in the [official repository](https://github.com/Osilly/Vision-R1).

	```bash
	# Inference script for Vision-R1-32B model
	MODEL_PATH="Osilly/Vision-R1-32B"
	IMAGE_PATH="path/to/your/image.png"
	PROMPT="Your math problem or question here."

	python3 inference.py \
	--model_path ${MODEL_PATH} \
	--enable_flash_attn True \
	--image_path ${IMAGE_PATH} \
	--prompt "${PROMPT}" \
	--max_tokens 4096 \
	--temperature 0.6 \
	--top_p 0.95
	```

	The model is also compatible with vLLM (version > 0.7.2) for faster deployment and local inference.

	## Citation

	If you find Vision-R1 useful, please cite the following paper:

	```bibtex
	@article{huang2025visionr1,
	title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
	author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
	journal={arXiv preprint arXiv:2503.06749},
	year={2025}
	}
	```