Osilly
/

Vision-R1-72B

Model card Files Files and versions

Vision-R1-72B / README.md

nielsr's picture

nielsr HF Staff

Improve model card and add metadata

f384e06 verified 7 days ago

|

2.71 kB

	---
	license: apache-2.0
	library_name: transformers
	pipeline_tag: image-text-to-text
	tags:
	- multimodal
	- reasoning
	- math
	- qwen2.5-vl
	- reinforcement-learning
	---

	# Vision-R1-72B

	Vision-R1 is a reasoning multimodal large language model (MLLM) designed to enhance reasoning capabilities through Reinforcement Learning (RL) and a novel Progressive Thinking Suppression Training (PTST) strategy. This repository contains the 72B parameter version.

	- Paper: [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
	- GitHub: [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)

	## Performance

	Vision-R1-72B achieves state-of-the-art results on multimodal math reasoning benchmarks:

	\| Model \| MathVista \| MathVerse \| MathVerse (mini Vision_Only) \| MM-Math \| DynaMath (Overall; Avg) \| AVG. \|
	\| -------------------------- \| ----------- \| ------------ \| ---------------------------- \| ------------ \| ----------------------- \| ------------ \|
	\| Qwen2.5-VL-72B \| 73.5 \| 51.3 \| 47.3 \| 45.6 \| 61.2 \| 55.8 \|
	\| *Vision-R1-72B\ (Ours)** \| 78.2 (+4.7) \| 63.2 (+11.9) \| 57.9 (+10.6) \| 59.3 (+13.7) \| 66.4 (+5.2) \| 65 (+9.2) \|

	\*: Vision-R1-72B used additional data in RL training.

	## Quickstart

	### Using 🤗 Transformers for Inference

	You can run inference using the scripts provided in the official repository. First, install the requirements:

	```bash
	pip install -r requirements.txt
	# Optional: install Flash Attention 2
	pip install -U flash-attn --no-build-isolation
	```

	Then, run the inference script:

	```bash
	MODEL_PATH="Osilly/Vision-R1-72B"
	TEMP=0.6
	TOP_P=0.95
	MAX_TOKENS=4096
	IMAGE_PATH="./path/to/your/image.png"
	PROMPT="Given a cone with a base radius represented by the variable 'r' (r = 1) and a slant height represented by the variable 's' (s = 3), determine the lateral surface area using variables.
	Choices:
	A: 2π
	B: 3π
	C: 6π
	D: 8π"

	python3 inference.py \
	--model_path ${MODEL_PATH} \
	--enable_flash_attn True \
	--image_path ${IMAGE_PATH} \
	--prompt "${PROMPT}" \
	--max_tokens ${MAX_TOKENS} \
	--temperature ${TEMP} \
	--top_p ${TOP_P}
	```

	## Citation
	If you find our work helpful, please consider citing it:
	```bibtex
	@article{huang2025visionr1,
	title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
	author={Wenxuan Huang and Bohan Jia and Zijie Zhai and Shaosheng Cao and Zheyu Ye and Fei Zhao and Yao Hu and Shaohui Lin},
	journal={arXiv preprint arXiv:2503.06749},
	year={2025}
	}
	```