csfufu
/

Revisual-R1-Coldstart

Image-Text-to-Text

text-generation-inference

Model card Files Files and versions

Revisual-R1-Coldstart / README.md

nielsr's picture

nielsr HF Staff

Update pipeline tag

0389ee1 verified 7 months ago

|

1.45 kB

	---
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	language:
	- en
	library_name: transformers
	license: apache-2.0
	pipeline_tag: image-text-to-text
	tags:
	- transformers
	- multimodal
	---

	## 🌟 ReVisual-R1 (7B) — Open-Source Multimodal Reasoner

	> One cold-start, two RL stages, endless reasoning power.

	---

	### 🔑 Highlights

	* SOTA on 9 tough benchmarks covering visual–math + text reasoning.
	* Three-Stage SRO Training

	1. Text Cold-Start — seed deep reflection
	2. Multimodal RL — align vision & logic
	3. Text RL — polish fluency & brevity
	* PAD (Prioritized Advantage Distillation) keeps gradients alive.
	* Efficient-Length Reward = concise, self-reflective CoT.

	---

	### 📚 Resources

	* [Paper](https://arxiv.org/abs/2506.04207)
	* [Code](https://github.com/CSfufu/Revisual-R1)


	---

	### 📌 Citation

	```bibtex
	@misc{chen2025advancingmultimodalreasoningoptimized,
	title = {Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning},
	author = {Shuang Chen and Yue Guo and Zhaochen Su and Yafu Li and Yulun Wu and Jiacheng Chen and
	Jiayu Chen and Weijie Wang and Xiaoye Qu and Yu Cheng},
	year = {2025},
	eprint = {2506.04207},
	archivePrefix = {arXiv},
	primaryClass = {cs.LG},
	url = {https://arxiv.org/abs/2506.04207}
	}
	```

	Take ReVisual-R1 for a spin and let us know what you build! 🎯