---
license: mit
datasets:
- PAPOGalaxy/PAPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VGPO
- Reinforcement Learning
- Multimodal Reasoning
- Visual Attention Compensation
---

# Model Card for VGPO-RL-7B


## 📖 Overview of VGPO


Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to **signal dilution**: generic text tokens receive the same reinforcement as the critical, visually grounded reasoning steps. Meanwhile, **temporal visual forgetting** causes attention to visual inputs to decay progressively as reasoning chains extend.
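To make the dilution concrete, here is a minimal sketch of standard GRPO-style credit assignment, where one group-normalized scalar advantage is broadcast to every token of a rollout. This is illustrative only; the function name and shapes are ours, not from the VGPO codebase:

```python
import numpy as np

def broadcast_token_advantages(rewards, seq_len):
    """Standard RLVR/GRPO credit assignment: one group-normalized scalar
    advantage per rollout, copied to every generated token."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    # Every token in a rollout receives the identical signal, whether it is
    # a filler word or a visually grounded reasoning step.
    return np.repeat(adv[:, None], seq_len, axis=1)            # (G, T)
```

Because each row is constant, visually critical tokens receive no extra credit relative to generic text.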

VGPO addresses these issues through three key mechanisms:
- **Visual Attention Compensation (VAC):** Uses the inherent hidden-state similarity between generated tokens and image tokens as a *Visual Focus Score* to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
- **Intra-Trajectory Re-weighting:** At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually grounded tokens.
- **Inter-Trajectory Re-weighting:** At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.

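A rough sketch of how these three mechanisms could compose, taking the visual focus score as the max cosine similarity between generated-token and image-token hidden states. The exact scoring function, schedule, and weighting forms are defined in the paper; the names and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def visual_focus_scores(gen_hidden, img_hidden):
    """Visual Focus Score sketch: cosine similarity of each generated-token
    hidden state to the image-token states, max-pooled over image tokens."""
    g = gen_hidden / (np.linalg.norm(gen_hidden, axis=-1, keepdims=True) + 1e-8)
    v = img_hidden / (np.linalg.norm(img_hidden, axis=-1, keepdims=True) + 1e-8)
    return (g @ v.T).max(axis=-1)                        # (T,)

def progressive_incentive(scores, gamma=0.5):
    """VAC sketch: weight later positions more heavily to counteract
    temporal visual forgetting in long reasoning chains."""
    t = np.arange(len(scores)) / max(len(scores) - 1, 1)
    return scores * (1.0 + gamma * t)

def intra_trajectory_reweight(advantages, focus, alpha=1.0):
    """Token level: amplify advantages on visually grounded tokens."""
    w = 1.0 + alpha * (focus - focus.mean())
    return advantages * w

def inter_trajectory_weight(focus_per_rollout, beta=1.0):
    """Trajectory level: favor rollouts with higher accumulated visual focus."""
    acc = np.array([f.mean() for f in focus_per_rollout])
    w = np.exp(beta * (acc - acc.max()))
    return w / w.sum()
```

The key design point is that all the weighting signals come from the model's own hidden states, so no external grounding supervision is needed.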
## 🔗 Model Sources


- **GitHub Repository:** [`VGPO`](https://github.com/wzb-bupt/VGPO)
- **arXiv Paper:** [`2604.09349`](https://arxiv.org/abs/2604.09349)


## 📚 Training Datasets

| Split | Dataset | Link |
|:------|:--------|:-----|
| Train | ViRL39K | [PAPOGalaxy/PAPO_ViRL39K_train](https://huggingface.co/datasets/PAPOGalaxy/PAPO_ViRL39K_train) |
| Val | MMK12 | [PAPOGalaxy/PAPO_MMK12_test](https://huggingface.co/datasets/PAPOGalaxy/PAPO_MMK12_test) |


## 📊 Evaluation

We follow the evaluation script of [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). All results are reported as **average accuracy** with inference temperature **0.0**.

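For reference, a tiny helper showing the aggregation we assume "average accuracy" refers to: per-benchmark exact-match accuracy, then an unweighted mean across benchmarks. This is a sketch of the convention, not the official evaluation script:

```python
def average_accuracy(results):
    """results maps benchmark name -> list of 0/1 correctness flags.
    Returns (per-benchmark accuracy, unweighted mean across benchmarks)."""
    per_bench = {name: sum(flags) / len(flags) for name, flags in results.items()}
    avg = sum(per_bench.values()) / len(per_bench)
    return per_bench, avg
```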
### Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|:--------------------|:-------------------------------------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperCLEVR Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

## ✏️ Citation


If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

```bibtex
@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning},
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}
```


## ❤️ Acknowledgements


Our codebase is built upon [EasyR1](https://github.com/hiyouga/EasyR1), [VPPO-RL](https://github.com/huaixuheqing/VPPO-RL), [PAPO](https://github.com/MikeWangWZHL/PAPO), and [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). We thank the authors for their excellent work.