---
license: mit
datasets:
- PAPOGalaxy/PAPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VGPO
- Reinforcement Learning
- Multimodal Reasoning
- Visual Attention Compensation
---

# Model Card for VGPO-RL-7B


## 📖 Overview of VGPO


Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to **signal dilution**: generic text tokens receive the same reinforcement as the critical, visually grounded reasoning steps. Meanwhile, **temporal visual forgetting** causes attention to visual inputs to decay progressively as reasoning chains extend.
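To make the dilution concrete, here is a minimal sketch of standard GRPO-style credit assignment, where one group-normalized scalar advantage is broadcast to every token of a rollout. This is illustrative only; the function name and shapes are ours, not from the VGPO codebase:

```python
import numpy as np

def broadcast_token_advantages(rewards, seq_len):
    """Standard RLVR/GRPO credit assignment: one group-normalized scalar
    advantage per rollout, copied to every generated token."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)
    # Every token in a rollout receives the identical signal, whether it is
    # a filler word or a visually grounded reasoning step.
    return np.repeat(adv[:, None], seq_len, axis=1)            # (G, T)
```

Because each row is constant, visually critical tokens receive no extra credit relative to generic text.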

VGPO addresses these issues through three key mechanisms:
- **Visual Attention Compensation (VAC):** Uses the inherent hidden-state similarity between generated tokens and image tokens as a *Visual Focus Score* to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
- **Intra-Trajectory Re-weighting:** At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually grounded tokens.
- **Inter-Trajectory Re-weighting:** At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.

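A rough sketch of how these three mechanisms could compose, taking the visual focus score as the max cosine similarity between generated-token and image-token hidden states. The exact scoring function, schedule, and weighting forms are defined in the paper; the names and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def visual_focus_scores(gen_hidden, img_hidden):
    """Visual Focus Score sketch: cosine similarity of each generated-token
    hidden state to the image-token states, max-pooled over image tokens."""
    g = gen_hidden / (np.linalg.norm(gen_hidden, axis=-1, keepdims=True) + 1e-8)
    v = img_hidden / (np.linalg.norm(img_hidden, axis=-1, keepdims=True) + 1e-8)
    return (g @ v.T).max(axis=-1)                        # (T,)

def progressive_incentive(scores, gamma=0.5):
    """VAC sketch: weight later positions more heavily to counteract
    temporal visual forgetting in long reasoning chains."""
    t = np.arange(len(scores)) / max(len(scores) - 1, 1)
    return scores * (1.0 + gamma * t)

def intra_trajectory_reweight(advantages, focus, alpha=1.0):
    """Token level: amplify advantages on visually grounded tokens."""
    w = 1.0 + alpha * (focus - focus.mean())
    return advantages * w

def inter_trajectory_weight(focus_per_rollout, beta=1.0):
    """Trajectory level: favor rollouts with higher accumulated visual focus."""
    acc = np.array([f.mean() for f in focus_per_rollout])
    w = np.exp(beta * (acc - acc.max()))
    return w / w.sum()
```

The key design point is that all the weighting signals come from the model's own hidden states, so no external grounding supervision is needed.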
## 🔗 Model Sources


- **GitHub Repository:** [`VGPO`](https://github.com/wzb-bupt/VGPO)
- **arXiv Paper:** [`2604.09349`](https://arxiv.org/abs/2604.09349)


## 📚 Training Datasets

| Split | Dataset | Link |
|:------|:--------|:-----|
| Train | ViRL39K | [PAPOGalaxy/PAPO_ViRL39K_train](https://huggingface.co/datasets/PAPOGalaxy/PAPO_ViRL39K_train) |
| Val | MMK12 | [PAPOGalaxy/PAPO_MMK12_test](https://huggingface.co/datasets/PAPOGalaxy/PAPO_MMK12_test) |


## 📊 Evaluation

We follow the evaluation script of [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). All results are reported as **average accuracy** with inference temperature **0.0**.

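For reference, a tiny helper showing the aggregation we assume "average accuracy" refers to: per-benchmark exact-match accuracy, then an unweighted mean across benchmarks. This is a sketch of the convention, not the official evaluation script:

```python
def average_accuracy(results):
    """results maps benchmark name -> list of 0/1 correctness flags.
    Returns (per-benchmark accuracy, unweighted mean across benchmarks)."""
    per_bench = {name: sum(flags) / len(flags) for name, flags in results.items()}
    avg = sum(per_bench.values()) / len(per_bench)
    return per_bench, avg
```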
### Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|:--------------------|:-------------------------------------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperCLEVR Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

## ✏️ Citation


If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

```bibtex
@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning},
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}
```


## ❤️ Acknowledgements


Our codebase is built upon [EasyR1](https://github.com/hiyouga/EasyR1), [VPPO-RL](https://github.com/huaixuheqing/VPPO-RL), [PAPO](https://github.com/MikeWangWZHL/PAPO), and [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). We thank the authors for their excellent work.