---
license: mit
datasets:
- PAPOGalaxy/PAPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VGPO
- Reinforcement learning
- Multimodal Reasoning
- Visual Attention Compensation
---

# Model Card for VGPO-RL-7B

## 🌟 Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to **signal dilution**: generic text tokens receive the same reinforcement as critical visually grounded reasoning steps. Meanwhile, **temporal visual forgetting** causes attention to visual inputs to progressively decay as reasoning chains extend.

VGPO addresses these issues through three key mechanisms:

- **Visual Attention Compensation (VAC):** Uses the inherent hidden-state similarity between generated tokens and image tokens as a *Visual Focus Score* to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
- **Intra-Trajectory Re-weighting:** At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually grounded tokens.
- **Inter-Trajectory Re-weighting:** At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
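Conceptually, the intra-trajectory step amounts to scaling each token's advantage by its visual focus relative to the trajectory mean. The sketch below is illustrative only: the function and variable names (`reweight_advantages`, `focus_scores`) are our own, not the repository's API, and the real method computes focus scores from hidden-state similarity inside the policy model.

```python
def reweight_advantages(advantages, focus_scores, eps=1e-6):
    """Token-level advantage re-weighting (illustrative sketch, not the official API).

    advantages:   per-token advantage values from the RL objective
    focus_scores: per-token visual focus scores, e.g. similarity between a
                  token's hidden state and the image tokens
    """
    mean_focus = sum(focus_scores) / len(focus_scores)
    # Scale each token's advantage by its focus relative to the trajectory
    # mean, so visually grounded tokens receive a stronger learning signal.
    weights = [f / (mean_focus + eps) for f in focus_scores]
    return [a * w for a, w in zip(advantages, weights)]
```

A sanity check on any such scheme: with uniform focus scores it should reduce to the unweighted baseline, while tokens with above-average focus should receive amplified advantages.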
## 🔗 Model Sources

- **GitHub Repository:** [`VGPO`](https://github.com/wzb-bupt/VGPO)
- **arXiv Paper:** [`2604.09349`](https://arxiv.org/abs/2604.09349)

## 📚 Training Datasets

| Split | Dataset | Link |
|:------|:--------|:-----|
| Train | ViRL39K | [PAPOGalaxy/PAPO_ViRL39K_train](https://huggingface.co/datasets/PAPOGalaxy/PAPO_ViRL39K_train) |
| Val | MMK12 | [PAPOGalaxy/PAPO_MMK12_test](https://huggingface.co/datasets/PAPOGalaxy/PAPO_MMK12_test) |

## 📊 Evaluation

We follow the evaluation script of [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). All results are reported as **average accuracy** with inference temperature **0.0** (greedy decoding).
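Since every benchmark is scored by accuracy, each reported number is the mean over that benchmark's questions; an overall average across benchmarks (if one is quoted) is the macro-average of those per-benchmark means. A minimal helper, with hypothetical result lists, might look like:

```python
def average_accuracy(correct_flags):
    """Mean accuracy over a list of per-question correctness flags (0 or 1)."""
    return sum(correct_flags) / len(correct_flags)

def macro_average(per_benchmark):
    """Average of per-benchmark accuracies, each benchmark weighted equally."""
    accs = [average_accuracy(flags) for flags in per_benchmark.values()]
    return sum(accs) / len(accs)
```

Note that a macro-average weights each benchmark equally regardless of its size, which is the usual convention when reporting across heterogeneous benchmark suites.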
### Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|:--------------------|:-------------------------------------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |
## ✏️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work:

```bibtex
@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning},
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}
```

## ❤️ Acknowledgements

Our codebase is built upon [EasyR1](https://github.com/hiyouga/EasyR1), [VPPO-RL](https://github.com/huaixuheqing/VPPO-RL), [PAPO](https://github.com/MikeWangWZHL/PAPO), and [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). We thank the authors for their excellent work.