---
license: mit
datasets:
- PAPOGalaxy/PAPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VGPO
- Reinforcement Learning
- Multimodal Reasoning
- Visual Attention Compensation
---


# Model Card for VGPO-RL-7B

## πŸ“– Overview of VGPO

  Standard Reinforcement Learning with Verifiable Rewards (RLVR) methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to **signal dilution**: generic text tokens receive the same reinforcement as critical visually-grounded reasoning steps. Meanwhile, **temporal visual forgetting** causes attention to visual inputs to progressively decay as reasoning chains extend.
  
  VGPO addresses these issues through three key mechanisms:
  - **Visual Attention Compensation (VAC):** Uses the inherent hidden-state similarity between generated tokens and image tokens as a *Visual Focus Score* to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  - **Intra-Trajectory Re-weighting:** At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  - **Inter-Trajectory Re-weighting:** At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
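
The token-level mechanisms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the mean-pooling of image-token hidden states, the cosine-similarity scoring, the linear progressive schedule, and the `alpha`/`beta` hyperparameters are all assumptions chosen to show the idea; VGPO's exact scoring and normalization may differ.

```python
import numpy as np

def visual_focus_scores(token_hidden, image_hidden):
    """Visual Focus Score (sketch): cosine similarity between each generated
    token's hidden state (T, d) and the mean-pooled image-token state (N, d)."""
    img = image_hidden.mean(axis=0)                                  # (d,)
    num = token_hidden @ img                                         # (T,)
    denom = np.linalg.norm(token_hidden, axis=1) * np.linalg.norm(img) + 1e-8
    return num / denom

def progressive_schedule(num_tokens, beta=0.5):
    """Progressive incentive (sketch): weight later tokens more, to counteract
    temporal visual forgetting in long reasoning chains."""
    return 1.0 + beta * np.arange(num_tokens) / max(num_tokens - 1, 1)

def reweight_advantages(advantages, focus, alpha=1.0):
    """Intra-trajectory re-weighting (sketch): scale each token's advantage by
    a focus-based weight, centered so the mean weight stays at 1 and the
    overall advantage scale is preserved."""
    weights = 1.0 + alpha * (focus - focus.mean())
    return advantages * weights
```

In this sketch a trajectory-level score could simply be `focus.mean()` (or a schedule-weighted sum), which inter-trajectory re-weighting would then use to favor rollouts that sustain visual grounding.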

## πŸ”— Model Sources

- **GitHub Repository:** [`VGPO`](https://github.com/wzb-bupt/VGPO)
- **ArXiv Paper:** [`2604.09349`](https://arxiv.org/abs/2604.09349)


## πŸ“• Training Datasets
  
  | Split | Dataset | Link |
  |:------|:--------|:-----|
  | Train | ViRL39K | [PAPOGalaxy/PAPO_ViRL39K_train](https://huggingface.co/datasets/PAPOGalaxy/PAPO_ViRL39K_train) |
  | Val   | MMK12   | [PAPOGalaxy/PAPO_MMK12_test](https://huggingface.co/datasets/PAPOGalaxy/PAPO_MMK12_test) |


## πŸ“Š Evaluation
  
  We follow the evaluation script of [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). All results are reported as **average accuracy** with an inference temperature of **0.0**.

  ### Supported Evaluation Benchmarks
  
  | Benchmark           | Focus Domain                               |
  |:--------------------|:-------------------------------------------|
  | MathVista           | General Mathematical & Geometric Reasoning |
  | MathVerse           | General Mathematical & Geometric Reasoning |
  | WeMath              | General Mathematical & Geometric Reasoning |
  | MMK12               | General Mathematical & Geometric Reasoning |
  | GeoMath             | General Mathematical & Geometric Reasoning |
  | Geometry3K          | General Mathematical & Geometric Reasoning |
  | LogicVista          | Vision-dependent Multimodal Reasoning      |
  | SuperClevr Counting | Vision-dependent Multimodal Reasoning      |
  | MMMU-Pro            | Vision-dependent Multimodal Reasoning      |
  | MathVerse-V         | Vision-dependent Multimodal Reasoning      |
  

## ✍️ Citation

  If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work πŸ“:
  
  ```bibtex
  @article{wang2026vgpo,
    title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
    author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
    journal={arXiv preprint arXiv:2604.09349},
    year={2026}
  }
  ```

## ❀️ Acknowledgements

  Our codebase is built upon [EasyR1](https://github.com/hiyouga/EasyR1), [VPPO-RL](https://github.com/huaixuheqing/VPPO-RL), [PAPO](https://github.com/MikeWangWZHL/PAPO), and [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). We thank the authors for their excellent work.