---
license: mit
datasets:
- PAPOGalaxy/PAPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- VGPO
- Reinforcement Learning
- Multimodal Reasoning
- Visual Attention Compensation
---

# Model Card for VGPO-RL-7B

## 📖 Overview of VGPO

Standard RLVR methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to **signal dilution**: generic text tokens receive the same reinforcement as critical visually-grounded reasoning steps. Meanwhile, **temporal visual forgetting** causes attention to visual inputs to decay progressively as reasoning chains extend.

VGPO addresses these issues through three key mechanisms:
- **Visual Attention Compensation (VAC):** Uses the inherent hidden-state similarity between generated tokens and image tokens as a *Visual Focus Score* to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
- **Intra-Trajectory Re-weighting:** At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
- **Inter-Trajectory Re-weighting:** At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.

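The three mechanisms can be sketched numerically. The snippet below is a minimal illustration, not the paper's exact formulation: all function names, the linear incentive schedule, and the softmax normalization are assumptions. It treats the visual focus score of each generated token as the mean cosine similarity between its hidden state and the image-token hidden states, boosts later positions with a progressive schedule, and re-scales token-level advantages accordingly:

```python
import numpy as np

def visual_focus_scores(token_h, image_h):
    """Mean cosine similarity of each generated token's hidden state to the
    image-token hidden states (an illustrative proxy for the Visual Focus Score)."""
    t = token_h / np.linalg.norm(token_h, axis=1, keepdims=True)
    v = image_h / np.linalg.norm(image_h, axis=1, keepdims=True)
    return (t @ v.T).mean(axis=1)  # shape: (num_tokens,)

def progressive_weights(n, alpha=0.5):
    """Linearly increasing incentive over token positions (assumed schedule),
    counteracting the decay of visual attention late in the reasoning chain."""
    return 1.0 + alpha * np.arange(n) / max(n - 1, 1)

def reweight_advantages(advantages, token_h, image_h, alpha=0.5):
    """Intra-trajectory re-weighting: scale each token's advantage by its
    schedule-boosted focus score, normalized to keep the mean weight at 1."""
    s = visual_focus_scores(token_h, image_h) * progressive_weights(len(advantages), alpha)
    w = np.exp(s - s.max())            # positive weights, numerically stable
    w = w / w.sum() * len(advantages)  # mean weight is exactly 1
    return advantages * w

def trajectory_visual_score(token_h, image_h):
    """Inter-trajectory signal: a rollout's accumulated visual grounding,
    here simply the mean focus score over its tokens."""
    return float(visual_focus_scores(token_h, image_h).mean())
```

Because the weights are normalized to mean 1, the re-weighting redistributes the learning signal toward visually-grounded tokens without changing its overall magnitude.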
## 🔗 Model Sources

- **GitHub Repository:** [`VGPO`](https://github.com/wzb-bupt/VGPO)
- **arXiv Paper:** [`2604.09349`](https://arxiv.org/abs/2604.09349)


## 📕 Training Datasets

| Split | Dataset | Link |
|:------|:--------|:-----|
| Train | ViRL39K | [PAPOGalaxy/PAPO_ViRL39K_train](https://huggingface.co/datasets/PAPOGalaxy/PAPO_ViRL39K_train) |
| Val | MMK12 | [PAPOGalaxy/PAPO_MMK12_test](https://huggingface.co/datasets/PAPOGalaxy/PAPO_MMK12_test) |


## 📊 Evaluation

We follow the evaluation script of [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). All results are reported as **average accuracy** with inference temperature **0.0**.

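Temperature 0.0 corresponds to deterministic greedy decoding: at each step the highest-probability token is taken, so repeated evaluation runs produce identical outputs. A minimal sketch of how a sampler typically degenerates to argmax at zero temperature (the function name is illustrative, not from the evaluation script):

```python
import numpy as np

def sample_token(logits, temperature=0.0, rng=None):
    """Pick the next token id. Temperature 0.0 degenerates to greedy
    argmax; positive temperatures sample from the softened distribution."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))  # deterministic: evaluation is reproducible
    z = logits / temperature
    p = np.exp(z - z.max())  # stable softmax
    p /= p.sum()
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(p), p=p))
```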
### Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|:--------------------|:-------------------------------------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperCLEVR Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |


## ✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

```bibtex
@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning},
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}
```

## ❤️ Acknowledgements

Our codebase is built upon [EasyR1](https://github.com/hiyouga/EasyR1), [VPPO-RL](https://github.com/huaixuheqing/VPPO-RL), [PAPO](https://github.com/MikeWangWZHL/PAPO), and [Look-Back](https://github.com/PKU-YuanGroup/Look-Back). We thank the authors for their excellent work.