metadata
license: mit
datasets:
  - PAPOGalaxy/PAPO_ViRL39K_train
base_model:
  - Qwen/Qwen2.5-VL-32B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - VGPO
  - Reinforcement learning
  - Multimodal Reasoning
  - Visual Attention Compensation

Model Card for VGPO-RL-32B

📖 Overview of VGPO

Standard RLVR (reinforcement learning with verifiable rewards) methods treat every generated token equally, broadcasting a single reward signal indiscriminately. This leads to signal dilution: generic text tokens receive the same reinforcement as critical, visually grounded reasoning steps. Meanwhile, temporal visual forgetting causes attention to visual inputs to decay progressively as reasoning chains grow longer.
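To make the dilution concrete, here is a minimal sketch of how one scalar reward per rollout ends up on every token. It assumes a GRPO-style group normalization, which this card does not specify; the function name `broadcast_advantages` and the toy rewards are illustrative only.

```python
from statistics import mean, pstdev


def broadcast_advantages(rewards, lengths, eps=1e-6):
    """Normalize one scalar reward per rollout within its group (assumed
    GRPO-style), then copy that single value to every token of the rollout."""
    mu, sigma = mean(rewards), pstdev(rewards)
    per_rollout = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a rollout receives the same advantage, whether it
    # describes the image or is generic connective text.
    return [[a] * n for a, n in zip(per_rollout, lengths)]


# Four rollouts for one prompt: two correct (reward 1), two wrong (reward 0).
advantages = broadcast_advantages(rewards=[1.0, 0.0, 1.0, 0.0],
                                  lengths=[3, 2, 4, 2])
```

Within each rollout the per-token values are identical, which is the uniform treatment that token-level re-weighting is meant to break.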

VGPO addresses these issues through three key mechanisms:

  • Visual Attention Compensation (VAC): Uses the inherent hidden-state similarity between generated tokens and image tokens as a Visual Focus Score to localize visual activations without external supervision. A progressive incentive schedule counteracts temporal visual forgetting in later reasoning steps.
  • Intra-Trajectory Re-weighting: At the token level, dynamically re-weights advantages by visual focus scores to amplify learning from visually-grounded tokens.
  • Inter-Trajectory Re-weighting: At the trajectory level, prioritizes rollouts with superior visual accumulation, favoring trajectories that sustain consistent visual grounding.
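The three mechanisms above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the released implementation: the card does not give exact formulas, so the choice of cosine similarity averaged over image tokens as the Visual Focus Score, the linear progressive schedule, and the mean-normalized weights are all assumed here, and every function name is hypothetical.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def visual_focus_scores(token_states, image_states):
    """Assumed score: mean cosine similarity of each generated token's
    hidden state to all image-token hidden states, clipped at zero."""
    return [max(mean([cosine(h, v) for v in image_states]), 0.0)
            for h in token_states]


def compensate(scores, max_bonus=0.5):
    """Assumed linear schedule: the boost grows from 1.0 at the first
    token to 1.0 + max_bonus at the last, countering late-step decay."""
    last = max(len(scores) - 1, 1)
    return [s * (1.0 + max_bonus * t / last) for t, s in enumerate(scores)]


def intra_reweight(advantage, scores):
    """Token level: scale one broadcast advantage by focus scores
    normalized to mean 1, so the total advantage mass is unchanged."""
    avg = mean(scores)
    if avg == 0:
        return [advantage] * len(scores)
    return [advantage * s / avg for s in scores]


def inter_weights(per_rollout_scores):
    """Trajectory level: weight each rollout by its mean focus score,
    normalized so the group's weights average to 1."""
    totals = [mean(s) for s in per_rollout_scores]
    avg = mean(totals)
    if avg == 0:
        return [1.0] * len(totals)
    return [t / avg for t in totals]


# Toy 2-d hidden states (real models use thousands of dimensions).
image_states = [[1.0, 0.0], [0.8, 0.6]]
rollout_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # mostly image-aligned
rollout_b = [[0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]]  # drifts away from the image

scores_a = compensate(visual_focus_scores(rollout_a, image_states))
scores_b = compensate(visual_focus_scores(rollout_b, image_states))

token_advantages = intra_reweight(1.0, scores_a)       # per-token, rollout A
rollout_weights = inter_weights([scores_a, scores_b])  # per-rollout
```

In this toy setup, tokens whose hidden states align with the image receive a larger share of rollout A's advantage, and rollout A as a whole is weighted above rollout B. Normalizing to mean 1 is a design choice made for the sketch so that re-weighting redistributes the learning signal rather than rescaling it.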

🔗 Model Sources

📕 Training Datasets

| Split | Dataset | Link |
|-------|---------|------|
| Train | ViRL39K | PAPOGalaxy/PAPO_ViRL39K_train |
| Val | MMK12 | PAPOGalaxy/PAPO_MMK12_test |

📊 Evaluation

We follow the evaluation script of Look-Back. All results are reported as average accuracy with inference temperature 0.0.
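A temperature of 0.0 is usually interpreted as greedy (argmax) decoding, since dividing logits by zero is undefined. The sketch below illustrates that convention with a hypothetical helper, `next_token_distribution`; it is not taken from the Look-Back script, whose exact decoding settings are not reproduced here.

```python
import math


def next_token_distribution(logits, temperature):
    """Return next-token probabilities for the given logits.
    Temperature 0.0 is treated as the greedy limit: all mass on the argmax."""
    if temperature == 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
greedy = next_token_distribution(logits, 0.0)   # one-hot on the top logit
sampled = next_token_distribution(logits, 1.0)  # ordinary softmax
```

Lower temperatures concentrate probability on the top-scoring token, and at 0.0 the choice no longer depends on random sampling.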

Supported Evaluation Benchmarks

| Benchmark | Focus Domain |
|-----------|--------------|
| MathVista | General Mathematical & Geometric Reasoning |
| MathVerse | General Mathematical & Geometric Reasoning |
| WeMath | General Mathematical & Geometric Reasoning |
| MMK12 | General Mathematical & Geometric Reasoning |
| GeoMath | General Mathematical & Geometric Reasoning |
| Geometry3K | General Mathematical & Geometric Reasoning |
| LogicVista | Vision-dependent Multimodal Reasoning |
| SuperClevr Counting | Vision-dependent Multimodal Reasoning |
| MMMU-Pro | Vision-dependent Multimodal Reasoning |
| MathVerse-V | Vision-dependent Multimodal Reasoning |

✍️ Citation

If you find this codebase useful in your research, please consider giving us a star ⭐ and citing our work 📝:

@article{wang2026vgpo,
  title={Visually-Guided Policy Optimization for Multimodal Reasoning}, 
  author={Zengbin Wang and Feng Xiong and Liang Lin and Xuecai Hu and Yong Wang and Yanlin Wang and Man Zhang and Xiangxiang Chu},
  journal={arXiv preprint arXiv:2604.09349},
  year={2026}
}

❤️ Acknowledgements

Our codebase is built upon EasyR1, VPPO-RL, PAPO, and Look-Back. We thank the authors for their excellent work.