| | --- |
| | license: mit |
| | tags: |
| | - multimodal |
| | - visual-reasoning |
| | - mathematics |
| | - logic |
| | - qwen |
| | - vppo |
| | library_name: transformers |
| | pipeline_tag: image-text-to-text |
| | datasets: |
| | - chamber111/VPPO_ViRL39K_train |
| | base_model: |
| | - Qwen/Qwen2.5-VL-7B-Instruct |
| | --- |
| | |
| | # Model Card for VPPO-7B |
| |
|
| | ## Model Details |
| |
|
| | ### Model Description |
| |
|
| | **VPPO-7B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B parameter version of our model, fine-tuned from `Qwen2.5-VL-7B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**. |
| |
|
| | The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability. |
| |
|
| | As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence. |
| |
|
| | - **Model type:** Large Vision-Language Model (LVLM) |
| | - **Finetuned from model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |
| |
|
| | ### Model Sources |
| |
|
| | - **Repository:** [`VPPO-RL`](https://github.com/huaixuheqing/VPPO-RL) |
| | - **Paper:** [`2510.09285`](https://arxiv.org/abs/2510.09285) |
| |
|
| | ## Training Details |
| |
|
| | ### Training Data |
| |
|
| | The model was fine-tuned on [**ViRL39K**](https://huggingface.co/datasets/chamber111/VPPO_ViRL39K_train), a diverse collection of multimodal reasoning problems. The original dataset can be found on the Hugging Face Hub: [`TIGER-Lab/ViRL39K`](https://huggingface.co/datasets/TIGER-Lab/ViRL39K). |
| |
|
| | ### Training Procedure |
| |
|
| | The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm, which is a modification of the Group Relative Policy Optimization (GRPO) framework. The procedure involves generating responses, calculating token-level visual dependency, and using this dependency to shape the advantage and filter gradients during the policy update step. |
| |
|
| | #### Training Hyperparameters |
| |
|
| | - **Base Model:** Qwen2.5-VL-7B-Instruct |
| | - **Algorithm:** VPPO |
| | - **Epochs:** 2 |
| | - **Learning Rate:** 1e-6 |
| | - **Rollout Batch Size:** 384 |
| | - **Max Response Length:** 2048 |
| | - **Entropy Penalty Coefficient:** 0.06 |
| | - **Gradient Filtering Ratio (k):** 0.4 |
| | - **Advantage Shaping Min (β_min):** 0.9 |
| | - **Training Regime:** bf16 mixed precision |
| | |
| | ## Evaluation |
| | |
| | ### Testing Data, Factors & Metrics |
| | |
| | #### Testing Data |
| | |
| | The model was evaluated on a comprehensive suite of 8 diverse multimodal reasoning benchmarks: |
| | - **Math & Geometry:** Geo3k, We-Math, MathVerse, MathVision, DynaMath, MMK12 |
| | - **Logic:** LogicVista |
| | - **Multi-discipline:** MMMU-Pro |
| | |
| | #### Metrics |
| | |
| | Performance is measured by **average accuracy@8**, which is the average success rate over 8 independent generations per problem (at temperature=1.0) using exact-match scoring. |
| | |
| | ## Citation |
| | |
| | If you use this model in your work, please cite our paper: |
| | |
| | **BibTeX:** |
| | |
| | ```bibtex |
| | @article{huang2025spotlight, |
| | title={Spotlight on Token Perception for Multimodal Reinforcement Learning}, |
| | author={Huang, Siyuan and Qu, Xiaoye and Li, Yafu and Luo, Yun and He, Zefeng and Liu, Daizong and Cheng, Yu}, |
| | journal={arXiv preprint arXiv:2510.09285}, |
| | year={2025} |
| | } |
| | ``` |