pipeline_tag: image-text-to-text
datasets:
- chamber111/VPPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Model Card for VPPO-7B

### Model Description

**VPPO-7B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B-parameter version of our model, fine-tuned from `Qwen2.5-VL-7B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability.
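
To make this concrete, here is a minimal sketch of one way such token-level visual dependency could be scored: comparing the policy's next-token distributions with and without the image. The function name and the KL-based measure are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def visual_dependency_scores(logits_with_image: torch.Tensor,
                             logits_without_image: torch.Tensor) -> torch.Tensor:
    """Per-token KL(p_with_image || p_without_image) over the vocabulary.

    Tokens whose next-token distribution barely moves when the image is
    withheld (connectives, boilerplate) score near zero; tokens that truly
    depend on perception score high and become the "spotlight" targets.
    Shapes: (batch, seq_len, vocab_size) -> (batch, seq_len).
    """
    log_p = F.log_softmax(logits_with_image, dim=-1)
    log_q = F.log_softmax(logits_without_image, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)
```

Policy updates are then concentrated on high-scoring tokens rather than spread uniformly across the reasoning chain.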

As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.

- **Model type:** Large Vision-Language Model (LVLM)
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
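
Since VPPO-7B keeps the standard Qwen2.5-VL architecture and chat interface, the usual Qwen2.5-VL inference recipe should work. A minimal sketch follows; the repository id `chamber111/VPPO-7B` is an assumption inferred from the dataset namespace, so substitute the actual checkpoint path.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "chamber111/VPPO-7B"  # hypothetical repo id; replace with the released checkpoint

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/geometry_problem.png"},
        {"type": "text", "text": "Solve the problem shown in the image step by step."},
    ],
}]

# Build the prompt and preprocess the vision inputs, then generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)],  # strip the prompt
    skip_special_tokens=True,
)[0]
print(answer)
```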

### Model Sources

The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm.

#### Training Hyperparameters

- **Base Model:** Qwen2.5-VL-7B-Instruct
- **Algorithm:** VPPO
- **Epochs:** 2
- **Learning Rate:** 1e-6
- **Rollout Batch Size:** 384
- **Max Response Length:** 2048
- **Entropy Penalty Coefficient:** 0.06
- **Gradient Filtering Ratio (k):** 0.4
- **Advantage Shaping Min (β_min):** 0.9
- **Training Regime:** bf16 mixed precision
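
As a rough illustration of how the last two knobs might interact, here is a toy reading of the list above (not the released training code): β_min sets the floor of a per-token advantage multiplier derived from the visual-dependency score, while k keeps gradients only for the top 40% most visually dependent tokens. The function name and the per-sequence normalization are assumptions.

```python
import torch

def shape_and_filter(advantages: torch.Tensor, scores: torch.Tensor,
                     k: float = 0.4, beta_min: float = 0.9) -> torch.Tensor:
    """Toy combination of advantage shaping and gradient filtering.

    `scores` are per-token visual-dependency scores, shape (batch, seq_len).
    Shaping: scale each token's advantage by beta in [beta_min, 1],
    larger for more visually dependent tokens.
    Filtering: zero out the bottom (1 - k) fraction of tokens by score.
    """
    lo = scores.amin(dim=-1, keepdim=True)
    hi = scores.amax(dim=-1, keepdim=True)
    s = (scores - lo) / (hi - lo + 1e-8)             # normalize to [0, 1]
    beta = beta_min + (1.0 - beta_min) * s           # advantage-shaping factor
    thresh = torch.quantile(scores, 1.0 - k, dim=-1, keepdim=True)
    keep = (scores >= thresh).to(advantages.dtype)   # gradient-filtering mask
    return advantages * beta * keep
```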