pipeline_tag: image-text-to-text
datasets:
- chamber111/VPPO_ViRL39K_train
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Model Card for VPPO-7B

### Model Description

**VPPO-7B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B-parameter version of our model, fine-tuned from `Qwen2.5-VL-7B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.

The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability.
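
To make this concrete, here is a minimal sketch of one way such token-level visual dependency could be scored: comparing the policy's next-token distributions with and without the image. The function name and the KL-based measure are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def visual_dependency_scores(logits_with_image: torch.Tensor,
                             logits_without_image: torch.Tensor) -> torch.Tensor:
    """Per-token KL(p_with_image || p_without_image) over the vocabulary.

    Tokens whose next-token distribution barely moves when the image is
    withheld (connectives, boilerplate) score near zero; tokens that truly
    depend on perception score high and become the "spotlight" targets.
    Shapes: (batch, seq_len, vocab_size) -> (batch, seq_len).
    """
    log_p = F.log_softmax(logits_with_image, dim=-1)
    log_q = F.log_softmax(logits_without_image, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)
```

Policy updates are then concentrated on high-scoring tokens rather than spread uniformly across the reasoning chain.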

As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.

- **Model type:** Large Vision-Language Model (LVLM)
- **Finetuned from model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
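
Since VPPO-7B keeps the standard Qwen2.5-VL architecture and chat interface, the usual Qwen2.5-VL inference recipe should work. A minimal sketch follows; the repository id `chamber111/VPPO-7B` is an assumption inferred from the dataset namespace, so substitute the actual checkpoint path.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

MODEL_ID = "chamber111/VPPO-7B"  # hypothetical repo id; replace with the released checkpoint

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/geometry_problem.png"},
        {"type": "text", "text": "Solve the problem shown in the image step by step."},
    ],
}]

# Build the prompt and preprocess the vision inputs, then generate.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
answer = processor.batch_decode(
    [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)],  # strip the prompt
    skip_special_tokens=True,
)[0]
print(answer)
```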

### Model Sources

The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)** algorithm.

#### Training Hyperparameters

- **Base Model:** Qwen2.5-VL-7B-Instruct
- **Algorithm:** VPPO
- **Epochs:** 2
- **Learning Rate:** 1e-6
- **Rollout Batch Size:** 384
- **Max Response Length:** 2048
- **Entropy Penalty Coefficient:** 0.06
- **Gradient Filtering Ratio (k):** 0.4
- **Advantage Shaping Min (β_min):** 0.9
- **Training Regime:** bf16 mixed precision
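
As a rough illustration of how the last two knobs might interact, here is a toy reading of the list above (not the released training code): β_min sets the floor of a per-token advantage multiplier derived from the visual-dependency score, while k keeps gradients only for the top 40% most visually dependent tokens. The function name and the per-sequence normalization are assumptions.

```python
import torch

def shape_and_filter(advantages: torch.Tensor, scores: torch.Tensor,
                     k: float = 0.4, beta_min: float = 0.9) -> torch.Tensor:
    """Toy combination of advantage shaping and gradient filtering.

    `scores` are per-token visual-dependency scores, shape (batch, seq_len).
    Shaping: scale each token's advantage by beta in [beta_min, 1],
    larger for more visually dependent tokens.
    Filtering: zero out the bottom (1 - k) fraction of tokens by score.
    """
    lo = scores.amin(dim=-1, keepdim=True)
    hi = scores.amax(dim=-1, keepdim=True)
    s = (scores - lo) / (hi - lo + 1e-8)             # normalize to [0, 1]
    beta = beta_min + (1.0 - beta_min) * s           # advantage-shaping factor
    thresh = torch.quantile(scores, 1.0 - k, dim=-1, keepdim=True)
    keep = (scores >= thresh).to(advantages.dtype)   # gradient-filtering mask
    return advantages * beta * keep
```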