chamber111 committed
Commit ccbdf0a · verified · 1 Parent(s): 4cde9cd

Update README.md

Files changed (1):
  1. README.md +7 -7
README.md CHANGED
@@ -12,7 +12,7 @@ pipeline_tag: image-text-to-text
 datasets:
 - chamber111/VPPO_ViRL39K_train
 base_model:
-- Qwen/Qwen3-VL-8B-Instruct
+- Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
 # Model Card for VPPO-7B
@@ -21,14 +21,14 @@ base_model:
 
 ### Model Description
 
-**VPPO-8B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 8B parameter version of our model, fine-tuned from `Qwen3-VL-8B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.
+**VPPO-7B** is a state-of-the-art Large Vision-Language Model (LVLM) specialized for complex multimodal reasoning tasks. It is the 7B parameter version of our model, fine-tuned from `Qwen2.5-VL-7B-Instruct` using a novel reinforcement learning algorithm called **Visually-Perceptive Policy Optimization (VPPO)**.
 
 The core innovation of VPPO is its ability to solve the "uniform learning signal" problem that plagues standard RL fine-tuning. Instead of broadcasting a single reward to all tokens in a reasoning chain, VPPO intelligently identifies and focuses policy updates on the sparse, critical tokens that are highly dependent on visual input. This hierarchical "spotlight" mechanism allows the model to develop a more robust and genuine perception-grounded reasoning capability.
 
-As a result, VPPO-8B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.
+As a result, VPPO-7B demonstrates significant performance improvements over strong baselines across a wide range of challenging benchmarks, including mathematics, geometry, and logic problems. It also exhibits superior training stability and faster convergence.
 
 - **Model type:** Large Vision-Language Model (LVLM)
-- **Finetuned from model:** [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+- **Finetuned from model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
 
 ### Model Sources
 
@@ -47,13 +47,13 @@ The model was trained using our **Visually-Perceptive Policy Optimization (VPPO)
 
 #### Training Hyperparameters
 
-- **Base Model:** Qwen3-VL-8B-Instruct
+- **Base Model:** Qwen2.5-VL-7B-Instruct
 - **Algorithm:** VPPO
-- **Steps:** 150
+- **Epochs:** 2
 - **Learning Rate:** 1e-6
 - **Rollout Batch Size:** 384
 - **Max Response Length:** 2048
-- **Entropy Penalty Coefficient:** 0.12 for steps 0-130; 0.18 for steps 131-150
+- **Entropy Penalty Coefficient:** 0.06
 - **Gradient Filtering Ratio (k):** 0.4
 - **Advantage Shaping Min (β_min):** 0.9
 - **Training Regime:** bf16 mixed precision
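
The card describes the "spotlight" mechanism only at a high level, so the snippet below is a minimal sketch of one way such a token-selective update could look, not the authors' implementation. The function name, the image-ablation proxy for visual dependence, and the floor-style damping are all assumptions; only the gradient filtering ratio k = 0.4 and the advantage shaping floor β_min = 0.9 are taken from the hyperparameter list in the diff above.

```python
import torch

def vppo_token_advantages(
    logp_with_image: torch.Tensor,     # (T,) log-probs of sampled tokens, image present
    logp_without_image: torch.Tensor,  # (T,) log-probs of the same tokens, image ablated
    advantage: float,                  # sequence-level advantage from the outcome reward
    k: float = 0.4,                    # gradient filtering ratio (from the card)
    beta_min: float = 0.9,             # advantage shaping floor (from the card)
) -> torch.Tensor:
    """Redistribute a sequence-level advantage onto visually dependent tokens.

    Hypothetical sketch: visual dependence is proxied by how far a token's
    log-probability moves when the image is removed from the context; only the
    top-k fraction of tokens receive the full advantage, the rest are damped.
    """
    # Proxy score: a large shift means the token leaned on the image.
    dependence = (logp_with_image - logp_without_image).abs()

    # Gradient filtering: keep the top-k fraction of image-dependent tokens.
    num_keep = max(1, int(k * dependence.numel()))
    threshold = dependence.topk(num_keep).values.min()
    spotlight = dependence >= threshold

    # Advantage shaping: full weight inside the spotlight, a floor of
    # beta_min outside it, so no token's gradient is silenced entirely.
    weights = torch.where(
        spotlight,
        torch.ones_like(dependence),
        torch.full_like(dependence, beta_min),
    )
    return advantage * weights


# Toy usage: 6 tokens; the 3rd and 5th shift sharply when the image is ablated,
# so they receive the full advantage while the rest are damped to 0.9.
lp_img = torch.tensor([-0.1, -0.2, -0.3, -0.1, -0.4, -0.2])
lp_blind = torch.tensor([-0.1, -0.2, -2.5, -0.1, -3.0, -0.2])
print(vppo_token_advantages(lp_img, lp_blind, advantage=1.0))
```

The shape of the update is the point of the sketch: the sequence-level advantage is redistributed per token rather than broadcast uniformly, so every token still receives a gradient but the visually dependent ones dominate.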
 
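For completeness, the hyperparameters in the new revision collapse into a small config object. This is a hedged sketch: `VPPOTrainConfig` and its field names are hypothetical, and only the values are copied from the card.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VPPOTrainConfig:
    # Field names are illustrative; the values come from the card's
    # Training Hyperparameters list after this commit.
    base_model: str = "Qwen/Qwen2.5-VL-7B-Instruct"
    algorithm: str = "VPPO"
    epochs: int = 2
    learning_rate: float = 1e-6
    rollout_batch_size: int = 384
    max_response_length: int = 2048
    entropy_penalty_coef: float = 0.06
    gradient_filtering_ratio_k: float = 0.4
    advantage_shaping_beta_min: float = 0.9
    precision: str = "bf16"
```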