Upload README.md with huggingface_hub

README.md (changed)

fVLM supports three forward modes with different speed/quality tradeoffs:
**Key observations:**

- Scaling from 135M to 1.7B yields significant gains across all benchmarks, especially on ScienceQA (+12.6 points absolute).
- `coarse_only` is the strongest mode at 1.7B scale, suggesting the static query already captures most of the relevant information.
- At 135M scale, the `coarse_fine` foveation mechanism provides more benefit (e.g., +3.3 on Video-MME), consistent with smaller models needing the iterative refinement more.
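The contrast between `coarse_only` and `coarse_fine` can be sketched conceptually: a single static query pools the visual patch features once, or the pooled output is fed back as the next query to re-attend over the patches. This is an illustrative NumPy toy, not the repository's implementation; the names `attend`, `foveate`, and `refine_steps` are made up for the sketch.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(query, feats):
    """One query does scaled dot-product attention over the patch features."""
    scores = feats @ query / np.sqrt(feats.shape[-1])
    return softmax(scores) @ feats

def foveate(feats, query, mode="coarse_only", refine_steps=1):
    """Toy contrast of the two modes: coarse_only pools the features once
    with a static query; coarse_fine feeds the pooled vector back in as the
    next query, re-attending on the patches it found most relevant."""
    pooled = attend(query, feats)
    if mode == "coarse_fine":
        for _ in range(refine_steps):
            pooled = attend(pooled, feats)
    return pooled

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 32))  # 64 visual patches, feature dim 32
query = rng.normal(size=32)        # stand-in for a learned static query
coarse = foveate(feats, query, "coarse_only")
fine = foveate(feats, query, "coarse_fine")
```

Both modes return one pooled vector per query; `coarse_fine` just spends extra attention passes refining where that query looks, which matches the observation that the refinement helps the smaller model more.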
### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
### fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |
**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
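As a quick sanity check, the quoted coarse-only gains can be recomputed directly from the two tables:

```python
# Recompute the coarse-only scaling gains from the two benchmark tables.
coarse_1_7b = {"MVBench": 30.8, "Video-MME": 30.5, "ScienceQA": 49.0}
coarse_135m = {"MVBench": 27.4, "Video-MME": 26.2, "ScienceQA": 36.4}
gains_pp = {k: round(coarse_1_7b[k] - coarse_135m[k], 1) for k in coarse_1_7b}
print(gains_pp)  # {'MVBench': 3.4, 'Video-MME': 4.3, 'ScienceQA': 12.6}
```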
## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
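The DPO stage above refers to the standard Direct Preference Optimization objective (Rafailov et al.). Here is a minimal NumPy sketch of that loss, not the repository's training code; the function name and the `beta` default are illustrative.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Arguments are summed log-probabilities of the chosen/rejected responses
    under the policy being trained and under the frozen reference model.
    """
    # Implicit reward margin between chosen and rejected responses.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(logits)), computed stably as log1p(exp(-logits)).
    return float(np.mean(np.log1p(np.exp(-logits))))

# A policy that already prefers the chosen response more strongly than
# the reference model does gets a loss below log(2) ~= 0.693.
loss = dpo_loss(np.array([-5.0]), np.array([-9.0]),
                np.array([-6.0]), np.array([-6.0]))
```

Minimizing this pushes the policy to assign relatively more probability to the preferred response than the reference model does, without needing an explicit reward model, which is what makes the stage feasible on a single GPU.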