Upload README.md with huggingface_hub

README.md (changed)

fVLM supports three forward modes with different speed/quality tradeoffs:
**Key observations:**

- Scaling from 135M to 1.7B yields significant gains across all benchmarks, especially on ScienceQA (+12.6 points absolute).
- `coarse_only` is the strongest mode at 1.7B scale, suggesting the static query already captures most of the relevant information.
- At 135M scale, the `coarse_fine` foveation mechanism provides more benefit (e.g., +3.3 on Video-MME), consistent with smaller models needing the iterative refinement more.
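The contrast between `coarse_only` and `coarse_fine` can be sketched conceptually: a single static query pools the visual patch features once, or the pooled output is fed back as the next query to re-attend over the patches. This is an illustrative NumPy toy, not the repository's implementation; the names `attend`, `foveate`, and `refine_steps` are made up for the sketch.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attend(query, feats):
    """One query does scaled dot-product attention over the patch features."""
    scores = feats @ query / np.sqrt(feats.shape[-1])
    return softmax(scores) @ feats

def foveate(feats, query, mode="coarse_only", refine_steps=1):
    """Toy contrast of the two modes: coarse_only pools the features once
    with a static query; coarse_fine feeds the pooled vector back in as the
    next query, re-attending on the patches it found most relevant."""
    pooled = attend(query, feats)
    if mode == "coarse_fine":
        for _ in range(refine_steps):
            pooled = attend(pooled, feats)
    return pooled

rng = np.random.default_rng(0)
feats = rng.normal(size=(64, 32))  # 64 visual patches, feature dim 32
query = rng.normal(size=32)        # stand-in for a learned static query
coarse = foveate(feats, query, "coarse_only")
fine = foveate(feats, query, "coarse_fine")
```

Both modes return one pooled vector per query; `coarse_fine` just spends extra attention passes refining where that query looks, which matches the observation that the refinement helps the smaller model more.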
### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
### fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |
**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
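As a quick sanity check, the quoted coarse-only gains can be recomputed directly from the two tables:

```python
# Recompute the coarse-only scaling gains from the two benchmark tables.
coarse_1_7b = {"MVBench": 30.8, "Video-MME": 30.5, "ScienceQA": 49.0}
coarse_135m = {"MVBench": 27.4, "Video-MME": 26.2, "ScienceQA": 36.4}
gains_pp = {k: round(coarse_1_7b[k] - coarse_135m[k], 1) for k in coarse_1_7b}
print(gains_pp)  # {'MVBench': 3.4, 'Video-MME': 4.3, 'ScienceQA': 12.6}
```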
## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
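The DPO stage above refers to the standard Direct Preference Optimization objective (Rafailov et al.). Here is a minimal NumPy sketch of that loss, not the repository's training code; the function name and the `beta` default are illustrative.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Arguments are summed log-probabilities of the chosen/rejected responses
    under the policy being trained and under the frozen reference model.
    """
    # Implicit reward margin between chosen and rejected responses.
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(logits)), computed stably as log1p(exp(-logits)).
    return float(np.mean(np.log1p(np.exp(-logits))))

# A policy that already prefers the chosen response more strongly than
# the reference model does gets a loss below log(2) ~= 0.693.
loss = dpo_loss(np.array([-5.0]), np.array([-9.0]),
                np.array([-6.0]), np.array([-6.0]))
```

Minimizing this pushes the policy to assign relatively more probability to the preferred response than the reference model does, without needing an explicit reward model, which is what makes the stage feasible on a single GPU.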