sanps committed
Commit 77b40f5 · verified · 1 Parent(s): 4c4a155

Upload README.md with huggingface_hub

Files changed (1): README.md (+11 −15)
README.md CHANGED
@@ -92,25 +92,21 @@ fVLM supports three forward modes with different speed/quality tradeoffs:
 
 ### fVLM-1.7B (Stage 3 DPO)
 
-| Benchmark | Samples | Coarse-Only | Coarse-Fine | Autoregressive |
-|-----------|---------|-------------|-------------|----------------|
-| **MVBench** | 3,800 | **30.8%** | 29.9% | 29.9% |
-| **Video-MME** | 2,700 | **30.5%** | 28.2% | 30.4% |
-| **ScienceQA** | 2,017 | **49.0%** | 43.8% | 46.6% |
+| Benchmark | Coarse-Only | CoarseFine | Autoregressive |
+|-----------|-------------|-------------|----------------|
+| MVBench (3800) | 30.8% | 29.9% | 29.9% |
+| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
+| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
 
-### fVLM-135M (Stage 3 DPO) -- for comparison
+### fVLM-135M (Stage 3 DPO) for comparison
 
-| Benchmark | Coarse-Only | Coarse-Fine | Autoregressive |
+| Benchmark | Coarse-Only | CoarseFine | Autoregressive |
 |-----------|-------------|-------------|----------------|
-| **MVBench** | 27.4% | 28.0% | 27.9% |
-| **Video-MME** | 26.2% | **29.5%** | 28.7% |
-| **ScienceQA** | **36.4%** | 35.6% | 35.4% |
-
-**Key observations:**
-- Scaling from 135M to 1.7B yields significant gains across all benchmarks, especially on ScienceQA (+12.6 points absolute).
-- `coarse_only` is the strongest mode at 1.7B scale, suggesting the static query already captures most relevant information.
-- At 135M scale, the `coarse_fine` foveation mechanism provides more benefit (e.g., +3.3 on Video-MME), consistent with smaller models needing the iterative refinement more.
+| MVBench | 27.4% | 28.0% | 27.9% |
+| Video-MME | 26.2% | 29.5% | 28.7% |
+| ScienceQA | 36.4% | 35.6% | 35.4% |
 
+**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
 ## Training
 
 Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
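The scaling-gain line added in this revision is per-benchmark arithmetic over the coarse-only columns of the two tables. A quick sanity check (scores copied from the tables; variable names are our own, not from the fVLM codebase):

```python
# Coarse-only accuracy (%) from the fVLM-1.7B and fVLM-135M tables.
acc_1p7b = {"MVBench": 30.8, "Video-MME": 30.5, "ScienceQA": 49.0}
acc_135m = {"MVBench": 27.4, "Video-MME": 26.2, "ScienceQA": 36.4}

# Scaling gain in percentage points (pp), rounded to one decimal.
gain = {b: round(acc_1p7b[b] - acc_135m[b], 1) for b in acc_1p7b}
print(gain)  # {'MVBench': 3.4, 'Video-MME': 4.3, 'ScienceQA': 12.6}
```

This reproduces the +3.4pp / +4.3pp / +12.6pp figures stated in the diff.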