sanps committed on
Commit 6164fab · verified · 1 Parent(s): 27c3839

docs: update benchmarks — add SmolVLM2 comparison, remove Val10K, split video/image

Files changed (1)
  1. README.md +18 -38
README.md CHANGED
@@ -19,29 +19,32 @@ A compact vision-language model that uses **foveated attention** to compress eac
 
 ## Benchmark Results
 
-### Summary
+### Video Benchmarks
 
-| Benchmark | Metric | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
-|-----------|--------|-----------|---------------|---------------|---------------|
-| **MVBench** | Accuracy (3800 MCQ) | 28.0% | 32.7% | 40.0% | 47.0% |
-| **Video-MME** | Accuracy (2700 MCQ) | 29.5% | 33.7% | 42.5% | 52.2% |
-| **ScienceQA** | Accuracy (2017 MCQ) | 36.4% | — | — | — |
-| **POPE** | Accuracy (9000 Y/N) | 50.0% | — | — | — |
-| **Val 10K** | Loss ↓ (1000 samples) | 1.531 | — | — | — |
+| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
+|-----------|:---------:|:------------:|:------------:|:------------:|
+| **MVBench** (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
+| **Video-MME** (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |
 
-> **Note**: fVLM-135M uses **1 visual token per frame** vs SmolVLM2's 64–256 tokens per image.
-> Despite being 2× smaller than SmolVLM2-256M, fVLM-135M scores within 4–5% on video benchmarks — demonstrating the efficiency of foveated attention.
+### Image Benchmarks
+
+| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
+|-----------|:---------:|:------------:|:------------:|:------------:|
+| **ScienceQA** (2017 MCQ) | 36.4% | 73.8% | 80.0% | 89.6% |
+| **POPE** (9000 Y/N) | 50.0%* | — | — | — |
+
+\* POPE at 50% = random baseline. The 135M model always predicts one class. Not reported by SmolVLM2.
+
+> **Key context**: fVLM-135M uses **1 visual token per frame** vs SmolVLM2's 64–256 tokens per image. fVLM-135M has 158M params total — 1.6× smaller than SmolVLM2-256M. The gap on video benchmarks (4–5%) is modest given the extreme compression.
 
 ### Results by Inference Mode
 
 fVLM supports three inference modes with different speed/quality tradeoffs:
 
 | Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
-|-----------|------------|-------------|----------------|
-| Val 10K (loss ↓) | 1.879 | 1.533 | **1.531** |
+|-----------|:----------:|:-----------:|:--------------:|
 | MVBench | 27.4% | **28.0%** | 27.9% |
 | Video-MME | 26.2% | **29.5%** | 28.7% |
-| POPE | 50.0% | 50.0% | 50.0% |
 | ScienceQA | **36.4%** | 35.6% | 35.4% |
 
 - **Coarse-Only**: Single static-query pass (fastest, no foveation)
@@ -51,9 +54,8 @@ fVLM supports three inference modes with different speed/quality tradeoffs:
 ### Analysis
 
 - **Foveation helps on video**: coarse→fine adds +3.3% on Video-MME over coarse-only, confirming that learned "where to look" queries improve video understanding
-- **MVBench**: +3% above random baseline (25%), modest but expected for 135M params
-- **POPE**: At random baseline model consistently predicts one class (expected at this scale)
-- **ScienceQA**: Best at 36.4% with coarse-only — static images don't benefit from foveation
+- **ScienceQA**: Best at 36.4% with coarse-only — static images don't benefit from temporal foveation
+- **Scale gap**: The large gap on ScienceQA (36% vs 74%) shows the 135M backbone limits image reasoning. Video benchmarks are closer because foveated compression is highly efficient for temporal tasks
 
 ## Architecture
 
@@ -95,28 +97,6 @@ This enables processing **64+ frames** with the same memory as a few frames in t
 - **Loss**: DPO with beta=0.1
 - **LR**: 1e-6 all components
 
-## Training Performance
-
-Optimized for A100 80GB with coarse-pass optimization (skip text in coarse LLM — causal attention makes it mathematically equivalent):
-
-| Config | Throughput | Memory |
-|--------|-----------|--------|
-| 135M, bs=32 | ~30 samp/s | 8 GB |
-| 1.7B, bs=32, grad_ckpt | 15.7 samp/s | 26.5 GB |
-
-## Model Components
-
-The checkpoint contains the full `FoveatedVLM` model:
-
-- `encoder.dino.*` — DINOv2-small vision backbone
-- `encoder.query_input_proj.*` — Query projection (bias=False)
-- `encoder.output_proj.*` — Output projection
-- `dino_to_llm.*` — DINO→LLM dimension projection
-- `llm_to_query.*` — LLM→query dimension projection
-- `q_static` — Learnable static query for coarse pass
-- `q_init` — Learnable initial query for fine pass
-- `llm.*` — SmolLM2-135M-Instruct language model
-
 ## Usage
 
 ```python