docs: update benchmarks — add SmolVLM2 comparison, remove Val10K, split video/image
README.md
@@ -19,29 +19,32 @@ A compact vision-language model that uses **foveated attention** to compress each
 
 ## Benchmark Results
 
-###
-
-| Benchmark |
-
-| **MVBench**
-| **Video-MME**
-| **ScienceQA** | Accuracy (2017 MCQ) | 36.4% | — | — | — |
-| **POPE** | Accuracy (9000 Y/N) | 50.0% | — | — | — |
-| **Val 10K** | Loss ↓ (1000 samples) | 1.531 | — | — | — |
+### Video Benchmarks
+
+| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
+|-----------|:---------:|:------------:|:------------:|:------------:|
+| **MVBench** (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
+| **Video-MME** (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |
 
-
-
+### Image Benchmarks
+
+| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
+|-----------|:---------:|:------------:|:------------:|:------------:|
+| **ScienceQA** (2017 MCQ) | 36.4% | 73.8% | 80.0% | 89.6% |
+| **POPE** (9000 Y/N) | 50.0%* | — | — | — |
+
+\* POPE at 50% = random baseline. The 135M model always predicts one class. Not reported by SmolVLM2.
+
+> **Key context**: fVLM-135M uses **1 visual token per frame** vs SmolVLM2's 64-256 tokens per image. fVLM-135M has 158M params total — 1.6x smaller than SmolVLM2-256M. The gap on video benchmarks (4-5%) is modest given the extreme compression.
 
 ### Results by Inference Mode
 
 fVLM supports three inference modes with different speed/quality tradeoffs:
 
 | Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
-
-| Val 10K (loss ↓) | 1.879 | 1.533 | **1.531** |
+|-----------|:----------:|:-----------:|:--------------:|
 | MVBench | 27.4% | **28.0%** | 27.9% |
 | Video-MME | 26.2% | **29.5%** | 28.7% |
-| POPE | 50.0% | 50.0% | 50.0% |
 | ScienceQA | **36.4%** | 35.6% | 35.4% |
 
 - **Coarse-Only**: Single static-query pass (fastest, no foveation)
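The new POPE footnote (50% equals the random baseline because the model collapses to a single answer) can be sanity-checked in a few lines. The snippet below is illustrative only; it assumes POPE's 9000 Y/N questions are evenly balanced between "yes" and "no", which is how the benchmark is constructed:

```python
# Illustrative: why a degenerate single-class predictor scores exactly 50%
# on a balanced yes/no benchmark such as POPE (9000 Y/N questions).

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# POPE pairs each present-object question with an absent-object one,
# so the label set is (approximately) half "yes", half "no".
labels = ["yes"] * 4500 + ["no"] * 4500

# A model that has collapsed to always answering "yes":
constant_preds = ["yes"] * len(labels)

print(accuracy(constant_preds, labels))  # 0.5, the random-guess baseline
```

Any score meaningfully above 0.5 would indicate the model actually discriminates between present and absent objects, which is why the table flags the 50.0% with an asterisk.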

@@ -51,9 +54,8 @@ fVLM supports three inference modes with different speed/quality tradeoffs:
 ### Analysis
 
 - **Foveation helps on video**: coarse→fine adds +3.3% on Video-MME over coarse-only, confirming that learned "where to look" queries improve video understanding
-- **
-- **
-- **ScienceQA**: Best at 36.4% with coarse-only — static images don't benefit from foveation
+- **ScienceQA**: Best at 36.4% with coarse-only — static images don't benefit from temporal foveation
+- **Scale gap**: The large gap on ScienceQA (36% vs 74%) shows the 135M backbone limits image reasoning. Video benchmarks are closer because foveated compression is highly efficient for temporal tasks
 
 ## Architecture
 
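The compression argument in the analysis is easy to quantify for a long clip. This back-of-envelope sketch just multiplies the token counts quoted in the benchmark section (1 token per frame for fVLM, 64-256 per image for SmolVLM2, using the low end of that range):

```python
# Back-of-envelope visual-token budget for a 64-frame clip, using the
# per-frame / per-image token counts quoted in the benchmark section.

frames = 64
fvlm_tokens_per_frame = 1       # foveated attention: one visual token per frame
smolvlm_tokens_per_image = 64   # low end of SmolVLM2's 64-256 tokens per image

fvlm_total = frames * fvlm_tokens_per_frame
smolvlm_total = frames * smolvlm_tokens_per_image

print(fvlm_total)     # 64 visual tokens for the whole clip
print(smolvlm_total)  # 4096 tokens even at SmolVLM2's lowest setting
```

A 64x (or larger) reduction in visual sequence length is what lets the coarse pass stay cheap as the frame count grows, at the cost of the per-benchmark accuracy gaps shown above.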
@@ -95,28 +97,6 @@ This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs:
 - **Loss**: DPO with beta=0.1
 - **LR**: 1e-6 all components
 
-## Training Performance
-
-Optimized for A100 80GB with coarse-pass optimization (skip text in coarse LLM — causal attention makes it mathematically equivalent):
-
-| Config | Throughput | Memory |
-|--------|-----------|--------|
-| 135M, bs=32 | ~30 samp/s | 8 GB |
-| 1.7B, bs=32, grad_ckpt | 15.7 samp/s | 26.5 GB |
-
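The coarse-pass optimization mentioned above relies on a general property of causal attention: outputs at earlier positions are unaffected by tokens appended later, so dropping the trailing text tokens from the coarse pass changes nothing at the visual positions. A minimal NumPy sketch of that property (toy shapes and an identity-projection attention head, not the fVLM implementation):

```python
import numpy as np

def causal_attention(x):
    """Single-head causal self-attention with identity Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # block future positions
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))  # 4 "visual" tokens
text = rng.normal(size=(3, 8))    # 3 "text" tokens appended after them

full = causal_attention(np.concatenate([visual, text]))
visual_only = causal_attention(visual)

# Outputs at the visual positions are identical whether or not text follows,
# so the coarse pass can skip the text tokens entirely.
print(np.allclose(full[:4], visual_only))  # True
```

The same argument applies layer by layer through a causal transformer, which is why skipping the text suffix in the coarse LLM pass is exact rather than an approximation.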
-## Model Components
-
-The checkpoint contains the full `FoveatedVLM` model:
-
-- `encoder.dino.*` — DINOv2-small vision backbone
-- `encoder.query_input_proj.*` — Query projection (bias=False)
-- `encoder.output_proj.*` — Output projection
-- `dino_to_llm.*` — DINO→LLM dimension projection
-- `llm_to_query.*` — LLM→query dimension projection
-- `q_static` — Learnable static query for coarse pass
-- `q_init` — Learnable initial query for fine pass
-- `llm.*` — SmolLM2-135M-Instruct language model
-
 ## Usage
 
 ```python