docs: update benchmarks — add SmolVLM2 comparison, remove Val10K, split video/image
README.md
@@ -19,29 +19,32 @@ A compact vision-language model that uses **foveated attention** to compress each
 
 ## Benchmark Results
 
-###
-
-| Benchmark |
-
-| **MVBench**
-| **Video-MME**
-| **ScienceQA** | Accuracy (2017 MCQ) | 36.4% | — | — | — |
-| **POPE** | Accuracy (9000 Y/N) | 50.0% | — | — | — |
-| **Val 10K** | Loss ↓ (1000 samples) | 1.531 | — | — | — |
+### Video Benchmarks
+
+| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
+|-----------|:---------:|:------------:|:------------:|:------------:|
+| **MVBench** (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
+| **Video-MME** (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |
 
-
-
+### Image Benchmarks
+
+| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
+|-----------|:---------:|:------------:|:------------:|:------------:|
+| **ScienceQA** (2017 MCQ) | 36.4% | 73.8% | 80.0% | 89.6% |
+| **POPE** (9000 Y/N) | 50.0%* | — | — | — |
+
+\* POPE at 50% = random baseline. The 135M model always predicts one class. Not reported by SmolVLM2.
+
+> **Key context**: fVLM-135M uses **1 visual token per frame** vs SmolVLM2's 64-256 tokens per image. fVLM-135M has 158M params total — 1.6x smaller than SmolVLM2-256M. The gap on video benchmarks (4-5%) is modest given the extreme compression.
 
 ### Results by Inference Mode
 
 fVLM supports three inference modes with different speed/quality tradeoffs:
 
 | Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
-
-| Val 10K (loss ↓) | 1.879 | 1.533 | **1.531** |
+|-----------|:----------:|:-----------:|:--------------:|
 | MVBench | 27.4% | **28.0%** | 27.9% |
 | Video-MME | 26.2% | **29.5%** | 28.7% |
-| POPE | 50.0% | 50.0% | 50.0% |
 | ScienceQA | **36.4%** | 35.6% | 35.4% |
 
 - **Coarse-Only**: Single static-query pass (fastest, no foveation)
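The new POPE footnote (50% equals the random baseline because the model collapses to a single answer) can be sanity-checked in a few lines. The snippet below is illustrative only; it assumes POPE's 9000 Y/N questions are evenly balanced between "yes" and "no", which is how the benchmark is constructed:

```python
# Illustrative: why a degenerate single-class predictor scores exactly 50%
# on a balanced yes/no benchmark such as POPE (9000 Y/N questions).

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# POPE pairs each present-object question with an absent-object one,
# so the label set is (approximately) half "yes", half "no".
labels = ["yes"] * 4500 + ["no"] * 4500

# A model that has collapsed to always answering "yes":
constant_preds = ["yes"] * len(labels)

print(accuracy(constant_preds, labels))  # 0.5, the random-guess baseline
```

Any score meaningfully above 0.5 would indicate the model actually discriminates between present and absent objects, which is why the table flags the 50.0% with an asterisk.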

@@ -51,9 +54,8 @@ fVLM supports three inference modes with different speed/quality tradeoffs:
 ### Analysis
 
 - **Foveation helps on video**: coarse→fine adds +3.3% on Video-MME over coarse-only, confirming that learned "where to look" queries improve video understanding
-- **
-- **
-- **ScienceQA**: Best at 36.4% with coarse-only — static images don't benefit from foveation
+- **ScienceQA**: Best at 36.4% with coarse-only — static images don't benefit from temporal foveation
+- **Scale gap**: The large gap on ScienceQA (36% vs 74%) shows the 135M backbone limits image reasoning. Video benchmarks are closer because foveated compression is highly efficient for temporal tasks
 
 ## Architecture
 
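The compression argument in the analysis is easy to quantify for a long clip. This back-of-envelope sketch just multiplies the token counts quoted in the benchmark section (1 token per frame for fVLM, 64-256 per image for SmolVLM2, using the low end of that range):

```python
# Back-of-envelope visual-token budget for a 64-frame clip, using the
# per-frame / per-image token counts quoted in the benchmark section.

frames = 64
fvlm_tokens_per_frame = 1       # foveated attention: one visual token per frame
smolvlm_tokens_per_image = 64   # low end of SmolVLM2's 64-256 tokens per image

fvlm_total = frames * fvlm_tokens_per_frame
smolvlm_total = frames * smolvlm_tokens_per_image

print(fvlm_total)     # 64 visual tokens for the whole clip
print(smolvlm_total)  # 4096 tokens even at SmolVLM2's lowest setting
```

A 64x (or larger) reduction in visual sequence length is what lets the coarse pass stay cheap as the frame count grows, at the cost of the per-benchmark accuracy gaps shown above.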
@@ -95,28 +97,6 @@ This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs:
 - **Loss**: DPO with beta=0.1
 - **LR**: 1e-6 all components
 
-## Training Performance
-
-Optimized for A100 80GB with coarse-pass optimization (skip text in coarse LLM — causal attention makes it mathematically equivalent):
-
-| Config | Throughput | Memory |
-|--------|-----------|--------|
-| 135M, bs=32 | ~30 samp/s | 8 GB |
-| 1.7B, bs=32, grad_ckpt | 15.7 samp/s | 26.5 GB |
-
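The coarse-pass optimization mentioned above relies on a general property of causal attention: outputs at earlier positions are unaffected by tokens appended later, so dropping the trailing text tokens from the coarse pass changes nothing at the visual positions. A minimal NumPy sketch of that property (toy shapes and an identity-projection attention head, not the fVLM implementation):

```python
import numpy as np

def causal_attention(x):
    """Single-head causal self-attention with identity Q/K/V projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # block future positions
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 8))  # 4 "visual" tokens
text = rng.normal(size=(3, 8))    # 3 "text" tokens appended after them

full = causal_attention(np.concatenate([visual, text]))
visual_only = causal_attention(visual)

# Outputs at the visual positions are identical whether or not text follows,
# so the coarse pass can skip the text tokens entirely.
print(np.allclose(full[:4], visual_only))  # True
```

The same argument applies layer by layer through a causal transformer, which is why skipping the text suffix in the coarse LLM pass is exact rather than an approximation.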
-## Model Components
-
-The checkpoint contains the full `FoveatedVLM` model:
-
-- `encoder.dino.*` — DINOv2-small vision backbone
-- `encoder.query_input_proj.*` — Query projection (bias=False)
-- `encoder.output_proj.*` — Output projection
-- `dino_to_llm.*` — DINO→LLM dimension projection
-- `llm_to_query.*` — LLM→query dimension projection
-- `q_static` — Learnable static query for coarse pass
-- `q_init` — Learnable initial query for fine pass
-- `llm.*` — SmolLM2-135M-Instruct language model
-
 ## Usage
 
 ```python