---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
---
# fVLM-135M (Foveated Vision-Language Model)
A compact vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.
## Benchmark Results
### Video Benchmarks
| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|-----------|:---------:|:------------:|:------------:|:------------:|
| **MVBench** (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
| **Video-MME** (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |
### Image Benchmarks
| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|-----------|:---------:|:------------:|:------------:|:------------:|
| **ScienceQA** (2017 MCQ) | 36.0% | 73.8% | 80.0% | 89.6% |
> **Key context**: fVLM-135M uses **1 visual token per frame**, versus 64-256 tokens per image for SmolVLM2. It has about 158M parameters in total, roughly 1.6x fewer than SmolVLM2-256M. The gap of 4-5 percentage points on the video benchmarks is modest given the extreme compression.
### Results by Inference Mode
fVLM supports three inference modes with different speed/quality tradeoffs:
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|:----------:|:-----------:|:--------------:|
| MVBench | 27.4% | **28.0%** | 27.9% |
| Video-MME | 26.2% | **29.5%** | 28.7% |
| ScienceQA | 34.7% | **36.0%** | **36.0%** |
- **Coarse-Only**: Single static-query pass (fastest, no foveation)
- **Coarse→Fine**: Two-pass parallel forward (training mode, with foveated attention)
- **Autoregressive**: Sequential inference with KV cache (ties or trails coarse→fine by at most 0.8 points on the benchmarks above)
### Analysis
- **Foveation helps on video**: coarse→fine gains 3.3 percentage points on Video-MME over coarse-only, suggesting that learned "where to look" queries improve video understanding
- **ScienceQA**: Best at 36.0% with the `coarse_fine` and `autoregressive` modes, 1.3 points above coarse-only, so foveated attention provides a small benefit even on static images replicated to 8 frames
- **Scale gap**: The large gap on ScienceQA (36.0% vs 73.8% for SmolVLM2-256M) shows that the 135M backbone limits image reasoning. The video benchmarks are closer (gaps of 4-5 points), which is consistent with foveated compression being efficient for temporal tasks
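The per-mode gains discussed in this section can be recomputed from the "Results by Inference Mode" table. The script below is only a convenience sketch; the dictionary layout and the `gain_over_coarse` helper are illustrative names, not part of the release.

```python
# Scores (%) copied from the "Results by Inference Mode" table.
scores = {
    "MVBench":   {"coarse_only": 27.4, "coarse_fine": 28.0, "autoregressive": 27.9},
    "Video-MME": {"coarse_only": 26.2, "coarse_fine": 29.5, "autoregressive": 28.7},
    "ScienceQA": {"coarse_only": 34.7, "coarse_fine": 36.0, "autoregressive": 36.0},
}


def gain_over_coarse(benchmark, mode):
    """Gain of `mode` over coarse-only, in percentage points (1 decimal)."""
    row = scores[benchmark]
    return round(row[mode] - row["coarse_only"], 1)


for name in scores:
    print(
        name,
        "coarse_fine:", gain_over_coarse(name, "coarse_fine"),
        "autoregressive:", gain_over_coarse(name, "autoregressive"),
    )
```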
## Architecture
| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-135M-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | 157.6M |
| **Query Dimension** | 384 |
| **Visual Scale** | 0.14 |
### How Foveated Attention Works
Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:
1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*
This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
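Steps 1 to 3 above can be illustrated with a toy single-query pooling loop. This is a simplified sketch, not the released implementation: it assumes single-head attention with a residual update, omits the projections, layer norms, and MLP blocks of a real DINOv2 layer, and uses random tensors in place of cached K/V. The function name `deep_query_pool` and the shapes (256 patches for a 224x224 frame, width 384, 12 layers) are illustrative choices. Step 4, where the LLM produces the next query, is not shown.

```python
import torch
import torch.nn.functional as F


def deep_query_pool(query, layer_kv):
    """Compress one frame to a single token (toy version).

    query:    [B, D] query vector for this frame.
    layer_kv: list of (K, V) pairs, one per encoder layer, each [B, N, D].
    Returns:  [B, D], one visual token per frame.
    """
    q = query.unsqueeze(1)  # [B, 1, D]
    for k, v in layer_kv:
        # Single-query attention over the N cached patch keys of this layer.
        weights = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # [B, 1, N]
        q = q + weights @ v  # residual update (assumption)
    return q.squeeze(1)


torch.manual_seed(0)
B, N, D, num_layers = 2, 256, 384, 12
layer_kv = [(torch.randn(B, N, D), torch.randn(B, N, D)) for _ in range(num_layers)]
token = deep_query_pool(torch.randn(B, D), layer_kv)
print(token.shape)  # one D-dimensional token per frame in the batch
```

Whatever the number of patches `N`, the output stays a single vector per frame, which is why the LLM sequence grows by only one token per frame.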
## Training Pipeline
### Stage 1: Visual Alignment
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule — connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
### Stage 2: Vision-Language SFT
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: 3e-5 for all components (no per-component rates), with cosine decay
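The difference between the Stage 1 full-text loss and the Stage 2 answer-only loss comes down to which positions contribute to the cross-entropy. The sketch below shows one common way to apply such a mask. It assumes the usual next-token shift and a mean over unmasked positions; the function name `answer_only_ce` is hypothetical and the model's internal loss may differ in detail.

```python
import torch
import torch.nn.functional as F


def answer_only_ce(logits, input_ids, loss_mask):
    """Next-token cross-entropy averaged over positions where loss_mask == 1.

    logits: [B, S, V]; input_ids: [B, S]; loss_mask: [B, S] (1 = count this token).
    """
    vocab = logits.shape[-1]
    pred = logits[:, :-1].reshape(-1, vocab)      # position t predicts token t+1
    target = input_ids[:, 1:].reshape(-1)
    mask = loss_mask[:, 1:].reshape(-1).float()
    per_token = F.cross_entropy(pred, target, reduction="none")
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


torch.manual_seed(0)
logits = torch.randn(1, 6, 10)
input_ids = torch.randint(0, 10, (1, 6))
full_mask = torch.ones(1, 6)                             # Stage 1 style: all tokens
answer_mask = torch.tensor([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])  # Stage 2 style: answer only
print(answer_only_ce(logits, input_ids, full_mask))
print(answer_only_ce(logits, input_ids, answer_mask))
```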
### Stage 3: DPO (Direct Preference Optimization)
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 1e-6 for all components
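For reference, the standard DPO objective for a single preference pair is shown below with `beta=0.1`. This is the textbook formula, not code from the training scripts; it assumes the inputs are summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model, and the function name `dpo_loss` is illustrative.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Example log-probabilities (made up for illustration only).
print(dpo_loss(policy_chosen=-12.0, policy_rejected=-15.0, ref_chosen=-13.0, ref_rejected=-14.0))
```

When the policy and the reference agree (zero margin) the loss equals `log(2)`, and it decreases as the policy favors the chosen response more strongly than the reference.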
## Usage
### Setup
```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from release.model import FoveatedVLM

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-135M", "model.safetensors")

# Build model
model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-135M-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights (the checkpoint is a safetensors file, which torch.load cannot read)
state_dict = load_file(ckpt_path, device="cpu")
model.load_state_dict(state_dict)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
### Image Input
**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution (Stages 2 and 3 used `replicate_image_frames: 8`). Passing a single frame for an image will produce degraded results.
```python
from PIL import Image
img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img) # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1) # [8, 3, 224, 224] — replicate to 8
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16) # [1, 8, 3, 224, 224]
```
### Video Input
For video, sample up to 64 frames uniformly. No replication needed.
```python
# video_frames: list of PIL Images (sampled from video)
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
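The snippet above expects `video_frames` to be sampled already. One simple way to pick uniformly spaced frame indices, capped at 64, is sketched below. The helper name `uniform_frame_indices` is hypothetical, and decoding the selected frames into PIL images is left to whichever video reader you use.

```python
def uniform_frame_indices(num_frames, max_frames=64):
    """Return up to `max_frames` evenly spaced indices in [0, num_frames - 1]."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    n = min(num_frames, max_frames)
    if n == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids duplicates.
    return [(i * (num_frames - 1)) // (n - 1) for i in range(n)]


print(uniform_frame_indices(10, max_frames=4))  # [0, 3, 6, 9]
print(len(uniform_frame_indices(1000)))         # 64
```

Short clips with fewer than 64 frames are returned in full, so no frame is repeated.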
### Inference
```python
# Tokenize prompt
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

# Forward pass (coarse_fine mode recommended for best quality)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",
    )

# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```
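The forward pass returns logits rather than generated text. A minimal greedy decoding loop built on those logits might look like the sketch below. It assumes the last position of `result["logits"]` predicts the next token, and it re-runs the full forward pass at every step instead of using a KV cache. `greedy_decode` and `step_fn` are hypothetical names; a toy `step_fn` stands in for the model so the block runs on its own.

```python
import torch
import torch.nn.functional as F


def greedy_decode(step_fn, input_ids, eos_id, max_new_tokens=32):
    """Append the argmax token until every sequence emits eos_id or the budget runs out.

    step_fn(ids) must return logits of shape [B, S, V] for ids of shape [B, S].
    """
    ids = input_ids
    for _ in range(max_new_tokens):
        next_id = step_fn(ids)[:, -1].argmax(dim=-1, keepdim=True)  # [B, 1]
        ids = torch.cat([ids, next_id], dim=1)
        if (next_id == eos_id).all():
            break
    return ids


# Toy stand-in for the model: always predicts (previous token + 1) mod 5.
def toy_step(ids):
    return F.one_hot((ids + 1) % 5, num_classes=5).float()


print(greedy_decode(toy_step, torch.tensor([[0]]), eos_id=3))

# With the real model, step_fn could wrap the call shown above (assumption):
# step_fn = lambda ids: model(
#     frames=frames, input_ids=ids, attention_mask=torch.ones_like(ids),
#     loss_mask=torch.ones_like(ids, dtype=torch.float32), mode="coarse_fine",
# )["logits"]
# out = greedy_decode(step_fn, input_ids, eos_id=tokenizer.eos_token_id)
# print(tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True))
```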
### Inference Modes
| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; no foveation |
| `coarse_fine` | Two-pass parallel forward | Best benchmark scores; uses foveated attention |
| `autoregressive` | Sequential with KV cache | Sequential decoding; scores within 0.8 points of `coarse_fine` |
## License
Apache 2.0