fVLM-135M (Foveated Vision-Language Model)
A compact vision-language model that uses foveated attention to compress each video frame into a single visual token, enabling efficient processing of long videos.
Benchmark Results
Video Benchmarks
| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|---|---|---|---|---|
| MVBench (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
| Video-MME (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |
Image Benchmarks
| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|---|---|---|---|---|
| ScienceQA (2017 MCQ) | 36.0% | 73.8% | 80.0% | 89.6% |
Key context: fVLM-135M uses 1 visual token per frame vs SmolVLM2's 64-256 tokens per image. fVLM-135M has 158M params total — 1.6x smaller than SmolVLM2-256M. The gap on video benchmarks (4-5%) is modest given the extreme compression.
Results by Inference Mode
fVLM supports three inference modes with different speed/quality tradeoffs:
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 34.7% | 36.0% | 36.0% |
- Coarse-Only: Single static-query pass (fastest, no foveation)
- Coarse→Fine: Two-pass parallel forward (training mode, with foveated attention)
- Autoregressive: Sequential inference with KV cache (highest quality)
Analysis
- Foveation helps on video: coarse→fine adds +3.3% on Video-MME over coarse-only, confirming that learned "where to look" queries improve video understanding
- ScienceQA: Best at 36.0% with coarse_fine/autoregressive modes — foveated attention provides a small benefit even on static images when replicated to 8 frames
- Scale gap: The large gap on ScienceQA (36% vs 74%) shows the 135M backbone limits image reasoning. Video benchmarks are closer because foveated compression is highly efficient for temporal tasks
Architecture
| Component | Details |
|---|---|
| Language Model | SmolLM2-135M-Instruct |
| Vision Encoder | DINOv2-small |
| Attention | Deep query-guided foveated cross-attention |
| Visual Tokens | 1 token per frame (query-compressed) |
| Total Parameters | 157.6M |
| Query Dimension | 384 |
| Visual Scale | 0.14 |
How Foveated Attention Works
Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a single visual token using a learned query mechanism:
- DINOv2 encodes each frame into patch features and caches K/V at every layer
- A query vector is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
- The single output token is projected to LLM dimension and prepended to the text sequence
- The LLM generates the next query from its hidden state, creating a feedback loop where the model learns where to look
This enables processing 64+ frames with the same memory as a few frames in traditional VLMs.
Training Pipeline
Stage 1: Visual Alignment
- Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- Loss: Full-text cross-entropy (predict all tokens)
- LR: Converging schedule — connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
Stage 2: Vision-Language SFT
- Data: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- Loss: Answer-only cross-entropy (mask user/system tokens)
- LR: Flat 3e-5 all components with cosine decay
Stage 3: DPO (Direct Preference Optimization)
- Data: RLAIF-V (83K preference pairs)
- Loss: DPO with beta=0.1
- LR: 1e-6 all components
Usage
Setup
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from release.model import FoveatedVLM
# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-135M", "model.safetensors")
# Build model
model = FoveatedVLM(
llm_name="HuggingFaceTB/SmolLM2-135M-Instruct",
dino_name="facebook/dinov2-small",
query_dim=384,
visual_scale=0.14,
deep_query=True,
)
# Load weights
state_dict = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(state_dict)
model = model.to("cuda").to(torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
# Standard DINO preprocessing
frame_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
Image Input
Important: fVLM treats all inputs as video. Static images must be replicated to 8 frames to match training distribution (Stage 2 and 3 used replicate_image_frames: 8). Passing a single frame for an image will produce degraded results.
from PIL import Image
img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img) # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1) # [8, 3, 224, 224] — replicate to 8
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16) # [1, 8, 3, 224, 224]
Video Input
For video, sample up to 64 frames uniformly. No replication needed.
# video_frames: list of PIL Images (sampled from video)
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
Inference
# Tokenize prompt
messages = [
{"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)
# Forward pass (coarse_fine mode recommended for best quality)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
result = model(
frames=frames,
input_ids=input_ids,
attention_mask=attention_mask,
loss_mask=loss_mask,
mode="coarse_fine",
)
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
Inference Modes
| Mode | Description | Use Case |
|---|---|---|
coarse_only |
Single static-query pass | Fastest; good for images |
coarse_fine |
Two-pass parallel forward | Best overall; uses foveated attention |
autoregressive |
Sequential with KV cache | Highest quality for video |
License
Apache 2.0
- Downloads last month
- 27