fVLM-135M (Foveated Vision-Language Model)

A compact vision-language model that uses foveated attention to compress each video frame into a single visual token, enabling efficient processing of long videos.

Benchmark Results

Video Benchmarks

| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|---|---|---|---|---|
| MVBench (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
| Video-MME (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |

Image Benchmarks

| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|---|---|---|---|---|
| ScienceQA (2017 MCQ) | 36.0% | 73.8% | 80.0% | 89.6% |

Key context: fVLM-135M uses 1 visual token per frame versus SmolVLM2's 64-256 tokens per image, and it totals 157.6M parameters, about 1.6x fewer than SmolVLM2-256M. Given that extreme compression, the 4-5 point gap on the video benchmarks is modest.

Results by Inference Mode

fVLM supports three inference modes with different speed/quality tradeoffs:

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|---|---|---|---|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 34.7% | 36.0% | 36.0% |
  • Coarse-Only: Single static-query pass (fastest, no foveation)
  • Coarse→Fine: Two-pass parallel forward (training mode, with foveated attention)
  • Autoregressive: Sequential inference with KV cache (highest quality)

Analysis

  • Foveation helps on video: coarse→fine adds 3.3 points on Video-MME over coarse-only, confirming that learned "where to look" queries improve video understanding
  • ScienceQA: best at 36.0% in the coarse_fine and autoregressive modes; foveated attention provides a small benefit even on static images once they are replicated to 8 frames
  • Scale gap: the large gap on ScienceQA (36% vs. 74%) shows that the 135M backbone limits image reasoning. The video benchmarks are closer because foveated compression is especially well suited to temporal tasks

Architecture

| Component | Details |
|---|---|
| Language Model | SmolLM2-135M-Instruct |
| Vision Encoder | DINOv2-small |
| Attention | Deep query-guided foveated cross-attention |
| Visual Tokens | 1 token per frame (query-compressed) |
| Total Parameters | 157.6M |
| Query Dimension | 384 |
| Visual Scale | 0.14 |

How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a single visual token using a learned query mechanism:

  1. DINOv2 encodes each frame into patch features and caches K/V at every layer
  2. A query vector is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
  3. The single output token is projected to LLM dimension and prepended to the text sequence
  4. The LLM generates the next query from its hidden state, creating a feedback loop where the model learns where to look

This enables processing 64+ frames with the same memory as a few frames in traditional VLMs.
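To make the mechanism concrete, here is a minimal, hypothetical PyTorch sketch of the deep query attention step (module and projection names are our assumptions, not the released implementation; DINOv2-small features are 384-d, and 576 is assumed for SmolLM2-135M's hidden size):

# Hypothetical sketch: one query vector attends to cached patch K/V at every
# DINO layer and is refined residually, producing a single visual token that
# is then projected to the LLM dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepQueryAttention(nn.Module):
    def __init__(self, query_dim=384, num_layers=12, llm_dim=576):
        super().__init__()
        self.q_proj = nn.ModuleList(nn.Linear(query_dim, query_dim) for _ in range(num_layers))
        self.kv_proj = nn.ModuleList(nn.Linear(query_dim, 2 * query_dim) for _ in range(num_layers))
        self.to_llm = nn.Linear(query_dim, llm_dim)

    def forward(self, query, layer_feats):
        # query: [B, D] (static, or produced by the LLM in coarse→fine mode)
        # layer_feats: list of [B, N, D] patch features, one per DINO layer
        for q_proj, kv_proj, feats in zip(self.q_proj, self.kv_proj, layer_feats):
            k, v = kv_proj(feats).chunk(2, dim=-1)             # [B, N, D] each
            q = q_proj(query).unsqueeze(1)                     # [B, 1, D]
            attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
            query = query + (attn @ v).squeeze(1)              # residual refinement
        return self.to_llm(query)                              # [B, llm_dim]: 1 visual token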

Training Pipeline

Stage 1: Visual Alignment

  • Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
  • Loss: Full-text cross-entropy (predict all tokens)
  • LR: Converging schedule (connector 1e-3 → 3e-5, backbone 1e-5 → 3e-5), sketched below
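One way to implement that converging schedule is per-group LambdaLR multipliers (a sketch: linear interpolation is our assumption, and the module names connector/llm are illustrative):

import torch

optimizer = torch.optim.AdamW([
    {"params": model.connector.parameters(), "lr": 1e-3},  # hypothetical module name
    {"params": model.llm.parameters(),       "lr": 1e-5},  # hypothetical module name
])

def make_lambda(start, end, total_steps):
    # LambdaLR scales the group's base lr, so return end/start at the final step
    return lambda step: 1.0 + (end / start - 1.0) * min(step / total_steps, 1.0)

total_steps = 10_000  # placeholder
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=[make_lambda(1e-3, 3e-5, total_steps), make_lambda(1e-5, 3e-5, total_steps)],
)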

Stage 2: Vision-Language SFT

  • Data: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
  • Loss: Answer-only cross-entropy (mask user/system tokens)
  • LR: Flat 3e-5 all components with cosine decay
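A minimal sketch of what answer-only supervision looks like (our illustration; the real loss_mask construction depends on the chat template's token offsets):

import torch
import torch.nn.functional as F

def masked_ce(logits, labels, loss_mask):
    # logits: [B, S, V]; labels: [B, S]; loss_mask: [B, S], 1 on answer tokens,
    # 0 on user/system tokens. Standard next-token shift, then masked average.
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    shift_mask = loss_mask[:, 1:].reshape(-1)
    ce = F.cross_entropy(shift_logits, shift_labels, reduction="none")
    return (ce * shift_mask).sum() / shift_mask.sum().clamp(min=1)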

Stage 3: DPO (Direct Preference Optimization)

  • Data: RLAIF-V (83K preference pairs)
  • Loss: DPO with beta=0.1
  • LR: 1e-6 all components
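For reference, the standard DPO objective with beta = 0.1 looks like this (a generic sketch, not the project's training code; inputs are per-sequence summed log-probs of the answer tokens under the policy and a frozen reference model):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Each argument: [B] summed log-probs of the response tokens
    margin = (policy_chosen_logp - policy_rejected_logp) \
           - (ref_chosen_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()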

Usage

Setup

import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from release.model import FoveatedVLM

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-135M", "model.safetensors")

# Build model
model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-135M-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights (the checkpoint is safetensors, so torch.load cannot read it)
state_dict = load_file(ckpt_path)
model.load_state_dict(state_dict)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Image Input

Important: fVLM treats all inputs as video. Static images must be replicated to 8 frames to match the training distribution (Stages 2 and 3 used replicate_image_frames: 8). Passing a single frame for an image will produce degraded results.

from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                      # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)   # [8, 3, 224, 224] — replicate to 8
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]

Video Input

For video, sample up to 64 frames uniformly. No replication needed.

# video_frames: list of PIL Images (sampled from video)
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
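If you need a sampler, here is one way to draw up to 64 uniformly spaced frames with OpenCV (our choice of decoder, not something the model requires; decord or PyAV work just as well):

import cv2
from PIL import Image

def sample_frames(path, num_frames=64):
    # Uniformly spaced frame indices across the whole clip
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    n = max(1, min(num_frames, total))
    indices = [int(i * total / n) for i in range(n)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

video_frames = sample_frames("clip.mp4")  # then build `frames` as above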

Inference

# Tokenize prompt
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

# Forward pass (coarse_fine mode recommended for best quality)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss

Inference Modes

| Mode | Description | Use Case |
|---|---|---|
| coarse_only | Single static-query pass | Fastest; good for images |
| coarse_fine | Two-pass parallel forward | Best overall; uses foveated attention |
| autoregressive | Sequential with KV cache | Highest quality for video |
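A quick way to feel the speed/quality tradeoff is to time all three modes on the same clip (a simple wall-clock sketch reusing the tensors from the Inference section):

import time

for mode in ("coarse_only", "coarse_fine", "autoregressive"):
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
        model(frames=frames, input_ids=input_ids,
              attention_mask=attention_mask, loss_mask=loss_mask, mode=mode)
    torch.cuda.synchronize()
    print(f"{mode}: {time.perf_counter() - start:.3f}s")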

License

Apache 2.0
