

	---
	license: apache-2.0
	language:
	- en
	tags:
	- vision-language
	- video-understanding
	- foveated-attention
	- multimodal
	- smollm2
	- dinov2
	library_name: pytorch
	pipeline_tag: image-text-to-text
	---

	# fVLM-135M (Foveated Vision-Language Model)

	A compact vision-language model that uses foveated attention to compress each video frame into a single visual token, enabling efficient processing of long videos.

	## Benchmark Results

	### Video Benchmarks

	| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
	|-----------|:---------:|:------------:|:------------:|:------------:|
	| MVBench (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
	| Video-MME (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |

	### Image Benchmarks

	| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
	|-----------|:---------:|:------------:|:------------:|:------------:|
	| ScienceQA (2017 MCQ) | 36.0% | 73.8% | 80.0% | 89.6% |

	> Key context: fVLM-135M uses 1 visual token per frame, versus 64-256 tokens per image for SmolVLM2. At 158M total parameters, it is about 1.6x smaller than SmolVLM2-256M. The gap to SmolVLM2-256M on video benchmarks (4.2 to 4.7 percentage points) is modest given the extreme compression.

	### Results by Inference Mode

	fVLM supports three inference modes with different speed/quality tradeoffs:

	| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
	|-----------|:----------:|:-----------:|:--------------:|
	| MVBench | 27.4% | 28.0% | 27.9% |
	| Video-MME | 26.2% | 29.5% | 28.7% |
	| ScienceQA | 34.7% | 36.0% | 36.0% |

	- Coarse-Only: Single static-query pass (fastest, no foveation)
	- Coarse→Fine: Two-pass parallel forward (training mode, with foveated attention)
	- Autoregressive: Sequential inference with KV cache (matches or trails Coarse→Fine by under 1 point in the table above)

	### Analysis

	- Foveation helps on video: coarse→fine improves Video-MME by 3.3 points over coarse-only (29.5% vs 26.2%), which suggests that the learned "where to look" queries improve video understanding. The gain on MVBench is smaller (0.6 points).
	- ScienceQA: Best at 36.0% with the `coarse_fine` and `autoregressive` modes, 1.3 points above coarse-only. Foveated attention provides a small benefit even on static images when they are replicated to 8 frames.
	- Scale gap: The large gap on ScienceQA (36.0% vs 73.8% for SmolVLM2-256M) suggests that the 135M backbone limits image reasoning. Video benchmarks are closer, which is consistent with foveated compression being efficient for temporal tasks.
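
The per-mode gains quoted in this analysis can be recomputed from the inference-mode table. The snippet below is only a convenience sketch: the numbers are copied from the table, while the dictionary layout and the helper name `gain_over_coarse` are illustrative choices, not part of the fVLM code.

```python
# Accuracy (%) by inference mode, copied from the inference-mode table.
results = {
    "MVBench":   {"coarse_only": 27.4, "coarse_fine": 28.0, "autoregressive": 27.9},
    "Video-MME": {"coarse_only": 26.2, "coarse_fine": 29.5, "autoregressive": 28.7},
    "ScienceQA": {"coarse_only": 34.7, "coarse_fine": 36.0, "autoregressive": 36.0},
}


def gain_over_coarse(benchmark: str, mode: str) -> float:
    """Gain of `mode` over the coarse-only baseline, in percentage points."""
    scores = results[benchmark]
    return round(scores[mode] - scores["coarse_only"], 1)


for name in results:
    cf = gain_over_coarse(name, "coarse_fine")
    ar = gain_over_coarse(name, "autoregressive")
    print(f"{name}: coarse_fine {cf:+.1f}, autoregressive {ar:+.1f}")
```

The largest gain from foveation appears on Video-MME, and in no row does the autoregressive mode exceed coarse→fine.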

	## Architecture

	| Component | Details |
	|-----------|---------|
	| Language Model | SmolLM2-135M-Instruct |
	| Vision Encoder | DINOv2-small |
	| Attention | Deep query-guided foveated cross-attention |
	| Visual Tokens | 1 token per frame (query-compressed) |
	| Total Parameters | 157.6M |
	| Query Dimension | 384 |
	| Visual Scale | 0.14 |

	### How Foveated Attention Works

	Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a single visual token using a learned query mechanism:

	1. DINOv2 encodes each frame into patch features and caches K/V at every layer
	2. A query vector is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
	3. The single output token is projected to LLM dimension and prepended to the text sequence
	4. The LLM generates the next query from its hidden state, creating a feedback loop where the model learns where to look

	Because each frame adds only one token to the LLM sequence, 64 or more frames fit in the context budget that a traditional VLM spends on a few frames.
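
Steps 1 and 2 can be illustrated with a dependency-free toy, assuming plain single-head softmax attention and a residual update of the query at each layer. This is a sketch of the idea only, not the model's implementation: the real encoder has projections, normalization, and MLPs that are omitted here, and the names `attend` and `foveate`, the toy sizes, and the random K/V are all hypothetical. Only the count of 12 layers comes from the text.

```python
import math
import random


def attend(query, keys, values):
    """Single-query softmax attention over N patch keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def foveate(query, layer_kv):
    """Carry one query through every layer's cached patch K/V."""
    for keys, values in layer_kv:
        attended = attend(query, keys, values)
        # Residual update of the query (an assumption of this sketch)
        query = [q + a for q, a in zip(query, attended)]
    return query  # one vector per frame, i.e. a single visual token


random.seed(0)
DIM, PATCHES, LAYERS = 8, 16, 12  # toy sizes; only LAYERS = 12 is from the text


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the K/V that the vision encoder would cache at each layer
layer_kv = [(random_matrix(PATCHES, DIM), random_matrix(PATCHES, DIM))
            for _ in range(LAYERS)]
static_query = [0.0] * DIM  # stand-in for a learned query
token = foveate(static_query, layer_kv)
print(len(token))
```

However many patches a frame has, the output is one fixed-size vector, which is what keeps the LLM sequence short. Steps 3 and 4 (projection to the LLM dimension and the LLM producing the next query) are not shown.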

	## Training Pipeline

	### Stage 1: Visual Alignment
	- Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
	- Loss: Full-text cross-entropy (predict all tokens)
	- LR: Converging schedule (connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5)
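
One way to picture the Stage 1 schedule is two learning rates that start apart and meet at 3e-5. The sketch below assumes linear interpolation over training, which the card does not specify; the function name `converging_lr` and the step count are hypothetical.

```python
def converging_lr(step: int, total_steps: int, start: float, end: float) -> float:
    """Interpolate linearly from `start` to `end` (the linear shape is an assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


TOTAL_STEPS = 10_000  # hypothetical training length
for step in (0, 5_000, 10_000):
    connector = converging_lr(step, TOTAL_STEPS, 1e-3, 3e-5)
    backbone = converging_lr(step, TOTAL_STEPS, 1e-5, 3e-5)
    print(f"step {step:>6}: connector {connector:.2e}, backbone {backbone:.2e}")
```

The connector starts high and decays while the backbone warms up, so both groups end at the same rate.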

	### Stage 2: Vision-Language SFT
	- Data: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
	- Loss: Answer-only cross-entropy (mask user/system tokens)
	- LR: 3e-5 for all components, with cosine decay
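
Answer-only cross-entropy means that only assistant tokens contribute to the loss. A minimal sketch of such a mask is shown below, assuming the conversation is available as (role, token ids) segments. The helper names and token ids are hypothetical, and whether this matches the `loss_mask` argument in the Usage section is a guess.

```python
def answer_only_mask(segments):
    """Per-token loss mask: 1.0 for assistant tokens, 0.0 for everything else.

    `segments` is a list of (role, token_ids) pairs in conversation order.
    """
    mask = []
    for role, token_ids in segments:
        weight = 1.0 if role == "assistant" else 0.0
        mask.extend([weight] * len(token_ids))
    return mask


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / kept


# Hypothetical token ids, for illustration only
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
]
mask = answer_only_mask(segments)
loss = masked_mean([0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 4.0], mask)
print(mask, loss)
```

Stage 1, by contrast, would correspond to a mask of all ones, since it predicts every token.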

	### Stage 3: DPO (Direct Preference Optimization)
	- Data: RLAIF-V (83K preference pairs)
	- Loss: DPO with beta=0.1
	- LR: 1e-6 for all components
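
For reference, the standard DPO objective for a single preference pair can be sketched as follows, assuming summed sequence log-probabilities from the policy and from a frozen reference model are already available. The function name and arguments are illustrative; the card does not describe the actual training code.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: loss is log(2)
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy prefers the chosen answer more than the reference does: loss drops
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

With beta=0.1, a log-ratio margin of 4 nats only moves the sigmoid argument to 0.4, so the policy is pulled away from the reference gently.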

	## Usage

	### Setup

	```python
	import torch
	from torchvision import transforms
	from transformers import AutoTokenizer
	from huggingface_hub import hf_hub_download
	from safetensors.torch import load_file
	from release.model import FoveatedVLM  # local package: the fVLM code must be importable

	# Download checkpoint
	ckpt_path = hf_hub_download("sanps/fVLM-135M", "model.safetensors")

	# Build model
	model = FoveatedVLM(
	    llm_name="HuggingFaceTB/SmolLM2-135M-Instruct",
	    dino_name="facebook/dinov2-small",
	    query_dim=384,
	    visual_scale=0.14,
	    deep_query=True,
	)

	# Load weights (.safetensors files are read with safetensors, not torch.load)
	state_dict = load_file(ckpt_path, device="cpu")
	model.load_state_dict(state_dict)
	model = model.to("cuda").to(torch.bfloat16).eval()

	tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

	# Standard DINO preprocessing
	frame_transform = transforms.Compose([
	    transforms.Resize((224, 224)),
	    transforms.ToTensor(),
	    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
	])
	```

	### Image Input

	Important: fVLM treats all inputs as video. Static images must be replicated to 8 frames to match the training distribution (Stages 2 and 3 used `replicate_image_frames: 8`). Passing a single frame for an image will produce degraded results.

	```python
	from PIL import Image

	img = Image.open("photo.jpg").convert("RGB")
	frame_tensor = frame_transform(img) # [3, 224, 224]
	frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1) # [8, 3, 224, 224], replicated to 8 frames
	frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16) # [1, 8, 3, 224, 224]
	```

	### Video Input

	For video, sample up to 64 frames uniformly. No replication needed.

	```python
	# video_frames: list of PIL Images (sampled from video)
	tensors = [frame_transform(f) for f in video_frames]
	frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
	# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
	```
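
The card says to sample frames uniformly but does not say how. One simple option, sketched here, is to take the midpoint of each of 64 equal-length segments. The helper name `uniform_frame_indices` and the midpoint strategy are assumptions, not the authors' sampling code.

```python
def uniform_frame_indices(total_frames: int, max_frames: int = 64) -> list[int]:
    """Pick up to `max_frames` evenly spaced frame indices from a video."""
    if total_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame
        return list(range(total_frames))
    # Take the midpoint of each of `max_frames` equal-length segments
    step = total_frames / max_frames
    return [int(step * (i + 0.5)) for i in range(max_frames)]


indices = uniform_frame_indices(640)
print(len(indices), indices[0], indices[-1])
```

The resulting indices can then select the PIL frames to transform, for example `video_frames = [all_frames[i] for i in indices]`, where `all_frames` is a hypothetical list of decoded frames.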

	### Inference

	```python
	# Tokenize prompt
	messages = [
	    {"role": "user", "content": "Describe what is happening in this image."},
	]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
	attention_mask = torch.ones_like(input_ids)
	loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

	# Forward pass (coarse_fine mode recommended for best quality)
	with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
	    result = model(
	        frames=frames,
	        input_ids=input_ids,
	        attention_mask=attention_mask,
	        loss_mask=loss_mask,
	        mode="coarse_fine",
	    )
	# result["logits"]: [B, S, V] text logits
	# result["loss"]: scalar cross-entropy loss
	```

	### Inference Modes

	| Mode | Description | Use Case |
	|------|-------------|----------|
	| `coarse_only` | Single static-query pass | Fastest; no foveation |
	| `coarse_fine` | Two-pass parallel forward | Best benchmark scores; uses foveated attention |
	| `autoregressive` | Sequential with KV cache | Sequential inference; within 1 point of `coarse_fine` on the benchmarks above |

	## License

	Apache 2.0