---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
results:
- task:
type: video-question-answering
name: Video Question Answering
dataset:
type: MVBench
name: MVBench
metrics:
- type: accuracy
value: 30.8
name: Accuracy (coarse_only)
- task:
type: video-question-answering
name: Video Question Answering
dataset:
type: Video-MME
name: Video-MME
metrics:
- type: accuracy
value: 30.5
name: Accuracy (coarse_only)
- task:
type: question-answering
name: Science Question Answering
dataset:
type: ScienceQA
name: ScienceQA
metrics:
- type: accuracy
value: 49.0
name: Accuracy (coarse_only)
---
# fVLM-1.7B (Foveated Vision-Language Model)
A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.
## Model Description
**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.
### Architecture
| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |
### How Foveated Attention Works
Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:
1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*
This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
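The single-token compression in steps 1–3 can be sketched as a learned query cross-attending over patch features. This is a toy, single-layer illustration only (the released model propagates the query through all 12 DINOv2 layers rather than one attention call; the class and dimensions here are assumptions for demonstration):

```python
import torch
import torch.nn as nn

class SingleTokenPooler(nn.Module):
    """Toy sketch: one learned query cross-attends over patch features
    and compresses a whole frame into a single token."""
    def __init__(self, dim=384, n_heads=6):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_feats):  # patch_feats: [B, N_patches, dim]
        q = self.query.expand(patch_feats.size(0), -1, -1)  # [B, 1, dim]
        out, _ = self.attn(q, patch_feats, patch_feats)
        return out  # [B, 1, dim] -- one visual token per frame

pooler = SingleTokenPooler()
token = pooler(torch.randn(2, 256, 384))  # 2 frames of 16x16 patches
```

In the full model, the output token is then projected from the query dimension (384) to the LLM dimension (2048), and the LLM's hidden state generates the next query, closing the feedback loop.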
### Inference Modes
fVLM supports three forward modes with different speed/quality tradeoffs:
| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
## Benchmark Results
### fVLM-1.7B (Stage 3 DPO)
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
### fVLM-135M (Stage 3 DPO) — for comparison
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |
**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
## Training
Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
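The "converging schedule" above can be sketched as two cosine-interpolated learning rates that meet at the same endpoint. The exact schedule shape used in training is not specified here, so the cosine form is an assumption:

```python
import math

def converging_lr(step, total_steps, start_lr, end_lr):
    """Cosine interpolation from start_lr to end_lr.
    Sketch only -- the repo's actual schedule may differ in shape."""
    t = 0.5 * (1 + math.cos(math.pi * step / total_steps))  # 1 -> 0
    return end_lr + (start_lr - end_lr) * t

# Connector decays 1e-3 -> 3e-5 while the backbone rises 1e-5 -> 3e-5,
# so both components converge to 3e-5 by the end of the 31,250 steps.
connector_lr = converging_lr(31_250, 31_250, 1e-3, 3e-5)
backbone_lr = converging_lr(31_250, 31_250, 1e-5, 3e-5)
```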
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: Flat 3e-5 all components with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
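Answer-only cross-entropy means the loss is averaged over answer tokens only, with user/system tokens masked out. A minimal sketch (function name and shapes are illustrative, not the repo's exact code):

```python
import torch
import torch.nn.functional as F

def answer_only_loss(logits, labels, loss_mask):
    """Cross-entropy over answer tokens only: positions where
    loss_mask == 0 (user/system tokens) are excluded from the average."""
    per_tok = F.cross_entropy(
        logits.transpose(1, 2),  # [B, V, S] as cross_entropy expects
        labels,                  # [B, S]
        reduction="none",
    )                            # [B, S] per-token loss
    return (per_tok * loss_mask).sum() / loss_mask.sum().clamp(min=1)
```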
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 all components
- **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
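The DPO objective with beta=0.1 rewards the policy for increasing the log-probability margin of the chosen response over the rejected one, relative to a frozen reference model. A sketch of the standard form (not the repo's exact training code; inputs are summed per-response log-probs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss. Each argument is a tensor of summed log-probs,
    one entry per preference pair in the batch."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

When the policy matches the reference exactly, the margin is zero and the loss starts at -log(0.5) = log 2; it shrinks as the policy pulls chosen and rejected responses apart.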
## Bug Fixes in This Version
This release includes several important bug fixes over earlier checkpoints:
1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
2. **Stage 2 OOM skip rate fix**: During Stage 2 SFT training, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the training data seen. Fixed to properly handle memory management and reduce skip rate.
3. **Benchmark letter-bias fix**: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
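Fix 1 is easy to reproduce in isolation. With hypothetical token ids (the real ids in the tokenizer differ), an `ignore_index` that collides with the EOS id silently zeroes out the loss at every EOS position:

```python
import torch
import torch.nn.functional as F

eos_id = 0  # hypothetical EOS id colliding with ignore_index
logits = torch.zeros(1, 4, 10)               # [B, S, V]
targets = torch.tensor([[5, 7, 3, eos_id]])  # sequence ends with EOS

# Colliding ignore_index: the EOS position contributes zero loss,
# so the model never learns to emit EOS.
bad = F.cross_entropy(logits.transpose(1, 2), targets,
                      ignore_index=eos_id, reduction="none")
# Non-colliding ignore_index (PyTorch's default -100): EOS is trained.
good = F.cross_entropy(logits.transpose(1, 2), targets,
                       ignore_index=-100, reduction="none")
```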
## Files
| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |
## Usage
### Setup
```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
# Build model (model.py ships in this repo; place it on your PYTHONPATH)
from model import FoveatedVLM
model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)
# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
### Image Input
**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution.
```python
from PIL import Image
img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img) # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1) # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16) # [1, 8, 3, 224, 224]
```
### Video Input
For video, sample up to 64 frames uniformly. No replication needed.
```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
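Uniform sampling itself can be done with evenly spaced indices. The helper name below is illustrative (not part of the released code):

```python
import numpy as np

def uniform_frame_indices(total_frames, max_frames=64):
    """Evenly spaced frame indices over the whole video, capped at
    max_frames. Videos shorter than max_frames keep every frame."""
    n = min(total_frames, max_frames)
    return np.linspace(0, total_frames - 1, n).round().astype(int).tolist()

# e.g. pick 64 of 1,000 frames, then index into the decoded video:
# video_frames = [all_frames[i] for i in uniform_frame_indices(1000)]
```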
### Inference
```python
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",  # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```
## Citation
If you use this model, please cite:
```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
## License
Apache 2.0
|