---
license: apache-2.0
language:
  - en
tags:
  - vision-language
  - video-understanding
  - foveated-attention
  - multimodal
  - smollm2
  - dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
  - name: fVLM-1.7B
    results:
      - task:
          type: video-question-answering
          name: Video Question Answering
        dataset:
          type: MVBench
          name: MVBench
        metrics:
          - type: accuracy
            value: 30.8
            name: Accuracy (coarse_only)
      - task:
          type: video-question-answering
          name: Video Question Answering
        dataset:
          type: Video-MME
          name: Video-MME
        metrics:
          - type: accuracy
            value: 30.5
            name: Accuracy (coarse_only)
      - task:
          type: question-answering
          name: Science Question Answering
        dataset:
          type: ScienceQA
          name: ScienceQA
        metrics:
          - type: accuracy
            value: 49.0
            name: Accuracy (coarse_only)
---

# fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.

## Model Description

**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.

### Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |

### How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:

1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
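
The deep-query loop above can be sketched in PyTorch. This is a minimal illustration, not the repo's `model.py`: the class name, residual update, and head count are assumptions; only the shapes (query dim 384, LLM dim 2048, 12 layers) come from the architecture table.

```python
import torch
import torch.nn as nn

class DeepQueryAttention(nn.Module):
    """Illustrative sketch: one learned query attends to cached patch K/V
    at every vision-encoder layer, yielding a single visual token."""
    def __init__(self, query_dim=384, llm_dim=2048, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(query_dim, num_heads=6, batch_first=True)
            for _ in range(num_layers)
        )
        self.proj = nn.Linear(query_dim, llm_dim)  # map to LLM embedding space

    def forward(self, query, patch_kv):
        # query: [B, 1, query_dim]; patch_kv: one [B, P, query_dim] per layer
        for attn, kv in zip(self.layers, patch_kv):
            out, _ = attn(query, kv, kv)  # single query attends to all patches
            query = query + out           # residual update through the stack
        return self.proj(query)           # [B, 1, llm_dim]: one visual token

fov = DeepQueryAttention()
q = torch.randn(2, 1, 384)                           # queries for 2 frames
kvs = [torch.randn(2, 256, 384) for _ in range(12)]  # cached patch features
token = fov(q, kvs)                                  # [2, 1, 2048]
```

In step 4, the LLM's hidden state would produce the next frame's `q`, closing the feedback loop.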

### Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |

## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |

### fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |

**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).

## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.

### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
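
The converging schedule can be sketched as interpolation toward a shared end LR: the connector decays from 1e-3 while the backbone warms from 1e-5, both meeting at 3e-5. This is an illustration of the idea only; whether `train.py` actually uses linear or cosine decay is an assumption here.

```python
def converging_lr(step, total_steps, start_lr, end_lr=3e-5):
    """Linearly interpolate from a component's start LR to the shared end LR.
    (Illustrative: the exact decay shape used in Stage 1 is assumed.)"""
    t = min(step / total_steps, 1.0)
    return start_lr + t * (end_lr - start_lr)

total = 31_250  # Stage 1 step count
connector_final = converging_lr(total, total, start_lr=1e-3)  # -> ~3e-5
backbone_final = converging_lr(total, total, start_lr=1e-5)   # -> ~3e-5
```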

### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: Flat 3e-5 all components with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
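
Answer-only masking of this kind is conventionally implemented by setting non-answer targets to PyTorch's `ignore_index`. A minimal sketch (the mask construction is illustrative, not the repo's `data.py` logic):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch's conventional ignore_index

def answer_only_targets(input_ids, answer_mask):
    """answer_mask is 1 on assistant-answer tokens, 0 on user/system tokens;
    masked positions contribute nothing to the cross-entropy loss."""
    targets = input_ids.clone()
    targets[answer_mask == 0] = IGNORE_INDEX
    return targets

logits = torch.randn(1, 6, 100)            # [B, S, V]
ids = torch.tensor([[5, 9, 2, 7, 3, 1]])
mask = torch.tensor([[0, 0, 0, 1, 1, 1]])  # only the last 3 tokens are answer
loss = F.cross_entropy(
    logits.view(-1, 100),
    answer_only_targets(ids, mask).view(-1),
    ignore_index=IGNORE_INDEX,
)
```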

### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 all components
- **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
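
The DPO objective with `beta=0.1` is the standard log-sigmoid preference margin; a self-contained sketch over per-sequence log-probs (the tensor values below are toy numbers, not real model outputs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy's chosen-vs-rejected log-prob gap minus the reference model's gap."""
    margin = (policy_chosen_lp - policy_rejected_lp) \
           - (ref_chosen_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()

# Policy prefers the chosen response more strongly than the reference does
# (margin = 4 - 2 = 2), so the loss drops below the log(2) starting point.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-8.0]))
```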

## Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.

2. **Stage 2 OOM skip-rate fix**: During Stage 2 SFT, batches that triggered out-of-memory errors were silently skipped at a high rate, reducing the amount of training data the model actually saw. Fixed by improving memory handling so the skip rate stays low.

3. **Benchmark letter-bias fix**: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
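
The first fix is easy to reproduce in isolation. In the sketch below, `eos_token_id = 0` is a hypothetical value chosen to force the collision; PyTorch's conventional `ignore_index=-100` can never collide with a real token id.

```python
import torch
import torch.nn.functional as F

# If ignore_index equals the EOS token id, every EOS target is silently
# dropped from the loss and the model never learns to stop generating.
eos_token_id = 0                       # hypothetical tokenizer EOS id
targets = torch.tensor([7, 3, eos_token_id])
logits = torch.randn(3, 10)

buggy = F.cross_entropy(logits, targets, ignore_index=eos_token_id)  # EOS masked
fixed = F.cross_entropy(logits, targets, ignore_index=-100)          # EOS scored
```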

## Files

| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |

## Usage

### Setup

```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")

# Build model
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

### Image Input

**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution.

```python
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                      # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)   # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]
```

### Video Input

For video, sample up to 64 frames uniformly. No replication needed.

```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
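
Uniformly sampling up to 64 frames can be done with a small index helper (illustrative; the repo's `data.py` may sample differently):

```python
import torch

def uniform_frame_indices(num_total, num_sample=64):
    """Pick up to num_sample frame indices spread evenly across the video,
    always including the first and last frame when subsampling."""
    if num_total <= num_sample:
        return list(range(num_total))
    return torch.linspace(0, num_total - 1, num_sample).long().tolist()

idx = uniform_frame_indices(1800)  # e.g. a 60 s clip at 30 fps -> 64 indices
```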

### Inference

```python
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",       # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```

## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```

## License

Apache 2.0