---
license: apache-2.0
language:
  - en
tags:
  - vision-language
  - video-understanding
  - foveated-attention
  - multimodal
  - smollm2
  - dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
  - name: fVLM-1.7B
    results:
      - task:
          type: video-question-answering
          name: Video Question Answering
        dataset:
          type: MVBench
          name: MVBench
        metrics:
          - type: accuracy
            value: 30.8
            name: Accuracy (coarse_only)
      - task:
          type: video-question-answering
          name: Video Question Answering
        dataset:
          type: Video-MME
          name: Video-MME
        metrics:
          - type: accuracy
            value: 30.5
            name: Accuracy (coarse_only)
      - task:
          type: question-answering
          name: Science Question Answering
        dataset:
          type: ScienceQA
          name: ScienceQA
        metrics:
          - type: accuracy
            value: 49.0
            name: Accuracy (coarse_only)
---

# fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.

## Model Description

**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.

### Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |

### How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:

1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
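
The deep-query loop above can be sketched in PyTorch. This is a minimal illustration, not the repo's `model.py`: the class name, residual update, and head count are assumptions; only the shapes (query dim 384, LLM dim 2048, 12 layers) come from the architecture table.

```python
import torch
import torch.nn as nn

class DeepQueryAttention(nn.Module):
    """Illustrative sketch: one learned query attends to cached patch K/V
    at every vision-encoder layer, yielding a single visual token."""
    def __init__(self, query_dim=384, llm_dim=2048, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(query_dim, num_heads=6, batch_first=True)
            for _ in range(num_layers)
        )
        self.proj = nn.Linear(query_dim, llm_dim)  # map to LLM embedding space

    def forward(self, query, patch_kv):
        # query: [B, 1, query_dim]; patch_kv: one [B, P, query_dim] per layer
        for attn, kv in zip(self.layers, patch_kv):
            out, _ = attn(query, kv, kv)  # single query attends to all patches
            query = query + out           # residual update through the stack
        return self.proj(query)           # [B, 1, llm_dim]: one visual token

fov = DeepQueryAttention()
q = torch.randn(2, 1, 384)                           # queries for 2 frames
kvs = [torch.randn(2, 256, 384) for _ in range(12)]  # cached patch features
token = fov(q, kvs)                                  # [2, 1, 2048]
```

In step 4, the LLM's hidden state would produce the next frame's `q`, closing the feedback loop.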

### Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |

## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |

### fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |

**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).

## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.

### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
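
The converging schedule can be sketched as interpolation toward a shared end LR: the connector decays from 1e-3 while the backbone warms from 1e-5, both meeting at 3e-5. This is an illustration of the idea only; whether `train.py` actually uses linear or cosine decay is an assumption here.

```python
def converging_lr(step, total_steps, start_lr, end_lr=3e-5):
    """Linearly interpolate from a component's start LR to the shared end LR.
    (Illustrative: the exact decay shape used in Stage 1 is assumed.)"""
    t = min(step / total_steps, 1.0)
    return start_lr + t * (end_lr - start_lr)

total = 31_250  # Stage 1 step count
connector_final = converging_lr(total, total, start_lr=1e-3)  # -> ~3e-5
backbone_final = converging_lr(total, total, start_lr=1e-5)   # -> ~3e-5
```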

### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: Flat 3e-5 all components with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
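
Answer-only masking of this kind is conventionally implemented by setting non-answer targets to PyTorch's `ignore_index`. A minimal sketch (the mask construction is illustrative, not the repo's `data.py` logic):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # PyTorch's conventional ignore_index

def answer_only_targets(input_ids, answer_mask):
    """answer_mask is 1 on assistant-answer tokens, 0 on user/system tokens;
    masked positions contribute nothing to the cross-entropy loss."""
    targets = input_ids.clone()
    targets[answer_mask == 0] = IGNORE_INDEX
    return targets

logits = torch.randn(1, 6, 100)            # [B, S, V]
ids = torch.tensor([[5, 9, 2, 7, 3, 1]])
mask = torch.tensor([[0, 0, 0, 1, 1, 1]])  # only the last 3 tokens are answer
loss = F.cross_entropy(
    logits.view(-1, 100),
    answer_only_targets(ids, mask).view(-1),
    ignore_index=IGNORE_INDEX,
)
```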

### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 all components
- **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
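
The DPO objective with `beta=0.1` is the standard log-sigmoid preference margin; a self-contained sketch over per-sequence log-probs (the tensor values below are toy numbers, not real model outputs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy's chosen-vs-rejected log-prob gap minus the reference model's gap."""
    margin = (policy_chosen_lp - policy_rejected_lp) \
           - (ref_chosen_lp - ref_rejected_lp)
    return -F.logsigmoid(beta * margin).mean()

# Policy prefers the chosen response more strongly than the reference does
# (margin = 4 - 2 = 2), so the loss drops below the log(2) starting point.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-8.0]))
```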

## Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.

2. **Stage 2 OOM skip-rate fix**: During Stage 2 SFT, batches that triggered out-of-memory errors were silently skipped at a high rate, reducing the amount of training data the model actually saw. Fixed by improving memory handling so the skip rate stays low.

3. **Benchmark letter-bias fix**: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
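
The first fix is easy to reproduce in isolation. In the sketch below, `eos_token_id = 0` is a hypothetical value chosen to force the collision; PyTorch's conventional `ignore_index=-100` can never collide with a real token id.

```python
import torch
import torch.nn.functional as F

# If ignore_index equals the EOS token id, every EOS target is silently
# dropped from the loss and the model never learns to stop generating.
eos_token_id = 0                       # hypothetical tokenizer EOS id
targets = torch.tensor([7, 3, eos_token_id])
logits = torch.randn(3, 10)

buggy = F.cross_entropy(logits, targets, ignore_index=eos_token_id)  # EOS masked
fixed = F.cross_entropy(logits, targets, ignore_index=-100)          # EOS scored
```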

## Files

| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |

## Usage

### Setup

```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")

# Build model
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

### Image Input

**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution.

```python
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                      # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)   # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]
```

### Video Input

For video, sample up to 64 frames uniformly. No replication needed.

```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
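
Uniformly sampling up to 64 frames can be done with a small index helper (illustrative; the repo's `data.py` may sample differently):

```python
import torch

def uniform_frame_indices(num_total, num_sample=64):
    """Pick up to num_sample frame indices spread evenly across the video,
    always including the first and last frame when subsampling."""
    if num_total <= num_sample:
        return list(range(num_total))
    return torch.linspace(0, num_total - 1, num_sample).long().tolist()

idx = uniform_frame_indices(1800)  # e.g. a 60 s clip at 30 fps -> 64 indices
```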

### Inference

```python
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",       # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```

## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```

## License

Apache 2.0