---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
results:
- task:
type: video-question-answering
name: Video Question Answering
dataset:
type: MVBench
name: MVBench
metrics:
- type: accuracy
value: 30.8
name: Accuracy (coarse_only)
- task:
type: video-question-answering
name: Video Question Answering
dataset:
type: Video-MME
name: Video-MME
metrics:
- type: accuracy
value: 30.5
name: Accuracy (coarse_only)
- task:
type: question-answering
name: Science Question Answering
dataset:
type: ScienceQA
name: ScienceQA
metrics:
- type: accuracy
value: 49.0
name: Accuracy (coarse_only)
---
# fVLM-1.7B (Foveated Vision-Language Model)
A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.
## Model Description
**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.
### Architecture
| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |
### How Foveated Attention Works
Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:
1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*
This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
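The four steps above can be sketched in PyTorch. This is a minimal illustration, not the released implementation: the module name `DeepQueryPooler`, the single attention head count, and the residual query update are assumptions; only the dimensions (query 384, LLM 2048, 12 DINO layers, 1 output token) come from the model card.

```python
import torch
import torch.nn as nn

class DeepQueryPooler(nn.Module):
    """Illustrative sketch: one query attends to cached patch K/V at
    every DINO layer, producing a single visual token per frame."""
    def __init__(self, query_dim=384, llm_dim=2048, num_layers=12):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, query_dim))  # static coarse query
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(query_dim, num_heads=6, batch_first=True)
            for _ in range(num_layers)
        )
        self.proj = nn.Linear(query_dim, llm_dim)  # project to LLM dimension

    def forward(self, patch_feats):
        # patch_feats: list of [B, N, query_dim] patch features, one per layer
        q = self.query.expand(patch_feats[0].size(0), -1, -1)
        for attn, kv in zip(self.attn, patch_feats):
            out, _ = attn(q, kv, kv)  # attend to this layer's patches
            q = q + out               # propagate the refined query deeper
        return self.proj(q)           # [B, 1, llm_dim] -> one visual token

pooler = DeepQueryPooler()
feats = [torch.randn(2, 256, 384) for _ in range(12)]  # 16x16 patches per frame
token = pooler(feats)
print(token.shape)  # torch.Size([2, 1, 2048])
```

In the full model, step 4 replaces the static query: the LLM's hidden state generates the next query, closing the feedback loop.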
### Inference Modes
fVLM supports three forward modes with different speed/quality tradeoffs:
| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
## Benchmark Results
### fVLM-1.7B (Stage 3 DPO)
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |
### fVLM-135M (Stage 3 DPO) — for comparison
| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |
**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).
## Training
Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
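The converging schedule can be sketched as two rates interpolating toward a common endpoint. The interpolation shape (linear here) is an assumption for illustration; the card only gives the start and end values.

```python
def converging_lr(step, total_steps, start, end):
    """Interpolate from `start` to `end` over training.
    Linear shape is an assumption; only the endpoints are documented."""
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

total = 31_250
# Connector decays from 1e-3; backbone warms from 1e-5; both meet at 3e-5
connector_lr = converging_lr(total, total, 1e-3, 3e-5)
backbone_lr = converging_lr(total, total, 1e-5, 3e-5)
```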
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: Flat 3e-5 all components with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
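Answer-only cross-entropy can be sketched by masking all non-answer positions with an ignore index. The helper below is illustrative (the real pipeline derives the answer span from the chat template); `-100` is the conventional value ignored by `F.cross_entropy` and, unlike a token ID, cannot collide with EOS.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # ignored by F.cross_entropy; cannot collide with a token id

def answer_only_labels(input_ids, answer_start):
    """Mask user/system tokens so only answer tokens contribute to the loss.
    `answer_start` is a hypothetical precomputed offset for illustration."""
    labels = input_ids.clone()
    labels[:, :answer_start] = IGNORE_INDEX
    return labels

input_ids = torch.tensor([[5, 8, 2, 9, 4, 7]])
labels = answer_only_labels(input_ids, answer_start=3)
logits = torch.randn(1, 6, 32)
# Shift so position t predicts token t+1 (standard causal LM loss)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 32),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,
)
```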
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 all components
- **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
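The Stage 3 objective is the standard DPO loss: the log-sigmoid of the beta-scaled difference in policy-vs-reference log-ratios between chosen and rejected responses. A minimal sketch with the card's beta=0.1 (the inputs are summed response log-probabilities; the toy values are made up):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective: reward margin is the beta-scaled gap
    between policy and reference log-ratios of chosen vs rejected."""
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# Toy log-probs: policy favors the chosen response more than the reference does
loss = dpo_loss(
    torch.tensor([-10.0]), torch.tensor([-14.0]),
    torch.tensor([-11.0]), torch.tensor([-13.0]),
)
```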
## Bug Fixes in This Version
This release includes several important bug fixes over earlier checkpoints:
1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
2. **Stage 2 OOM skip rate fix**: During Stage 2 SFT training, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the training data seen. Fixed to properly handle memory management and reduce skip rate.
3. **Benchmark letter-bias fix**: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
## Files
| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |
## Usage
### Setup
```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
# Build model
from model import FoveatedVLM
model = FoveatedVLM(
llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
dino_name="facebook/dinov2-small",
query_dim=384,
visual_scale=0.14,
deep_query=True,
)
# Load weights (weights_only=False is needed for this pickled checkpoint;
# only load checkpoints from sources you trust)
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
# Standard DINO preprocessing
frame_transform = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
### Image Input
**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match training distribution.
```python
from PIL import Image
img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img) # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1) # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16) # [1, 8, 3, 224, 224]
```
### Video Input
For video, sample up to 64 frames uniformly. No replication needed.
```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
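A simple way to pick the frames is uniform index sampling over the video. This helper is illustrative; the card only states "sample up to 64 frames uniformly", so the rounding strategy is an assumption.

```python
import torch

def sample_frame_indices(num_frames, max_frames=64):
    """Return up to `max_frames` indices spread uniformly over the video."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    idx = torch.linspace(0, num_frames - 1, max_frames)
    return idx.round().long().tolist()

print(sample_frame_indices(10))        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(sample_frame_indices(900)))  # 64
```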
### Inference
```python
messages = [
{"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
result = model(
frames=frames,
input_ids=input_ids,
attention_mask=attention_mask,
loss_mask=loss_mask,
mode="coarse_fine", # or "coarse_only" or "autoregressive"
)
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```
## Citation
If you use this model, please cite:
```bibtex
@misc{fvlm2025,
title={fVLM: Foveated Vision-Language Model},
author={Sandeep Sampath Kumar},
year={2025},
url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
## License
Apache 2.0