---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
  results:
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: MVBench
      name: MVBench
    metrics:
    - type: accuracy
      value: 30.8
      name: Accuracy (coarse_only)
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: Video-MME
      name: Video-MME
    metrics:
    - type: accuracy
      value: 30.5
      name: Accuracy (coarse_only)
  - task:
      type: question-answering
      name: Science Question Answering
    dataset:
      type: ScienceQA
      name: ScienceQA
    metrics:
    - type: accuracy
      value: 49.0
      name: Accuracy (coarse_only)
---
# fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.

## Model Description

**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context-window budget that traditional VLMs use for a single image.
|
### Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |
|
### How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:

1. **DINOv2** encodes each frame into patch features and caches K/V at every layer.
2. A **query vector** is propagated through all 12 DINO layers, attending to the patch K/V at each layer (deep query attention).
3. The single output token is projected to the LLM dimension and prepended to the text sequence.
4. The **LLM generates the next query** from its hidden state, creating a feedback loop in which the model learns *where to look*.

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
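
The per-frame compression in steps 1-3 can be sketched in plain NumPy. This is an illustrative stand-in, not the repo's `model.py`: the shapes come from the architecture table, but the random weights, residual query update, and output projection are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_layers, num_patches, d_vis, d_llm = 12, 256, 384, 2048

# Stand-in per-layer patch keys/values, as cached by the vision encoder (step 1)
patch_kv = [
    (rng.standard_normal((num_patches, d_vis)), rng.standard_normal((num_patches, d_vis)))
    for _ in range(num_layers)
]

query = rng.standard_normal(d_vis)  # initial static query (or one generated by the LLM)
for k, v in patch_kv:  # step 2: propagate the query through every layer
    attn = softmax(query @ k.T / np.sqrt(d_vis))  # attention weights over patches
    query = query + attn @ v                      # residual update of the query

w_proj = rng.standard_normal((d_vis, d_llm)) * 0.02  # stand-in connector
visual_token = query @ w_proj  # step 3: a single token in LLM space
print(visual_token.shape)      # (2048,)
```

However many patches a frame has, the frame's entire contribution to the LLM context is this one 2048-dim vector.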
|
### Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
|
## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |

### fVLM-135M (Stage 3 DPO, for comparison)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |

**Scaling gain (1.7B vs 135M):** +3.4 pp on MVBench, +4.3 pp on Video-MME, +12.6 pp on ScienceQA (coarse-only).

## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
|
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule: connector 1e-3 → 3e-5, backbone 1e-5 → 3e-5
- **Batch size**: 32
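
A "converging schedule" of this shape can be sketched as each component annealing from its own starting LR toward the shared final LR. The linear interpolation below is an assumption for illustration; the actual schedule curve is defined in `train.py`.

```python
def converging_lr(step, total_steps, lr_start, lr_end):
    """Anneal linearly from a component-specific lr_start to the shared lr_end."""
    t = min(step / total_steps, 1.0)  # training progress in [0, 1]
    return lr_start + (lr_end - lr_start) * t

total = 31_250  # Stage 1 step count
for step in (0, total // 2, total):
    conn = converging_lr(step, total, 1e-3, 3e-5)  # connector: 1e-3 -> 3e-5
    back = converging_lr(step, total, 1e-5, 3e-5)  # backbone:  1e-5 -> 3e-5
    print(f"step {step:>6}: connector {conn:.2e}, backbone {back:.2e}")
```

Both parameter groups end at 3e-5, so the connector starts learning fast while the pretrained backbone moves slowly, and they meet by the end of alignment.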
|
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (user/system tokens masked)
- **LR**: 3e-5 for all components, with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
|
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta = 0.1
- **LR**: 5e-7 for all components
- **Batch size**: 8, gradient accumulation 4 (effective batch 32), gradient checkpointing enabled
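
For reference, the standard DPO objective used here is `-log σ(β · [(log πθ(y_w) - log πref(y_w)) - (log πθ(y_l) - log πref(y_l))])`. The log-probabilities below are hypothetical numbers for one preference pair, just to show the beta = 0.1 loss computed end to end:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical sequence log-probs: policy prefers the chosen answer more than
# the frozen reference model does, so the loss drops below log(2) ~= 0.693.
loss = dpo_loss(logp_chosen=-42.0, logp_rejected=-45.0,
                ref_logp_chosen=-44.0, ref_logp_rejected=-44.0, beta=0.1)
print(round(loss, 4))  # 0.5544
```

The small beta keeps the implicit KL penalty strong, so the DPO stage nudges preferences without drifting far from the Stage 2 model.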
|
## Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

1. **`eos_token` / `ignore_index` collision**: The EOS token ID collided with the `ignore_index` value used in the cross-entropy loss, so EOS positions were masked out of the loss and the model never properly learned to emit EOS. Fixed by using a non-colliding ignore index.

2. **Stage 2 OOM skip-rate fix**: During Stage 2 SFT, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the amount of training data seen. Fixed by improving memory management to reduce the skip rate.

3. **Benchmark letter-bias fix**: The benchmark evaluation code was biased toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
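
The first fix is easy to demonstrate in miniature. With a hypothetical vocabulary where `eos_token_id == 2`, configuring the loss with `ignore_index == 2` silently drops every EOS label from supervision:

```python
eos_token_id = 2                 # hypothetical EOS id
labels = [15, 87, 2, 41, 9, 2]   # target sequence; EOS appears at positions 2 and 5

def supervised_positions(labels, ignore_index):
    """Positions that actually contribute to the cross-entropy loss."""
    return [i for i, t in enumerate(labels) if t != ignore_index]

buggy = supervised_positions(labels, ignore_index=eos_token_id)  # the collision
fixed = supervised_positions(labels, ignore_index=-100)          # standard sentinel

print(buggy)  # [0, 1, 3, 4] -- both EOS positions dropped from the loss
print(fixed)  # [0, 1, 2, 3, 4, 5] -- EOS is supervised, so the model learns to stop
```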
|
## Files

| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593), PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |
|
## Usage

### Setup

```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download the checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")

# Build the model (model.py ships with this repo)
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights (the checkpoint stores more than tensors, hence weights_only=False)
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard ImageNet preprocessing, as used by DINOv2
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```
|
### Image Input

**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution.

```python
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                            # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)          # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]
```
|
### Video Input

For video, sample up to 64 frames uniformly. No replication is needed.

```python
# video_frames: a list of PIL images sampled uniformly from the clip
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```
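
One way to pick the uniformly spaced indices (a hypothetical helper, not part of `data.py`) is to spread up to 64 evenly over the clip, always including the first and last frame:

```python
def uniform_frame_indices(num_frames_total, max_frames=64):
    """Evenly spaced frame indices covering the whole clip, endpoints included."""
    n = min(num_frames_total, max_frames)
    if n <= 1:
        return [0]
    step = (num_frames_total - 1) / (n - 1)  # fractional stride between samples
    return [round(i * step) for i in range(n)]

print(uniform_frame_indices(300, max_frames=8))  # [0, 43, 85, 128, 171, 214, 256, 299]
```

The selected frames can then be decoded with any video reader and fed through `frame_transform` as above.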
|
### Inference

```python
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",  # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```
|
## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
|
## License

Apache 2.0
|