---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
  results:
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: MVBench
      name: MVBench
    metrics:
    - type: accuracy
      value: 30.8
      name: Accuracy (coarse_only)
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: Video-MME
      name: Video-MME
    metrics:
    - type: accuracy
      value: 30.5
      name: Accuracy (coarse_only)
  - task:
      type: question-answering
      name: Science Question Answering
    dataset:
      type: ScienceQA
      name: ScienceQA
    metrics:
    - type: accuracy
      value: 49.0
      name: Accuracy (coarse_only)
---

# fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.

## Model Description

**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context-window budget that traditional VLMs use for a single image.

### Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-1.7B-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | ~1.84B |
| **Query Dimension** | 384 |
| **LLM Dimension** | 2048 |
| **Visual Scale** | 0.14 |

### How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:
1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to the patch K/V at each layer (deep query attention)
3. The single output token is projected to the LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop in which the model learns *where to look*

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.

### Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |

## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench (3800) | 30.8% | 29.9% | 29.9% |
| Video-MME (2700) | 30.5% | 28.2% | 30.4% |
| ScienceQA (2017) | 49.0% | 43.8% | 46.6% |

### fVLM-135M (Stage 3 DPO) — for comparison

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| MVBench | 27.4% | 28.0% | 27.9% |
| Video-MME | 26.2% | 29.5% | 28.7% |
| ScienceQA | 36.4% | 35.6% | 35.4% |

**Scaling gain (1.7B vs 135M):** +3.4pp MVBench, +4.3pp Video-MME, +12.6pp ScienceQA (coarse-only).

## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
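For intuition, the per-frame query compression described in "How Foveated Attention Works" can be sketched in a few lines of PyTorch. This is a simplified illustration with made-up shapes and a hypothetical function name (`deep_query_attend`), not the shipped `model.py` code:

```python
import torch
import torch.nn.functional as F

def deep_query_attend(query, layer_keys, layer_values):
    """One foveated pass: a single query attends to the cached patch K/V
    at every vision-encoder layer, yielding one visual token per frame.

    query:        [B, 1, D]  current query vector (static, or fed back from the LLM)
    layer_keys:   list of per-layer patch keys,   each [B, P, D]
    layer_values: list of per-layer patch values, each [B, P, D]
    """
    d = query.size(-1)
    for k, v in zip(layer_keys, layer_values):
        # Scaled dot-product attention of the one query over P patches.
        attn = F.softmax(query @ k.transpose(-2, -1) / d**0.5, dim=-1)  # [B, 1, P]
        # Residual update; the refined query propagates to the next layer.
        query = query + attn @ v                                         # [B, 1, D]
    # The result is projected to the LLM dimension and prepended to the text.
    return query
```

Because each frame collapses to a single token here, 64 frames cost 64 visual tokens, which is why long videos fit in an ordinary context window.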
### Stage 1: Visual Alignment (4.3h, 31,250 steps)

- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule: connector 1e-3 → 3e-5, backbone 1e-5 → 3e-5
- **Batch size**: 32

### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)

- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (user/system tokens masked)
- **LR**: Flat 3e-5 for all components, with cosine decay
- **Batch size**: 32, gradient checkpointing enabled

### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)

- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta = 0.1
- **LR**: 5e-7 for all components
- **Batch size**: 8, gradient accumulation 4 (effective batch 32), gradient checkpointing enabled

## Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

1. **`eos_token` / `ignore_index` collision**: The EOS token ID collided with the `ignore_index` value used in the cross-entropy loss, so EOS targets were masked out and the model never learned to emit EOS properly. Fixed by switching to a non-colliding ignore index.
2. **Stage 2 OOM skip-rate fix**: During Stage 2 SFT, out-of-memory errors on large batches were silently skipped at a high rate, effectively reducing the training data seen. Fixed to handle memory properly and reduce the skip rate.
3. **Benchmark letter-bias fix**: The benchmark evaluation code was biased toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
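The first fix above matters because PyTorch's cross-entropy silently drops every target position equal to `ignore_index`; if that value collides with the EOS id, EOS positions never contribute gradient. A minimal illustration (the token ids here are made up for the example):

```python
import torch
import torch.nn.functional as F

EOS_ID = 2                                   # example EOS token id
logits = torch.randn(1, 4, 10)               # [batch, seq, vocab]
targets = torch.tensor([[5, 7, 9, EOS_ID]])  # last position is EOS

# Bug: ignore_index equal to the EOS id masks every EOS target,
# so the model gets no learning signal to stop generating.
buggy = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                        ignore_index=EOS_ID)

# Fix: use a sentinel that can never be a real token id (PyTorch's
# conventional default is -100), so EOS positions are trained normally.
fixed = F.cross_entropy(logits.flatten(0, 1), targets.flatten(),
                        ignore_index=-100)
```

Here `buggy` averages the loss over only three of the four positions, while `fixed` includes the EOS position.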
## Files

| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593), PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |

## Usage

### Setup

```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")

# Build model
from model import FoveatedVLM

model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

# Standard DINO preprocessing
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```

### Image Input

**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution.

```python
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)                            # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)          # [8, 3, 224, 224]
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]
```

### Video Input

For video, sample up to 64 frames uniformly. No replication is needed.
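One way to pick uniformly spaced frame indices is a small helper like the following (a sketch; how you decode the frames themselves depends on your video library):

```python
def sample_indices(total_frames: int, max_frames: int = 64) -> list[int]:
    """Return up to max_frames indices spread uniformly over the video.

    Each index is taken from the center of its segment, so short and
    long videos are covered evenly without bunching at the start.
    """
    n = min(total_frames, max_frames)
    return [int((i + 0.5) * total_frames / n) for i in range(n)]
```

The selected frames can then be transformed and stacked into the `frames` tensor exactly as in the snippet below.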
```python
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```

### Inference

```python
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",  # or "coarse_only" or "autoregressive"
    )

# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```

## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```

## License

Apache 2.0