---
license: apache-2.0
language:
- en
tags:
- vision-language
- video-understanding
- foveated-attention
- multimodal
- smollm2
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
---

# fVLM-135M (Foveated Vision-Language Model)

A compact vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.

## Benchmark Results

### Video Benchmarks

| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|-----------|:---------:|:------------:|:------------:|:------------:|
| **MVBench** (3800 MCQ) | 28.0% | 32.7% | 39.7% | 46.3% |
| **Video-MME** (2700 MCQ) | 29.5% | 33.7% | 42.2% | 52.1% |

### Image Benchmarks

| Benchmark | fVLM-135M | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
|-----------|:---------:|:------------:|:------------:|:------------:|
| **ScienceQA** (2017 MCQ) | 36.0% | 73.8% | 80.0% | 89.6% |

> **Key context**: fVLM-135M uses **1 visual token per frame**, versus 64-256 tokens per image for SmolVLM2. At 158M total parameters it is about 1.6x smaller than SmolVLM2-256M. Its gap to SmolVLM2-256M on the video benchmarks (4.2 to 4.7 percentage points) is modest given the extreme compression.

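To make the compression concrete, the sketch below counts the visual tokens the language model would see for a clip of a given length. The per-image figures (1, 64, 256) come from the note above. Treating every video frame as one image for the larger budgets is an assumption made for illustration, not a description of how SmolVLM2 actually handles video.

```python
def visual_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Visual tokens fed to the language model for one clip."""
    return num_frames * tokens_per_frame

# Token counts for a 64-frame clip (illustrative; assumes one image per frame).
budgets = {
    "fVLM (1 token/frame)": visual_tokens(64, 1),
    "64 tokens/frame": visual_tokens(64, 64),
    "256 tokens/frame": visual_tokens(64, 256),
}
for name, count in budgets.items():
    print(f"{name}: {count} visual tokens")
```
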
### Results by Inference Mode

fVLM supports three inference modes with different speed/quality tradeoffs:

| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
|-----------|:----------:|:-----------:|:--------------:|
| MVBench | 27.4% | **28.0%** | 27.9% |
| Video-MME | 26.2% | **29.5%** | 28.7% |
| ScienceQA | 34.7% | **36.0%** | **36.0%** |

- **Coarse-Only**: Single static-query pass (fastest, no foveation)
- **Coarse→Fine**: Two-pass parallel forward (the training-time mode, with foveated attention); highest score on every benchmark in the table (tied on ScienceQA)
- **Autoregressive**: Sequential inference with KV cache; within 1 point of coarse→fine on every benchmark in the table

### Analysis

- **Foveation helps on video**: coarse→fine gains 3.3 points on Video-MME over coarse-only, which suggests that the learned "where to look" queries improve video understanding
- **ScienceQA**: Best at 36.0% with the `coarse_fine` and `autoregressive` modes (34.7% for coarse-only), so foveated attention provides a small benefit even on static images replicated to 8 frames
- **Scale gap**: The large gap on ScienceQA (36.0% vs 73.8% for SmolVLM2-256M) indicates that the 135M backbone limits image reasoning. The video benchmarks are closer (a gap of 4 to 5 points), likely because foveated compression is well suited to temporal tasks

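The per-mode gains can be recomputed directly from the results table. The small check below uses only the accuracies reported there; the helper name `gain_over_coarse` is made up for this example.

```python
# Accuracy (%) per inference mode, copied from the results table.
results = {
    "MVBench":   {"coarse_only": 27.4, "coarse_fine": 28.0, "autoregressive": 27.9},
    "Video-MME": {"coarse_only": 26.2, "coarse_fine": 29.5, "autoregressive": 28.7},
    "ScienceQA": {"coarse_only": 34.7, "coarse_fine": 36.0, "autoregressive": 36.0},
}

def gain_over_coarse(benchmark: str, mode: str) -> float:
    """Percentage-point gain of `mode` over coarse-only, rounded to 0.1."""
    row = results[benchmark]
    return round(row[mode] - row["coarse_only"], 1)

for name in results:
    print(name, gain_over_coarse(name, "coarse_fine"), gain_over_coarse(name, "autoregressive"))
```
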
## Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | SmolLM2-135M-Instruct |
| **Vision Encoder** | DINOv2-small |
| **Attention** | Deep query-guided foveated cross-attention |
| **Visual Tokens** | 1 token per frame (query-compressed) |
| **Total Parameters** | 157.6M |
| **Query Dimension** | 384 |
| **Visual Scale** | 0.14 |

### How Foveated Attention Works

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:

1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
3. The single output token is projected to the LLM dimension and prepended to the text sequence
4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.

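Steps 1 and 2 can be illustrated with a toy, dependency-free sketch of single-query cross-attention repeated across layers. Everything in it is an assumption made for illustration: the residual update, the absence of learned projections, layer norms and multiple heads, and the names `attend` and `deep_query` are hypothetical and not taken from the fVLM code. The point is only that, however many patches a frame has, the result is one vector.

```python
import math
import random

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)                      # subtract max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def deep_query(query, cached_kv):
    """Pass one query through every layer's cached (keys, values)."""
    for keys, values in cached_kv:
        update = attend(query, keys, values)
        query = [q + u for q, u in zip(query, update)]   # assumed residual update
    return query                                          # one vector per frame

def rand_matrix(rows, cols):
    """Random stand-in for cached patch keys or values."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
dim, num_patches, num_layers = 8, 16, 12   # toy sizes; 12 layers as in DINOv2-small
cached_kv = [(rand_matrix(num_patches, dim), rand_matrix(num_patches, dim))
             for _ in range(num_layers)]
token = deep_query([0.0] * dim, cached_kv)
print(len(token))  # 8: a single dim-sized token, independent of num_patches
```

A real implementation would operate on batched PyTorch tensors and would feed the LLM-generated query from step 4 in place of the zero vector used here.
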
## Training Pipeline

### Stage 1: Visual Alignment
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule: the connector decays from 1e-3 to 3e-5 while the backbone rises from 1e-5 to 3e-5

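One way to picture the converging schedule is as two per-group learning rates that meet at the same final value. The sketch below uses the Stage 1 endpoints listed above, but the geometric interpolation, the function name `converging_lr` and the 1000-step run length are assumptions; the card does not state the actual curve shape.

```python
def converging_lr(start: float, end: float, progress: float) -> float:
    """Geometric interpolation from `start` to `end` for progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return start * (end / start) ** progress

# Stage 1 endpoints from the list above; the interpolation shape is assumed.
groups = {"connector": (1e-3, 3e-5), "backbone": (1e-5, 3e-5)}

for step in (0, 500, 1000):
    progress = step / 1000          # hypothetical 1000-step run
    lrs = {name: converging_lr(a, b, progress) for name, (a, b) in groups.items()}
    print(step, {name: f"{lr:.2e}" for name, lr in lrs.items()})
```
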
### Stage 2: Vision-Language SFT
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: Uniform 3e-5 across all components, with cosine decay

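The difference between the Stage 1 and Stage 2 losses comes down to which positions contribute to the average. Below is a minimal pure-Python sketch of a masked cross-entropy, assuming a per-token `loss_mask` of 0 for user/system tokens and 1 for answer tokens. The function name and the toy data are illustrative, and actual training code would use batched PyTorch tensors instead.

```python
import math

def masked_cross_entropy(logits, targets, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    logits: per-position lists of vocabulary scores; targets: token ids.
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue                       # masked position: excluded from the loss
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_z - scores[target]    # -log softmax(scores)[target]
        count += 1
    return total / max(count, 1)

# Toy sequence: two prompt positions (masked out) followed by two answer positions.
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
targets = [1, 2, 2, 0]
answer_only = masked_cross_entropy(logits, targets, [0, 0, 1, 1])
full_text = masked_cross_entropy(logits, targets, [1, 1, 1, 1])
print(round(answer_only, 4), round(full_text, 4))
```

With an all-ones mask the same function reproduces the full-text loss of Stage 1.
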
### Stage 3: DPO (Direct Preference Optimization)
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 1e-6 all components

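For reference, the standard DPO objective for a single preference pair can be written in a few lines. This is a sketch of the textbook formula with beta=0.1 as listed above, not the training code used for fVLM. The inputs are assumed to be summed sequence log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) computed as softplus(-z) in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# The policy favours the chosen answer more than the reference does: loss < log(2).
print(dpo_loss(policy_chosen=-10.0, policy_rejected=-20.0,
               ref_chosen=-12.0, ref_rejected=-15.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2).
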
## Usage

### Setup

```python
import torch
from torchvision import transforms
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from release.model import FoveatedVLM  # model code from the fVLM source tree

# Download checkpoint
ckpt_path = hf_hub_download("sanps/fVLM-135M", "model.safetensors")

# Build model
model = FoveatedVLM(
    llm_name="HuggingFaceTB/SmolLM2-135M-Instruct",
    dino_name="facebook/dinov2-small",
    query_dim=384,
    visual_scale=0.14,
    deep_query=True,
)

# Load weights (.safetensors files are read with safetensors, not torch.load)
state_dict = load_file(ckpt_path, device="cpu")
model.load_state_dict(state_dict)
model = model.to("cuda").to(torch.bfloat16).eval()

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

# Standard DINO preprocessing: 224x224 resize plus ImageNet mean/std normalization
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```

### Image Input

**Important**: fVLM treats all inputs as video. Static images must be **replicated to 8 frames** to match the training distribution (Stages 2 and 3 used `replicate_image_frames: 8`). Passing an image as a single frame will produce degraded results.

```python
from PIL import Image

img = Image.open("photo.jpg").convert("RGB")
frame_tensor = frame_transform(img)  # [3, 224, 224]
frames = frame_tensor.unsqueeze(0).repeat(8, 1, 1, 1)  # [8, 3, 224, 224]: replicate to 8 frames
frames = frames.unsqueeze(0).to("cuda", dtype=torch.bfloat16)  # [1, 8, 3, 224, 224]
```

### Video Input

For video, sample up to 64 frames uniformly. No replication is needed.

```python
# video_frames: list of PIL Images (sampled from video)
tensors = [frame_transform(f) for f in video_frames]
frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
# frames shape: [1, T, 3, 224, 224] where T = number of frames (1-64)
```

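Uniform sampling amounts to choosing evenly spaced frame indices before decoding. The helper below is a hypothetical sketch: the name `uniform_frame_indices` and the choice to take the centre of each segment are assumptions, since the card does not say which convention was used during training or evaluation.

```python
def uniform_frame_indices(total_frames: int, max_frames: int = 64) -> list[int]:
    """Evenly spaced frame indices, at most `max_frames` of them."""
    if total_frames <= 0:
        return []
    count = min(total_frames, max_frames)
    # Take the centre of each of `count` equal-width segments.
    return [int((i + 0.5) * total_frames / count) for i in range(count)]

indices = uniform_frame_indices(total_frames=300, max_frames=64)
print(len(indices), indices[:4])
```

The resulting indices can then be used to pick frames from whichever video decoder is in use before building `video_frames`.
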
### Inference

```python
# Tokenize prompt
messages = [
    {"role": "user", "content": "Describe what is happening in this image."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
attention_mask = torch.ones_like(input_ids)
loss_mask = torch.ones_like(input_ids, dtype=torch.float32)

# Forward pass (coarse_fine mode recommended for best quality)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        frames=frames,
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```

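The forward pass above returns logits, not generated text. One simple way to turn logits into tokens is greedy decoding, sketched here around a hypothetical `next_logits(ids)` callable. In practice that callable would presumably wrap `model(...)` and return the last-position row of `result["logits"]`, but that wiring is an assumption and is replaced by a dummy function so the example runs on its own.

```python
from typing import Callable, Sequence

def greedy_decode(
    next_logits: Callable[[list[int]], Sequence[float]],
    prompt_ids: list[int],
    eos_id: int,
    max_new_tokens: int = 32,
) -> list[int]:
    """Append the arg-max token until `eos_id` or the length limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids)          # scores for the next position only
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]

# Dummy stand-in for the model: always prefers token (last_id + 1) mod 5.
def toy_next_logits(ids: list[int]) -> list[float]:
    scores = [0.0] * 5
    scores[(ids[-1] + 1) % 5] = 1.0
    return scores

print(greedy_decode(toy_next_logits, prompt_ids=[0], eos_id=4))  # [1, 2, 3, 4]
```

This loop re-scores the whole sequence at every step; it does not use a KV cache.
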
### Inference Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; small accuracy drop on images (34.7% vs 36.0% on ScienceQA) |
| `coarse_fine` | Two-pass parallel forward | Best overall; uses foveated attention |
| `autoregressive` | Sequential with KV cache | Step-by-step generation; within 1 point of `coarse_fine` on the benchmarks above |

## License

Apache 2.0