Update model card: training info, bug fixes, benchmark status
README.md
CHANGED
@@ -17,45 +17,11 @@ pipeline_tag: image-text-to-text
 
 A vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.
 
-##
-
-### Video Benchmarks
-
-| Benchmark | fVLM-135M | fVLM-1.7B | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
-|-----------|:---------:|:---------:|:------------:|:------------:|:------------:|
-| **MVBench** (3800 MCQ) | 28.0% | 31.8% | 32.7% | 39.7% | 46.3% |
-| **Video-MME** (2700 MCQ) | 29.5% | 30.2% | 33.7% | 42.2% | 52.1% |
-
-### Image Benchmarks
-
-| Benchmark | fVLM-135M | fVLM-1.7B | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
-|-----------|:---------:|:---------:|:------------:|:------------:|:------------:|
-| **ScienceQA** (2017 MCQ) | 36.0% | 51.5% | 73.8% | 80.0% | 89.6% |
-
-> **Key context**: fVLM uses **1 visual token per frame** vs SmolVLM2's 64-256 tokens per image. fVLM-1.7B has ~1.8B params total — smaller than SmolVLM2-2.2B but with extreme visual compression.
-
-### Results by Inference Mode
-
-fVLM supports three inference modes with different speed/quality tradeoffs:
-
-| Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
-|-----------|:----------:|:-----------:|:--------------:|
-| MVBench | 31.8% | 31.5% | 30.2% |
-| Video-MME | 28.8% | 30.2% | 29.7% |
-| ScienceQA | 51.5% | 47.1% | 46.2% |
-
-- **Coarse-Only**: Single static-query pass (fastest, no foveation)
-- **Coarse→Fine**: Two-pass parallel forward (training mode, with foveated attention)
-- **Autoregressive**: Sequential inference with KV cache (highest quality)
-
-- **Scale-up from 135M→1.7B**: Larger LLM backbone improves reasoning across all benchmarks
-- **ScienceQA**: Shows the benefit of a stronger language backbone for reasoning tasks
-- **Efficiency**: Despite using only 1 visual token per frame, fVLM-1.7B narrows the gap with multi-token VLMs
-
-## Architecture
+## Model Description
+
+**fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.
+
+### Architecture
 
 | Component | Details |
 |-----------|---------|
@@ -79,29 +45,68 @@ Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA)
 
 This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
 
-## Training
+## Training
 
-Trained on a single A100-80GB GPU.
+Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a single A100-80GB GPU. **Total training time: ~16 hours.**
 
 ### Stage 1: Visual Alignment (4.3h, 31250 steps)
+- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
 - **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
 - **Loss**: Full-text cross-entropy (predict all tokens)
 - **LR**: Converging schedule — connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
 - **Batch size**: 32
 
 ### Stage 2: Vision-Language SFT (9.5h, 31250 steps)
+- **Objective**: Supervised fine-tuning on vision-language tasks
 - **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
 - **Loss**: Answer-only cross-entropy (mask user/system tokens)
 - **LR**: Flat 3e-5 all components with cosine decay
 - **Batch size**: 32, gradient checkpointing enabled
 
-### Stage 3: DPO (1.9h, 2593 steps)
+### Stage 3: DPO Preference Optimization (1.9h, 2593 steps)
+- **Objective**: Align outputs with human preferences
 - **Data**: RLAIF-V (83K preference pairs)
 - **Loss**: DPO with beta=0.1
 - **LR**: 5e-7 all components
 - **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
 
-
+## Benchmark Results
+
+> **Benchmarks are currently running and results will be updated shortly.**
+>
+> Previous benchmark numbers had known issues (see Bug Fixes below) and are being re-evaluated with corrected code.
+
+### Inference Modes
+
+fVLM supports three inference modes with different speed/quality tradeoffs:
+
+| Mode | Description | Use Case |
+|------|-------------|----------|
+| `coarse_only` | Single static-query pass | Fastest; good for images |
+| `coarse_fine` | Two-pass parallel forward | Best overall; uses foveated attention |
+| `autoregressive` | Sequential with KV cache | Highest quality for video |
+
+## Bug Fixes in This Version
+
+This release includes several important bug fixes:
+
+1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
+
+2. **Stage 2 OOM skip rate fix**: During Stage 2 SFT training, out-of-memory errors on large batches were being silently skipped at a high rate, effectively reducing the training data seen. Fixed to properly handle memory management and reduce the skip rate.
+
+3. **Benchmark letter-bias fix**: The benchmark evaluation code had a bias toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
+
+## Files
+
+| File | Description |
+|------|-------------|
+| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) — PyTorch format |
+| `model.safetensors` | Model weights in safetensors format (previous version) |
+| `model.py` | Full model architecture code |
+| `train.py` | Training script (all 3 stages) |
+| `data.py` | Data loading and preprocessing |
+| `benchmark.py` | Benchmark evaluation code |
+| `logger.py` | Logging utilities |
 
 ## Usage
 
@@ -114,9 +119,9 @@ from transformers import AutoTokenizer
 from huggingface_hub import hf_hub_download
 
 # Download checkpoint
-ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "
+ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
 
-# Build model
+# Build model
 from model import FoveatedVLM
 
 model = FoveatedVLM(
@@ -128,9 +133,8 @@ model = FoveatedVLM(
 )
 
 # Load weights
-
-
-model.load_state_dict(state_dict)
+ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
 model = model.to("cuda").to(torch.bfloat16).eval()
 
 tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
@@ -170,7 +174,7 @@ frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
 
 ```python
 messages = [
-    {
+    {"role": "user", "content": "Describe what is happening in this image."},
 ]
 text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
@@ -189,21 +193,6 @@ with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
 # result["loss"]: scalar cross-entropy loss
 ```
 
-### Inference Modes
-
-| Mode | Description | Use Case |
-|------|-------------|----------|
-| `coarse_only` | Single static-query pass | Fastest; good for images |
-| `coarse_fine` | Two-pass parallel forward | Best overall; uses foveated attention |
-| `autoregressive` | Sequential with KV cache | Highest quality for video |
-
-## Config Files
-
-Training configs included:
-- `configs/stage1_1.7B.yaml` — Visual alignment
-- `configs/stage2_1.7B.yaml` — Vision-language SFT
-- `configs/stage3_1.7B.yaml` — DPO preference optimization
-
 ## License
 
 Apache 2.0
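The compression story in the updated Model Description is easy to sanity-check with quick arithmetic; the per-frame token counts below come from the card itself, and the rest is illustrative:

```python
# Visual-token budget for a 64-frame video, using the per-frame token
# counts quoted in the model card above.
frames = 64

fvlm_tokens = frames * 1             # fVLM: 1 visual token per frame
smolvlm2_min = frames * 64           # SmolVLM2: 64-256 tokens per image
smolvlm2_max = frames * 256

print(f"fVLM: {fvlm_tokens} visual tokens")            # 64
print(f"SmolVLM2: {smolvlm2_min}-{smolvlm2_max} tokens")  # 4096-16384
# 64 frames under fVLM cost roughly what a single image costs SmolVLM2.
```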
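Stage 1's "converging schedule" (connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5) is worth sketching: the randomly initialized connector starts hot and cools down while the pretrained backbone starts cold and warms up, meeting at a common rate. A minimal version, assuming linear interpolation over the 31250 steps (the actual curve in `train.py` may differ):

```python
# Converging LR schedule sketch: connector decays, backbone warms up,
# both meeting at 3e-5. Linear interpolation is an assumption here.
def converging_lr(step: int, total_steps: int = 31250) -> dict:
    t = min(step / total_steps, 1.0)
    lerp = lambda start, end: start + t * (end - start)
    return {
        "connector": lerp(1e-3, 3e-5),
        "backbone": lerp(1e-5, 3e-5),
    }

print(converging_lr(0))       # {'connector': 0.001, 'backbone': 1e-05}
print(converging_lr(31250))   # both rates converge to 3e-05
```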
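Stage 3's "DPO with beta=0.1" refers to the standard direct-preference-optimization objective over the RLAIF-V pairs; a self-contained sketch of that loss (the actual implementation in `train.py` is not shown in the diff and may differ):

```python
# Standard DPO loss: widen the policy's log-probability margin between the
# chosen and rejected answers, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta: float = 0.1) -> torch.Tensor:
    # Each argument is a tensor of sequence log-probabilities, one per pair.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # margin favors the chosen answer, so loss falls below log(2)
```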
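The first bug fix (the `eos_token` / `ignore_index` collision) is the subtle one: if the masking sentinel equals the EOS token ID, every EOS target is silently dropped from the loss and the model never learns to stop. A minimal reproduction, assuming a standard PyTorch cross-entropy setup rather than the exact code in `train.py`:

```python
import torch
import torch.nn.functional as F

vocab_size, eos_id = 49152, 2   # illustrative values, not read from the repo
logits = torch.randn(4, vocab_size)
labels = torch.tensor([11, 12, 13, eos_id])  # sequence ending in EOS

# Buggy: ignore_index collides with the EOS id, so the EOS position
# contributes nothing to the loss or its gradients.
buggy = F.cross_entropy(logits, labels, ignore_index=eos_id)

# Fixed: a sentinel like -100 can never collide with a real token id.
fixed = F.cross_entropy(logits, labels, ignore_index=-100)
```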
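The Files table lists both the new `checkpoint.pt` and the previous `model.safetensors`. Loading the older safetensors weights instead of the Usage section's `torch.load` path might look like this sketch (the key layout is assumed to match `FoveatedVLM`, which is not verified here):

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Previous-version weights in safetensors format (no pickle involved).
st_path = hf_hub_download("sanps/fVLM-1.7B", "model.safetensors")
state_dict = load_file(st_path)

# `model` is the FoveatedVLM instance built as in the Usage section.
model.load_state_dict(state_dict)
```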
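Finally, the three inference modes from the table (`coarse_only`, `coarse_fine`, `autoregressive`) presumably map to a mode switch at generation time. A hypothetical sketch; the real generation entry point lives in `model.py` and its signature may differ:

```python
import torch

# Hypothetical mode switch: the mode names come from the table above, but
# this generate() signature is an assumption, not taken from model.py.
with torch.no_grad():
    for mode in ("coarse_only", "coarse_fine", "autoregressive"):
        output_ids = model.generate(
            input_ids=input_ids,   # prompt tokens from the Usage section
            frames=frames,         # (1, T, C, H, W) video tensor
            mode=mode,             # speed/quality tradeoff per the table
            max_new_tokens=128,
        )
        print(mode, tokenizer.decode(output_ids[0], skip_special_tokens=True))
```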