tags:
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
  results:
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: MVBench
      name: MVBench
    metrics:
    - type: accuracy
      value: 30.8
      name: Accuracy (coarse_only)
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: Video-MME
      name: Video-MME
    metrics:
    - type: accuracy
      value: 30.5
      name: Accuracy (coarse_only)
  - task:
      type: question-answering
      name: Science Question Answering
    dataset:
      type: ScienceQA
      name: ScienceQA
    metrics:
    - type: accuracy
      value: 49.0
      name: Accuracy (coarse_only)
---

# fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.

## Model Description

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame into a single visual token.

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
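The memory claim follows directly from the token budget. A quick sketch of the arithmetic, using the 576-tokens-per-image LLaVA-style figure above as the baseline:

```python
def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the language model must attend over."""
    return num_frames * tokens_per_frame

fvlm_tokens = visual_token_count(64, 1)        # fVLM: 1 token per frame -> 64
baseline_tokens = visual_token_count(64, 576)  # LLaVA-style baseline -> 36864
ratio = baseline_tokens // fvlm_tokens         # 576x fewer visual tokens
```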

### Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
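As a rule of thumb, the mode can be chosen from the input type. A hypothetical helper (not part of the released API) that makes the tradeoff in the table concrete:

```python
def pick_mode(num_frames: int, prefer_speed: bool = False) -> str:
    """Choose a forward mode per the speed/quality tradeoffs above.

    Hypothetical helper: single images run fastest with 'coarse_only',
    while videos benefit from iterative foveation.
    """
    if num_frames <= 1 or prefer_speed:
        return "coarse_only"
    return "autoregressive"
```

The returned string matches the `mode` argument of the forward call shown under Usage.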

## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Samples | Coarse-Only | Coarse-Fine | Autoregressive |
|-----------|---------|-------------|-------------|----------------|
| **MVBench** | 3,800 | **30.8%** | 29.9% | 29.9% |
| **Video-MME** | 2,700 | **30.5%** | 28.2% | 30.4% |
| **ScienceQA** | 2,017 | **49.0%** | 43.8% | 46.6% |

### fVLM-135M (Stage 3 DPO) -- for comparison

| Benchmark | Coarse-Only | Coarse-Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| **MVBench** | 27.4% | 28.0% | 27.9% |
| **Video-MME** | 26.2% | **29.5%** | 28.7% |
| **ScienceQA** | **36.4%** | 35.6% | 35.4% |

**Key observations:**
- Scaling from 135M to 1.7B yields significant gains across all benchmarks, especially on ScienceQA (+12.6 points absolute).
- `coarse_only` is the strongest mode at 1.7B scale, suggesting the static query already captures most of the relevant information.
- At 135M scale, the `coarse_fine` foveation mechanism provides more benefit (e.g., +3.3 on Video-MME), consistent with smaller models needing the iterative refinement more.
## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
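The converging schedule can be sketched as per-group interpolation toward a shared target rate (a linear ramp is assumed here for illustration; the actual curve shape is not specified in this card):

```python
def converging_lr(step: int, total_steps: int, start_lr: float, end_lr: float) -> float:
    """Interpolate one parameter group's learning rate from its own start
    toward the shared end rate, so all groups converge to the same LR."""
    frac = min(step / total_steps, 1.0)
    return start_lr + frac * (end_lr - start_lr)

# The connector cools from 1e-3 while the backbone warms from 1e-5;
# both meet at 3e-5 by the end of Stage 1 (31,250 steps).
connector_lr = converging_lr(31_250, 31_250, 1e-3, 3e-5)
backbone_lr = converging_lr(31_250, 31_250, 1e-5, 3e-5)
```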
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: 3e-5 for all components, with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
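Answer-only masking replaces the labels for user/system tokens with the loss's ignore index, so gradients flow only through assistant tokens. A minimal sketch (the real pipeline presumably tracks roles during tokenization):

```python
def answer_only_labels(token_ids, roles, ignore_index=-100):
    """Mask every non-assistant token out of the cross-entropy loss."""
    return [tok if role == "assistant" else ignore_index
            for tok, role in zip(token_ids, roles)]

labels = answer_only_labels(
    [10, 11, 12, 13],
    ["system", "user", "assistant", "assistant"],
)
# labels == [-100, -100, 12, 13]
```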
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 for all components
- **Batch size**: 8, gradient accumulation 4 (effective batch 32), gradient checkpointing enabled
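For reference, the standard DPO objective with beta=0.1 reduces per preference pair to the scalar form below (a sketch of the loss, not the training code; inputs are summed log-probabilities under the policy and the frozen reference model):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))).

    Written as log1p(exp(-x)), which is stable for the moderate
    log-probability gaps seen in practice.
    """
    x = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-x))

# When the policy matches the reference, the loss starts at log(2).
base = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```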

## Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
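The collision is easy to reproduce with illustrative token ids (`-100` is the conventional sentinel, e.g. PyTorch's default `ignore_index`, and can never equal a real token id):

```python
EOS = 2  # illustrative EOS token id

def loss_positions(labels, ignore_index):
    """Return the labels that actually contribute to cross-entropy."""
    return [t for t in labels if t != ignore_index]

labels = [5, 9, 3, EOS]

# Bug: ignore_index equal to the EOS id silently masks every EOS label,
# so the model never receives a gradient toward emitting EOS.
buggy = loss_positions(labels, ignore_index=EOS)   # drops the EOS label

# Fix: a sentinel outside the vocabulary keeps EOS in the loss.
fixed = loss_positions(labels, ignore_index=-100)  # keeps all four labels
```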

| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |

## Usage

```python
# ... (model construction and input preparation elided in this excerpt)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",  # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```

## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
## License

Apache 2.0