tags:
- dinov2
library_name: pytorch
pipeline_tag: image-text-to-text
model-index:
- name: fVLM-1.7B
  results:
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: MVBench
      name: MVBench
    metrics:
    - type: accuracy
      value: 30.8
      name: Accuracy (coarse_only)
  - task:
      type: video-question-answering
      name: Video Question Answering
    dataset:
      type: Video-MME
      name: Video-MME
    metrics:
    - type: accuracy
      value: 30.5
      name: Accuracy (coarse_only)
  - task:
      type: question-answering
      name: Science Question Answering
    dataset:
      type: ScienceQA
      name: ScienceQA
    metrics:
    - type: accuracy
      value: 49.0
      name: Accuracy (coarse_only)
---

# fVLM-1.7B (Foveated Vision-Language Model)

A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.

## Model Description

Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame into a single visual token.

This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
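The memory claim follows directly from the token budget. A quick sketch of the arithmetic, using the 576-tokens-per-image LLaVA-style figure above as the baseline:

```python
def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens the language model must attend over."""
    return num_frames * tokens_per_frame

fvlm_tokens = visual_token_count(64, 1)        # fVLM: 1 token per frame -> 64
baseline_tokens = visual_token_count(64, 576)  # LLaVA-style baseline -> 36864
ratio = baseline_tokens // fvlm_tokens         # 576x fewer visual tokens
```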

### Inference Modes

fVLM supports three forward modes with different speed/quality tradeoffs:

| Mode | Description | Use Case |
|------|-------------|----------|
| `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
| `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
| `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
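As a rule of thumb, the mode can be chosen from the input type. A hypothetical helper (not part of the released API) that makes the tradeoff in the table concrete:

```python
def pick_mode(num_frames: int, prefer_speed: bool = False) -> str:
    """Choose a forward mode per the speed/quality tradeoffs above.

    Hypothetical helper: single images run fastest with 'coarse_only',
    while videos benefit from iterative foveation.
    """
    if num_frames <= 1 or prefer_speed:
        return "coarse_only"
    return "autoregressive"
```

The returned string matches the `mode` argument of the forward call shown under Usage.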

## Benchmark Results

### fVLM-1.7B (Stage 3 DPO)

| Benchmark | Samples | Coarse-Only | Coarse-Fine | Autoregressive |
|-----------|---------|-------------|-------------|----------------|
| **MVBench** | 3,800 | **30.8%** | 29.9% | 29.9% |
| **Video-MME** | 2,700 | **30.5%** | 28.2% | 30.4% |
| **ScienceQA** | 2,017 | **49.0%** | 43.8% | 46.6% |

### fVLM-135M (Stage 3 DPO) -- for comparison

| Benchmark | Coarse-Only | Coarse-Fine | Autoregressive |
|-----------|-------------|-------------|----------------|
| **MVBench** | 27.4% | 28.0% | 27.9% |
| **Video-MME** | 26.2% | **29.5%** | 28.7% |
| **ScienceQA** | **36.4%** | 35.6% | 35.4% |

**Key observations:**
- Scaling from 135M to 1.7B yields significant gains across all benchmarks, especially on ScienceQA (+12.6 points absolute).
- `coarse_only` is the strongest mode at 1.7B scale, suggesting the static query already captures most of the relevant information.
- At 135M scale, the `coarse_fine` foveation mechanism provides more benefit (e.g., +3.3 on Video-MME), consistent with smaller models needing the iterative refinement more.
## Training

Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
### Stage 1: Visual Alignment (4.3h, 31,250 steps)
- **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
- **Loss**: Full-text cross-entropy (predict all tokens)
- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
- **Batch size**: 32
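The converging schedule can be sketched as per-group interpolation toward a shared target rate (a linear ramp is assumed here for illustration; the actual curve shape is not specified in this card):

```python
def converging_lr(step: int, total_steps: int, start_lr: float, end_lr: float) -> float:
    """Interpolate one parameter group's learning rate from its own start
    toward the shared end rate, so all groups converge to the same LR."""
    frac = min(step / total_steps, 1.0)
    return start_lr + frac * (end_lr - start_lr)

# The connector cools from 1e-3 while the backbone warms from 1e-5;
# both meet at 3e-5 by the end of Stage 1 (31,250 steps).
connector_lr = converging_lr(31_250, 31_250, 1e-3, 3e-5)
backbone_lr = converging_lr(31_250, 31_250, 1e-5, 3e-5)
```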
### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
- **Objective**: Supervised fine-tuning on vision-language tasks
- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
- **Loss**: Answer-only cross-entropy (mask user/system tokens)
- **LR**: 3e-5 for all components, with cosine decay
- **Batch size**: 32, gradient checkpointing enabled
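Answer-only masking replaces the labels for user/system tokens with the loss's ignore index, so gradients flow only through assistant tokens. A minimal sketch (the real pipeline presumably tracks roles during tokenization):

```python
def answer_only_labels(token_ids, roles, ignore_index=-100):
    """Mask every non-assistant token out of the cross-entropy loss."""
    return [tok if role == "assistant" else ignore_index
            for tok, role in zip(token_ids, roles)]

labels = answer_only_labels(
    [10, 11, 12, 13],
    ["system", "user", "assistant", "assistant"],
)
# labels == [-100, -100, 12, 13]
```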
### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
- **Objective**: Align outputs with human preferences
- **Data**: RLAIF-V (83K preference pairs)
- **Loss**: DPO with beta=0.1
- **LR**: 5e-7 for all components
- **Batch size**: 8, gradient accumulation 4 (effective batch 32), gradient checkpointing enabled
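For reference, the standard DPO objective with beta=0.1 reduces per preference pair to the scalar form below (a sketch of the loss, not the training code; inputs are summed log-probabilities under the policy and the frozen reference model):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))).

    Written as log1p(exp(-x)), which is stable for the moderate
    log-probability gaps seen in practice.
    """
    x = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return math.log1p(math.exp(-x))

# When the policy matches the reference, the loss starts at log(2).
base = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```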

## Bug Fixes in This Version

This release includes several important bug fixes over earlier checkpoints:

1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
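The collision is easy to reproduce with illustrative token ids (`-100` is the conventional sentinel, e.g. PyTorch's default `ignore_index`, and can never equal a real token id):

```python
EOS = 2  # illustrative EOS token id

def loss_positions(labels, ignore_index):
    """Return the labels that actually contribute to cross-entropy."""
    return [t for t in labels if t != ignore_index]

labels = [5, 9, 3, EOS]

# Bug: ignore_index equal to the EOS id silently masks every EOS label,
# so the model never receives a gradient toward emitting EOS.
buggy = loss_positions(labels, ignore_index=EOS)   # drops the EOS label

# Fix: a sentinel outside the vocabulary keeps EOS in the loss.
fixed = loss_positions(labels, ignore_index=-100)  # keeps all four labels
```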

| File | Description |
|------|-------------|
| `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
| `model.safetensors` | Model weights in safetensors format (previous version) |
| `model.py` | Full model architecture code |
| `train.py` | Training script (all 3 stages) |
| `data.py` | Data loading and preprocessing |
| `benchmark.py` | Benchmark evaluation code |
| `logger.py` | Logging utilities |
| `benchmark_results.json` | Full benchmark results with per-category breakdowns |

## Usage

```python
# ... (model construction and input preparation elided in this excerpt)
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    result = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        loss_mask=loss_mask,
        mode="coarse_fine",  # or "coarse_only" or "autoregressive"
    )
# result["logits"]: [B, S, V] text logits
# result["loss"]: scalar cross-entropy loss
```

## Citation

If you use this model, please cite:

```bibtex
@misc{fvlm2025,
  title={fVLM: Foveated Vision-Language Model},
  author={Sandeep Sampath Kumar},
  year={2025},
  url={https://huggingface.co/sanps/fVLM-1.7B}
}
```
## License

Apache 2.0