sanps committed on
Commit 7fbae21 · verified · 1 Parent(s): 98b7474

Update model card: training info, bug fixes, benchmark status

Files changed (1):
  1. README.md +51 -62

README.md CHANGED
@@ -17,45 +17,11 @@ pipeline_tag: image-text-to-text
 
 A vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.
 
- ## Benchmark Results
-
- ### Video Benchmarks
-
- | Benchmark | fVLM-135M | fVLM-1.7B | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
- |-----------|:---------:|:---------:|:------------:|:------------:|:------------:|
- | **MVBench** (3800 MCQ) | 28.0% | 31.8% | 32.7% | 39.7% | 46.3% |
- | **Video-MME** (2700 MCQ) | 29.5% | 30.2% | 33.7% | 42.2% | 52.1% |
-
- ### Image Benchmarks
-
- | Benchmark | fVLM-135M | fVLM-1.7B | SmolVLM2-256M | SmolVLM2-500M | SmolVLM2-2.2B |
- |-----------|:---------:|:---------:|:------------:|:------------:|:------------:|
- | **ScienceQA** (2017 MCQ) | 36.0% | 51.5% | 73.8% | 80.0% | 89.6% |
-
- > **Key context**: fVLM uses **1 visual token per frame** vs SmolVLM2's 64-256 tokens per image. fVLM-1.7B has ~1.8B params total — smaller than SmolVLM2-2.2B but with extreme visual compression.
-
- ### Results by Inference Mode
-
- fVLM supports three inference modes with different speed/quality tradeoffs:
-
- | Benchmark | Coarse-Only | Coarse→Fine | Autoregressive |
- |-----------|:----------:|:-----------:|:--------------:|
- | MVBench | 31.8% | 31.5% | 30.2% |
- | Video-MME | 28.8% | 30.2% | 29.7% |
- | ScienceQA | 51.5% | 47.1% | 46.2% |
-
- - **Coarse-Only**: Single static-query pass (fastest, no foveation)
- - **Coarse→Fine**: Two-pass parallel forward (training mode, with foveated attention)
- - **Autoregressive**: Sequential inference with KV cache (highest quality)
-
- ### Analysis
-
- - **Foveation helps on video**: coarse→fine adds significant improvement on Video-MME over coarse-only, confirming that learned "where to look" queries improve video understanding
- - **Scale-up from 135M→1.7B**: Larger LLM backbone improves reasoning across all benchmarks
- - **ScienceQA**: Shows the benefit of a stronger language backbone for reasoning tasks
- - **Efficiency**: Despite using only 1 visual token per frame, fVLM-1.7B narrows the gap with multi-token VLMs
-
- ## Architecture
 
 | Component | Details |
 |-----------|---------|
@@ -79,29 +45,68 @@ Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA)
 
 This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
 
- ## Training Pipeline
-
- Trained on a single A100-80GB GPU.
 
 ### Stage 1: Visual Alignment (4.3h, 31250 steps)
 - **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
 - **Loss**: Full-text cross-entropy (predict all tokens)
 - **LR**: Converging schedule — connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
 - **Batch size**: 32
 
 ### Stage 2: Vision-Language SFT (9.5h, 31250 steps)
 - **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
 - **Loss**: Answer-only cross-entropy (mask user/system tokens)
 - **LR**: Flat 3e-5 all components with cosine decay
 - **Batch size**: 32, gradient checkpointing enabled
 
- ### Stage 3: DPO (1.9h, 2593 steps)
 - **Data**: RLAIF-V (83K preference pairs)
 - **Loss**: DPO with beta=0.1
 - **LR**: 5e-7 all components
 - **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
 
- **Total training time**: ~16 hours on 1x A100-80GB
 
  ## Usage
 
@@ -114,9 +119,9 @@ from transformers import AutoTokenizer
 from huggingface_hub import hf_hub_download
 
 # Download checkpoint
- ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "model.safetensors")
 
- # Build model (requires model code from this repo)
 from model import FoveatedVLM
 
 model = FoveatedVLM(
@@ -128,9 +133,8 @@ model = FoveatedVLM(
 )
 
 # Load weights
- from safetensors.torch import load_file
- state_dict = load_file(ckpt_path)
- model.load_state_dict(state_dict)
 model = model.to("cuda").to(torch.bfloat16).eval()
 
 tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
@@ -170,7 +174,7 @@ frames = torch.stack(tensors).unsqueeze(0).to("cuda", dtype=torch.bfloat16)
 
 ```python
 messages = [
-     {{"role": "user", "content": "Describe what is happening in this image."}},
 ]
 text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
@@ -189,21 +193,6 @@ with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
 # result["loss"]: scalar cross-entropy loss
 ```
 
- ### Inference Modes
-
- | Mode | Description | Use Case |
- |------|-------------|----------|
- | `coarse_only` | Single static-query pass | Fastest; good for images |
- | `coarse_fine` | Two-pass parallel forward | Best overall; uses foveated attention |
- | `autoregressive` | Sequential with KV cache | Highest quality for video |
-
- ## Config Files
-
- Training configs included:
- - `configs/stage1_1.7B.yaml` — Visual alignment
- - `configs/stage2_1.7B.yaml` — Vision-language SFT
- - `configs/stage3_1.7B.yaml` — DPO preference optimization
-
 ## License
 
 Apache 2.0
 
 
 A vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.
 
+ ## Model Description
+
+ **fVLM-1.7B** is built on **SmolLM2-1.7B-Instruct** (language backbone) + **DINOv2-small** (vision encoder), connected via a foveated cross-attention mechanism that compresses each video frame into **1 visual token**. This extreme compression enables processing 64+ frames within the same context window budget that traditional VLMs use for a single image.
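To make the one-token-per-frame idea concrete, here is a minimal sketch (hypothetical names and dimensions, not the repo's actual `model.py` API) in which a single learned query cross-attends over a frame's DINOv2 patch features and pools them into one token in the LLM's embedding space:

```python
import torch
import torch.nn as nn

class FrameCompressor(nn.Module):
    """Sketch: pool a frame's patch features into ONE visual token via
    cross-attention from a single learned query (illustrative only)."""

    def __init__(self, vision_dim=384, llm_dim=2048, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, vision_dim))  # one learned query
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # project into LLM embedding space

    def forward(self, patch_feats):  # (B, num_patches, vision_dim)
        q = self.query.expand(patch_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_feats, patch_feats)  # (B, 1, vision_dim)
        return self.proj(pooled)                            # (B, 1, llm_dim)

# 2 videos x 64 frames x 256 patches of assumed DINOv2-small features (dim 384)
frames = torch.randn(2, 64, 256, 384)
B, T, P, D = frames.shape
tokens = FrameCompressor()(frames.reshape(B * T, P, D)).reshape(B, T, -1)
# tokens.shape -> torch.Size([2, 64, 2048]): one token per frame
```

With this pooling, the LLM sees 64 visual tokens for a 64-frame clip, versus thousands for a multi-token-per-frame VLM.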
+
+ ### Architecture
+
 | Component | Details |
 |-----------|---------|
 
 This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
 
+ ## Training
+
+ Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a single A100-80GB GPU. **Total training time: ~16 hours.**
 
 ### Stage 1: Visual Alignment (4.3h, 31250 steps)
+ - **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
 - **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
 - **Loss**: Full-text cross-entropy (predict all tokens)
 - **LR**: Converging schedule — connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
 - **Batch size**: 32
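The converging schedule above can be sketched with per-group lambdas (a hedged illustration assuming linear annealing; the actual schedule shape in `train.py` may differ):

```python
import torch

def converging_lr(step, total_steps=31250, start=1e-3, end=3e-5):
    """Linearly anneal from `start` to `end` over training (assumed shape)."""
    t = min(step / total_steps, 1.0)
    return start + t * (end - start)

model = torch.nn.Linear(8, 8)
opt = torch.optim.AdamW([
    {"params": [model.weight], "lr": 1e-3},  # stand-in for the connector
    {"params": [model.bias],   "lr": 1e-5},  # stand-in for the LLM backbone
])
# LambdaLR multiplies each group's base LR by its lambda's return value
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=[
    lambda s: converging_lr(s, start=1e-3, end=3e-5) / 1e-3,
    lambda s: converging_lr(s, start=1e-5, end=3e-5) / 1e-5,
])
for _ in range(31250):
    opt.step()
    sched.step()
final_lrs = [g["lr"] for g in opt.param_groups]
# both groups converge to the shared final LR of 3e-5
```

The point of the converging shape: the randomly initialized connector needs a large LR, while the pretrained backbone starts gently, and both meet at 3e-5 by the end of Stage 1.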
 
 ### Stage 2: Vision-Language SFT (9.5h, 31250 steps)
+ - **Objective**: Supervised fine-tuning on vision-language tasks
 - **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
 - **Loss**: Answer-only cross-entropy (mask user/system tokens)
 - **LR**: Flat 3e-5 all components with cosine decay
 - **Batch size**: 32, gradient checkpointing enabled
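Answer-only cross-entropy works by masking the prompt positions in the labels; a minimal sketch (token ids invented for illustration):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # outside the vocab, so it cannot collide with a real token id

# toy sequence: 3 prompt tokens (user/system) followed by a 3-token answer
input_ids = torch.tensor([[5, 6, 7, 8, 9, 2]])  # 2 plays the role of EOS here
labels = input_ids.clone()
labels[:, :3] = IGNORE_INDEX                    # mask the user/system tokens

vocab_size = 32
logits = torch.randn(1, 6, vocab_size)          # dummy model outputs
loss = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1), ignore_index=IGNORE_INDEX
)
# only the 3 answer positions (including EOS) contribute to the loss
```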
 
+ ### Stage 3: DPO Preference Optimization (1.9h, 2593 steps)
+ - **Objective**: Align outputs with human preferences
 - **Data**: RLAIF-V (83K preference pairs)
 - **Loss**: DPO with beta=0.1
 - **LR**: 5e-7 all components
 - **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
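For reference, the DPO objective with beta=0.1 can be sketched as follows (generic DPO, not the repo's exact code; inputs are summed log-probabilities of each response under the trained policy and a frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: push the policy's log-ratio for the chosen response
    above its log-ratio for the rejected one, scaled by beta."""
    chosen_logratio = pol_chosen - ref_chosen
    rejected_logratio = pol_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# dummy summed log-probs for one preference pair
loss = dpo_loss(
    pol_chosen=torch.tensor([-10.0]), pol_rejected=torch.tensor([-14.0]),
    ref_chosen=torch.tensor([-11.0]), ref_rejected=torch.tensor([-12.0]),
)
```

The small beta (0.1) keeps the policy close to the SFT reference while still rewarding the preferred response.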
 
+ ## Benchmark Results
+
+ > **Benchmarks are currently running and results will be updated shortly.**
+ >
+ > Previous benchmark numbers had known issues (see Bug Fixes below) and are being re-evaluated with corrected code.
+
+ ### Inference Modes
+
+ fVLM supports three inference modes with different speed/quality tradeoffs:
+
+ | Mode | Description | Use Case |
+ |------|-------------|----------|
+ | `coarse_only` | Single static-query pass | Fastest; good for images |
+ | `coarse_fine` | Two-pass parallel forward | Best overall; uses foveated attention |
+ | `autoregressive` | Sequential with KV cache | Highest quality for video |
+
+ ## Bug Fixes in This Version
+
+ This release includes several important bug fixes:
+
+ 1. **`eos_token` / `ignore_index` collision**: The EOS token ID collided with the `ignore_index` value used in the cross-entropy loss, so the model never received a learning signal to emit EOS. Fixed by using a non-colliding ignore index.
+
+ 2. **Stage 2 OOM skip-rate fix**: During Stage 2 SFT, out-of-memory errors on large batches were silently skipped at a high rate, effectively shrinking the training data actually seen. Fixed to handle memory properly and reduce the skip rate.
+
+ 3. **Benchmark letter-bias fix**: The evaluation code was biased toward certain answer letters in multiple-choice questions, inflating scores for some options. Fixed to ensure fair evaluation across all answer choices.
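Fix 1 is easy to reproduce in miniature (illustrative ids and shapes only): if the value used to mask padding equals the EOS token id, the genuine EOS label at the end of each answer is silently ignored too, so the model never gets a gradient toward emitting EOS.

```python
import torch
import torch.nn.functional as F

EOS_ID = 2                                        # hypothetical EOS token id
vocab_size = 32
logits = torch.randn(1, 4, vocab_size)
labels = torch.tensor([[7, 8, EOS_ID, EOS_ID]])   # answer, real EOS, then EOS reused as padding

# BUG: masking with ignore_index=EOS_ID also silences the real EOS target
buggy_active = (labels != EOS_ID).sum().item()    # only 2 positions train; EOS never does

# FIX: mask padding with a value outside the vocab instead
labels_fixed = labels.clone()
labels_fixed[:, 3:] = -100                        # mask only the padding position
fixed_active = (labels_fixed != -100).sum().item()  # 3 positions train, EOS included
loss = F.cross_entropy(logits.view(-1, vocab_size),
                       labels_fixed.view(-1), ignore_index=-100)
```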
 
+ ## Files
+
+ | File | Description |
+ |------|-------------|
+ | `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) — PyTorch format |
+ | `model.safetensors` | Model weights in safetensors format (previous version) |
+ | `model.py` | Full model architecture code |
+ | `train.py` | Training script (all 3 stages) |
+ | `data.py` | Data loading and preprocessing |
+ | `benchmark.py` | Benchmark evaluation code |
+ | `logger.py` | Logging utilities |
 
  ## Usage
 
 from huggingface_hub import hf_hub_download
 
 # Download checkpoint
+ ckpt_path = hf_hub_download("sanps/fVLM-1.7B", "checkpoint.pt")
 
+ # Build model
 from model import FoveatedVLM
 
 model = FoveatedVLM(
 )
 
 # Load weights
+ ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+ model.load_state_dict(ckpt["model"] if "model" in ckpt else ckpt)
 model = model.to("cuda").to(torch.bfloat16).eval()
 
 tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")
 
 
 ```python
 messages = [
+     {"role": "user", "content": "Describe what is happening in this image."},
 ]
 text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 input_ids = tokenizer.encode(text, return_tensors="pt").to("cuda")
 
 # result["loss"]: scalar cross-entropy loss
 ```
 
  ## License
 
  Apache 2.0