sanps committed · Commit 8631a23 · verified · 1 Parent(s): 7fbae21

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +89 -25
README.md CHANGED
@@ -11,11 +11,44 @@ tags:
  - dinov2
  library_name: pytorch
  pipeline_tag: image-text-to-text
  ---
 
  # fVLM-1.7B (Foveated Vision-Language Model)
 
- A vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.
 
  ## Model Description
 
@@ -45,50 +78,67 @@ Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA)
 
  This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
 
  ## Training
 
- Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a single A100-80GB GPU. **Total training time: ~16 hours.**
 
- ### Stage 1: Visual Alignment (4.3h, 31250 steps)
  - **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
  - **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
  - **Loss**: Full-text cross-entropy (predict all tokens)
- - **LR**: Converging schedule connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
  - **Batch size**: 32
 
- ### Stage 2: Vision-Language SFT (9.5h, 31250 steps)
  - **Objective**: Supervised fine-tuning on vision-language tasks
  - **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
  - **Loss**: Answer-only cross-entropy (mask user/system tokens)
  - **LR**: Flat 3e-5 all components with cosine decay
  - **Batch size**: 32, gradient checkpointing enabled
 
- ### Stage 3: DPO Preference Optimization (1.9h, 2593 steps)
  - **Objective**: Align outputs with human preferences
  - **Data**: RLAIF-V (83K preference pairs)
  - **Loss**: DPO with beta=0.1
  - **LR**: 5e-7 all components
  - **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
 
- ## Benchmark Results
-
- > **Benchmarks are currently running and results will be updated shortly.**
- >
- > Previous benchmark numbers had known issues (see Bug Fixes below) and are being re-evaluated with corrected code.
-
- ### Inference Modes
-
- fVLM supports three inference modes with different speed/quality tradeoffs:
-
- | Mode | Description | Use Case |
- |------|-------------|----------|
- | `coarse_only` | Single static-query pass | Fastest; good for images |
- | `coarse_fine` | Two-pass parallel forward | Best overall; uses foveated attention |
- | `autoregressive` | Sequential with KV cache | Highest quality for video |
-
  ## Bug Fixes in This Version
 
- This release includes several important bug fixes:
 
  1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
 
@@ -100,13 +150,14 @@ This release includes several important bug fixes:
 
  | File | Description |
  |------|-------------|
- | `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) PyTorch format |
  | `model.safetensors` | Model weights in safetensors format (previous version) |
  | `model.py` | Full model architecture code |
  | `train.py` | Training script (all 3 stages) |
  | `data.py` | Data loading and preprocessing |
  | `benchmark.py` | Benchmark evaluation code |
  | `logger.py` | Logging utilities |
 
  ## Usage
 
@@ -187,12 +238,25 @@ with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
      input_ids=input_ids,
      attention_mask=attention_mask,
      loss_mask=loss_mask,
-     mode="coarse_fine",
  )
  # result["logits"]: [B, S, V] text logits
  # result["loss"]: scalar cross-entropy loss
  ```
 
  ## License
 
  Apache 2.0
 
  - dinov2
  library_name: pytorch
  pipeline_tag: image-text-to-text
+ model-index:
+ - name: fVLM-1.7B
+   results:
+   - task:
+       type: video-question-answering
+       name: Video Question Answering
+     dataset:
+       type: MVBench
+       name: MVBench
+     metrics:
+     - type: accuracy
+       value: 30.8
+       name: Accuracy (coarse_only)
+   - task:
+       type: video-question-answering
+       name: Video Question Answering
+     dataset:
+       type: Video-MME
+       name: Video-MME
+     metrics:
+     - type: accuracy
+       value: 30.5
+       name: Accuracy (coarse_only)
+   - task:
+       type: question-answering
+       name: Science Question Answering
+     dataset:
+       type: ScienceQA
+       name: ScienceQA
+     metrics:
+     - type: accuracy
+       value: 49.0
+       name: Accuracy (coarse_only)
  ---
 
  # fVLM-1.7B (Foveated Vision-Language Model)
 
+ A vision-language model that uses **foveated attention** to compress each video frame into a **single visual token**, enabling efficient processing of long videos on a single GPU.
 
  ## Model Description
 
 
  This enables processing **64+ frames** with the same memory as a few frames in traditional VLMs.
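
As a back-of-the-envelope check of the memory claim above, a minimal sketch of the token-budget arithmetic (the 576-tokens-per-frame figure is the LLaVA example quoted earlier in this README; the frame count is from the line above):

```python
# Token-budget sketch: why one visual token per frame matters for long video.

def visual_tokens(num_frames: int, tokens_per_frame: int) -> int:
    """Total visual tokens a clip contributes to the LLM context."""
    return num_frames * tokens_per_frame

frames = 64
baseline = visual_tokens(frames, 576)  # LLaVA-style: 36,864 visual tokens
fvlm = visual_tokens(frames, 1)        # fVLM: 64 visual tokens
print(baseline, fvlm, baseline // fvlm)  # 36864 64 576
```

A 64-frame clip thus costs the same visual-token budget as a fraction of a single frame in a 576-token-per-image VLM.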
 
+ ### Inference Modes
+
+ fVLM supports three forward modes with different speed/quality tradeoffs:
+
+ | Mode | Description | Use Case |
+ |------|-------------|----------|
+ | `coarse_only` | Single static-query pass | Fastest; good for images and quick inference |
+ | `coarse_fine` | Two-pass parallel forward (soft attention) | Training mode; uses foveated attention |
+ | `autoregressive` | Sequential with KV cache (hard attention) | Iterative foveation for video |
+
+ ## Benchmark Results
+
+ ### fVLM-1.7B (Stage 3 DPO)
+
+ | Benchmark | Samples | Coarse-Only | Coarse-Fine | Autoregressive |
+ |-----------|---------|-------------|-------------|----------------|
+ | **MVBench** | 3,800 | **30.8%** | 29.9% | 29.9% |
+ | **Video-MME** | 2,700 | **30.5%** | 28.2% | 30.4% |
+ | **ScienceQA** | 2,017 | **49.0%** | 43.8% | 46.6% |
+
+ ### fVLM-135M (Stage 3 DPO) -- for comparison
+
+ | Benchmark | Coarse-Only | Coarse-Fine | Autoregressive |
+ |-----------|-------------|-------------|----------------|
+ | **MVBench** | 27.4% | 28.0% | 27.9% |
+ | **Video-MME** | 26.2% | **29.5%** | 28.7% |
+ | **ScienceQA** | **36.4%** | 35.6% | 35.4% |
+
+ **Key observations:**
+ - Scaling from 135M to 1.7B yields significant gains across all benchmarks, especially on ScienceQA (+12.6 points absolute).
+ - `coarse_only` is the strongest mode at 1.7B scale, suggesting the static query already captures most of the relevant information.
+ - At 135M scale, the `coarse_fine` foveation mechanism provides more benefit (e.g., +3.3 points on Video-MME), consistent with smaller models needing iterative refinement more.
+
  ## Training
 
+ Trained with a **3-stage pipeline** (alignment, SFT, DPO) on a **single A100-80GB GPU**. Total training time: ~16 hours.
 
+ ### Stage 1: Visual Alignment (4.3h, 31,250 steps)
  - **Objective**: Align DINOv2 visual features with the SmolLM2 text embedding space
  - **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
  - **Loss**: Full-text cross-entropy (predict all tokens)
+ - **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
  - **Batch size**: 32
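
The converging LR schedule can be sketched as follows; the linear interpolation shape and the helper name are assumptions, since the README specifies only the start and end learning rates:

```python
# Sketch of a "converging" LR schedule: connector and backbone start at
# different LRs and meet at a shared final LR by the end of Stage 1.
# Linear interpolation is an assumption -- only the endpoints are documented.

def converging_lr(step: int, total_steps: int, lr_start: float, lr_end: float) -> float:
    """Interpolate linearly from lr_start to lr_end over total_steps."""
    frac = min(step / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)

TOTAL = 31_250  # Stage 1 step count
connector_lr = lambda s: converging_lr(s, TOTAL, 1e-3, 3e-5)
backbone_lr = lambda s: converging_lr(s, TOTAL, 1e-5, 3e-5)

print(connector_lr(0), backbone_lr(0))  # start: 0.001 and 1e-05
print(connector_lr(TOTAL), backbone_lr(TOTAL))  # both converge to ~3e-5
```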
 
+ ### Stage 2: Vision-Language SFT (9.5h, 31,250 steps)
  - **Objective**: Supervised fine-tuning on vision-language tasks
  - **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
  - **Loss**: Answer-only cross-entropy (mask user/system tokens)
  - **LR**: Flat 3e-5 all components with cosine decay
  - **Batch size**: 32, gradient checkpointing enabled
 
+ ### Stage 3: DPO Preference Optimization (1.9h, 2,593 steps)
  - **Objective**: Align outputs with human preferences
  - **Data**: RLAIF-V (83K preference pairs)
  - **Loss**: DPO with beta=0.1
  - **LR**: 5e-7 all components
  - **Batch size**: 8, grad accumulation 4 (effective batch 32), gradient checkpointing enabled
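
For reference, the per-pair DPO objective with beta=0.1 can be sketched in plain Python; the log-probability arguments are placeholders, not the model's actual API:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed log-probs of the chosen/rejected responses under the
    trained policy (pi_*) and the frozen reference model (ref_*).
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))

# A policy identical to the reference gives margin 0 -> loss = ln(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

The loss drops below ln(2) as the policy assigns relatively more probability to the chosen response than the reference does.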
 
  ## Bug Fixes in This Version
 
+ This release includes several important bug fixes over earlier checkpoints:
 
  1. **`eos_token` / `ignore_index` collision**: The EOS token ID was colliding with the `ignore_index` value used in cross-entropy loss, causing the model to never learn to produce EOS tokens properly. Fixed by using a non-colliding ignore index.
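
The fix boils down to masking label positions with a sentinel that can never equal a real token id; a minimal sketch, with made-up token ids and answer span:

```python
# Sketch of the eos_token / ignore_index fix: mask non-answer label positions
# with a negative sentinel (-100, a common cross-entropy ignore value) rather
# than a value that collides with a real vocabulary id such as EOS.

EOS_ID = 2           # illustrative EOS token id
IGNORE_INDEX = -100  # negative, so it can never equal a vocabulary id

def build_labels(token_ids, answer_start):
    """Answer-only labels: supervise answer tokens (including EOS), ignore the rest."""
    return [tok if i >= answer_start else IGNORE_INDEX
            for i, tok in enumerate(token_ids)]

seq = [5, 7, 9, 11, EOS_ID]  # 3 prompt tokens, 1 answer token, EOS
print(build_labels(seq, answer_start=3))  # [-100, -100, -100, 11, 2]
```

With a non-colliding sentinel, the EOS position keeps its real label and the model receives gradient signal for producing EOS.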
 
  | File | Description |
  |------|-------------|
+ | `checkpoint.pt` | Stage 3 (DPO) final checkpoint (step 2593) -- PyTorch format |
  | `model.safetensors` | Model weights in safetensors format (previous version) |
  | `model.py` | Full model architecture code |
  | `train.py` | Training script (all 3 stages) |
  | `data.py` | Data loading and preprocessing |
  | `benchmark.py` | Benchmark evaluation code |
  | `logger.py` | Logging utilities |
+ | `benchmark_results.json` | Full benchmark results with per-category breakdowns |
 
  ## Usage
 
 
      input_ids=input_ids,
      attention_mask=attention_mask,
      loss_mask=loss_mask,
+     mode="coarse_fine",  # or "coarse_only" or "autoregressive"
  )
  # result["logits"]: [B, S, V] text logits
  # result["loss"]: scalar cross-entropy loss
  ```
 
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{fvlm2025,
+   title={fVLM: Foveated Vision-Language Model},
+   author={Sandeep Sampath Kumar},
+   year={2025},
+   url={https://huggingface.co/sanps/fVLM-1.7B}
+ }
+ ```
+
  ## License
 
  Apache 2.0