---
title: VLM Caption Lab
emoji: πŸ–ΌοΈ
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---

# πŸ”¬ VLM Caption Lab

**Compare how different Vision-Language Models look at images while writing captions β€” four architectures, one dataset, one evaluation metric.**

VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to **image captioning** (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.

---

## Architecture Comparison

| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| **BLIP** | Selective gated attention β€” looks at image only when needed | 224M | **0.6199** (optimized) |
| **ViT-GPT2** | Full attention β€” looks at entire image for every word | 239M | ~0.55 |
| **GIT** | Memory-based β€” memorizes image first, writes from memory | 177M | ~0.54 |
| **Custom VLM** | Built from scratch β€” Shakespeare decoder + visual bridge | 103M (16.2M trainable) | **0.2863** |

> **What is CIDEr?** CIDEr (Consensus-based Image Description Evaluation) compares the model's caption to five human-written descriptions of the same image, rewarding n-grams that the human references agree on. Higher = better. The score is not capped at 1.0 β€” strong models on MS-COCO can exceed it β€” so read the numbers above relative to each other rather than as percentages.
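As a rough intuition only β€” the real metric additionally TF-IDF-weights each n-gram by corpus rarity and averages over n = 1..4 β€” consensus scoring against multiple references looks like this toy sketch:

```python
from collections import Counter

def ngrams(caption, n):
    """Lowercased word n-grams of a caption, with counts."""
    words = caption.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def consensus_overlap(candidate, references, n=2):
    """Mean n-gram F1 overlap against each reference.

    Toy stand-in for CIDEr: no TF-IDF weighting, single n, no stemming.
    """
    cand = ngrams(candidate, n)
    scores = []
    for ref in references:
        r = ngrams(ref, n)
        overlap = sum((cand & r).values())
        denom = sum(cand.values()) + sum(r.values())
        scores.append(2 * overlap / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A caption matching one reference closely but another not at all lands in between, which is the "consensus" idea.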

---

## 🌐 Live Demo & Deployment

**The easiest way to test this project is via the live web demo.**
> πŸ‘‰ **[Insert Your Live Hosted Link Here]**

*(If deploying yourself, see the `DEPLOYMENT_GUIDE.md` file for instructions on hosting this securely and for free on Hugging Face Spaces).*

---

## Quick Start (Local Run)

If you prefer to run this locally rather than using the web demo, follow these steps. 

> ⚠️ **Note on Weights**: You do *not* need to train the models yourself to test the app.
> - Base model weights (BLIP, ViT-GPT2) will download automatically from Hugging Face on the first run.
> - The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
> - **To skip training completely**, you only need to run `streamlit run app.py`!

### Prerequisites

- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB disk space for model checkpoints

### Setup

```bash
# Clone the repository
git clone <repo-url>
cd project_02

# Create a virtual environment
python -m venv venv
source venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```

### Dependencies

| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Load pre-trained BLIP, ViT-GPT2, and GIT models from HuggingFace |
| `datasets` | Download and load MS-COCO caption dataset from HuggingFace |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Compute CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter β€” checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |

---

## πŸš€ What to Expect on First Run

When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens:

1. **Automatic Model Downloads**: You do *not* need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them. 
2. **Download Time**: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). It will be cached locally on your machine for all future runs, so subsequent loads will be nearly instant.
3. **Custom VLM Weights**: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. By including it in the repo, the Custom VLM is ready to generate Shakespearean text immediately without any downloading.
4. **Fine-Tuned Weights**: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there.
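The weight-source logic described above can be sketched as follows (a simplified illustration; the actual function names in `app.py` may differ):

```python
from pathlib import Path

def resolve_weights(model_name, source="base", outputs_dir="outputs"):
    """Pick a checkpoint: base weights from the HuggingFace Hub, or a
    locally fine-tuned 'best'/'latest' directory if training produced one."""
    if source == "base":
        return None  # signal: let transformers pull and cache Hub weights
    ckpt = Path(outputs_dir) / model_name / source  # 'best' or 'latest'
    if not ckpt.is_dir():
        raise FileNotFoundError(
            f"No fine-tuned weights at {ckpt}; "
            f"run `python train.py --model {model_name}` first."
        )
    return ckpt
```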

---

## Training

All four models are trained through one unified script:

```bash
# Train individual models
python train.py --model blip          # ~1.5 hours on Apple Silicon
python train.py --model vit_gpt2      # ~1 hour
python train.py --model git           # ~20 minutes
python train.py --model custom        # ~3 hours (15 epochs)
```

### What happens during training

1. **Dataset loading** β€” Downloads MS-COCO captions from HuggingFace (cached after first download)
2. **Training** β€” Images are processed by the vision encoder, captions by the text decoder
3. **Validation** β€” After each epoch, computes validation loss + CIDEr score on held-out images
4. **Checkpointing** β€” Saves two checkpoints:
   - `outputs/{model}/best/` β€” The model with the **highest CIDEr score** (use this for evaluation)
   - `outputs/{model}/latest/` β€” The most recent epoch (use for debugging or continuing training)
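The best/latest bookkeeping amounts to the following pattern (a sketch with stubbed `validate` and `save` callables β€” the real loop lives in `train.py`):

```python
def run_epochs(n_epochs, validate, save):
    """Refresh 'latest' every epoch; overwrite 'best' only on a new high CIDEr.

    `validate` returns a CIDEr score; `save(tag)` writes a checkpoint dir.
    Both are stand-ins for the real functions in train.py.
    """
    best_cider = float("-inf")
    for epoch in range(n_epochs):
        # ... one training pass over the dataset would go here ...
        cider = validate()
        save("latest")              # always refresh outputs/{model}/latest/
        if cider > best_cider:
            best_cider = cider
            save("best")            # outputs/{model}/best/ tracks the peak
    return best_cider
```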

### Key hyperparameters

| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|-|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |

---

## Evaluation

### Basic evaluation

```bash
# Evaluate a single model (computes CIDEr score)
python eval.py --model blip --weights best

# Evaluate with pre-trained weights (no fine-tuning)
python eval.py --model blip --weights base

# Compare all models side by side
python eval.py --model all --weights best
```

### Experiments

```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best

# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best

# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```

### Custom decoding settings

```bash
python eval.py --model blip --weights best \
    --num_beams 10 \
    --max_new_tokens 50 \
    --length_penalty 1.2
```

### All command-line options

| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (more = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | > 1.0 favors longer captions, < 1.0 shorter (HuggingFace beam-search semantics) |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
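Beam candidates are ranked by length-normalized log-probability; the usual HuggingFace-style formula is `score = sum_logprob / length ** length_penalty`. Since log-probs are negative, a larger exponent shrinks the magnitude of long hypotheses' scores, which is why values above 1.0 favor longer captions. A minimal illustration:

```python
def beam_score(sum_logprob, length, length_penalty=1.0):
    """Length-normalized beam score (HuggingFace-style):
    summed token log-probs divided by length ** length_penalty.
    Log-probs are negative, so a bigger denominator raises the score."""
    return sum_logprob / (length ** length_penalty)

# A short and a long hypothesis with the same average token log-prob:
short = beam_score(-5.0, length=10, length_penalty=1.2)
long_ = beam_score(-10.0, length=20, length_penalty=1.2)
```

With `length_penalty=1.0` the two tie; at 1.2 the longer hypothesis wins.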

---

## Streamlit Demo

```bash
streamlit run app.py
```

The demo provides three tabs:

### πŸ–ΌοΈ Caption Tab
Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode.

### πŸ“Š Compare All Models Tab
Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption.

### πŸ“ˆ Experiment Results Tab
Browse pre-computed results from all three experiments.

### Sidebar Controls
- **Weight Source** β€” Switch between pre-trained models and your fine-tuned checkpoints
- **Architecture** β€” Select any of the four models (each has an info card explaining its approach)
- **Generation Mode** β€” Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for Custom VLM
- **Advanced Controls** β€” Adjust beam width, temperature, length penalty, top-k, and top-p

> **Safety:** All captions pass through a toxicity filter (`detoxify`) before being displayed.
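The gating logic is roughly the following (a sketch: the stubbed `score_fn` stands in for `Detoxify(...).predict`, and the 0.5 threshold is an assumption, not necessarily the app's actual setting):

```python
def safe_caption(caption, score_fn, threshold=0.5):
    """Return the caption if its toxicity score clears the threshold,
    otherwise a placeholder. `score_fn` maps text -> toxicity in [0, 1]."""
    if score_fn(caption) >= threshold:
        return "[caption withheld by safety filter]"
    return caption
```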

---

## Configuration

Hyperparameters are managed through Python dataclasses in `configs/`:

```
configs/
β”œβ”€β”€ base_config.py          # Shared defaults (batch size, image size, optimizer settings)
β”œβ”€β”€ blip_config.py          # BLIP-specific overrides
β”œβ”€β”€ vit_gpt2_config.py      # ViT-GPT2-specific overrides
β”œβ”€β”€ git_config.py           # GIT-specific overrides
└── custom_vlm_config.py    # Custom VLM overrides (decoder architecture, learning rates)
```

Access any config in code:

```python
from configs import get_config
cfg = get_config("blip")  # Returns BlipConfig instance with all settings
```
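Under the hood this can be as simple as a registry of dataclasses. A sketch of the likely shape β€” the field names and values here are illustrative, not the project's actual settings:

```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    # Shared defaults (illustrative values)
    batch_size: int = 16
    image_size: int = 224
    learning_rate: float = 1e-5

@dataclass
class BlipConfig(BaseConfig):
    # Model-specific overrides layered on top of the shared defaults
    epochs: int = 3

_REGISTRY = {"blip": BlipConfig}

def get_config(name):
    """Look up and instantiate the config class for a model name."""
    return _REGISTRY[name]()
```

Inheritance keeps shared defaults in one place while each model file only states what differs.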

---

## Experiments & Key Results

### 1. Cross-Attention Masking: What Happens When We Hide Image Patches?

| What We Did | CIDEr Score | Change |
|---|---|---|
| Showed the full image | 0.5371 | β€” Baseline |
| Hid 50% of image patches randomly | 0.5371 | **No change** |
| Showed only the center of the image | 0.5371 | **No change** |
| Compressed entire image to 1 token | 0.0008 | **βˆ’99.8%** |

**Takeaway:** Half the image patches are redundant, but spatial structure is essential.
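The random-masking condition can be sketched as a boolean keep-mask over the ViT patch grid (a simplified illustration; 196 = 14 Γ— 14 patches for a 224-pixel input is an assumption about the encoder):

```python
import random

def random_patch_mask(num_patches=196, keep_ratio=0.5, seed=0):
    """Boolean mask: True = the decoder may cross-attend to this patch.
    Hiding a patch means zeroing its cross-attention weights."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(num_patches), int(num_patches * keep_ratio)))
    return [i in kept for i in range(num_patches)]
```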

### 2. Beam Search Settings: What Produces the Best Captions?

**Best configuration found:** beam_size=10, length_penalty=1.2, max_tokens=50 β†’ **CIDEr: 0.6199**

Wider beam search plus length normalization improves caption quality by ~13%.

### 3. Caption Filtering: Does Training Data Quality Matter?

| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | **0.6359** |
| Filtered (5–25 words) | 0.5877 |

Raw works best for this already-clean dataset. Filtering recommended for noisier data.
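The 5–25-word filter from this experiment is straightforward (a sketch; the real implementation lives in `data_prep.py` and may differ in details):

```python
def filter_captions(captions, min_words=5, max_words=25):
    """Keep only captions whose word count falls in [min_words, max_words]."""
    return [c for c in captions if min_words <= len(c.split()) <= max_words]
```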

---

## Project Structure

```
project_02/
β”œβ”€β”€ app.py                              # Streamlit web demo (3 tabs)
β”œβ”€β”€ config.py                           # Backward-compatible config wrapper
β”œβ”€β”€ data_prep.py                        # Dataset loading + caption filtering
β”œβ”€β”€ eval.py                             # CIDEr evaluator + experiment runner
β”œβ”€β”€ train.py                            # Unified training loop for all 4 models
β”œβ”€β”€ requirements.txt                    # Python dependencies
β”œβ”€β”€ input.txt                           # Shakespeare corpus (vocabulary source)
β”œβ”€β”€ shakespeare_transformer.pt          # Pre-trained Shakespeare decoder weights
β”‚
β”œβ”€β”€ configs/                            # Hyperparameter configs
β”‚   β”œβ”€β”€ base_config.py                  # Shared defaults
β”‚   β”œβ”€β”€ blip_config.py                  # BLIP settings
β”‚   β”œβ”€β”€ vit_gpt2_config.py             # ViT-GPT2 settings
β”‚   β”œβ”€β”€ git_config.py                   # GIT settings
β”‚   └── custom_vlm_config.py            # Custom VLM settings
β”‚
β”œβ”€β”€ models/                             # Model implementations
β”‚   β”œβ”€β”€ blip_tuner.py                   # BLIP (gated cross-attention)
β”‚   β”œβ”€β”€ vit_gpt2_tuner.py              # ViT-GPT2 (full cross-attention)
β”‚   β”œβ”€β”€ git_tuner.py                    # GIT (no cross-attention)
β”‚   └── custom_vlm.py                  # Custom VLM (visual prefix-tuning)
β”‚
β”œβ”€β”€ experiments/                        # Experiment scripts and results
β”‚   β”œβ”€β”€ ablation_study.py              # Image masking experiment
β”‚   β”œβ”€β”€ parameter_sweep.py             # Beam search settings sweep
β”‚   β”œβ”€β”€ data_prep_analysis.py          # Caption filtering comparison
β”‚   └── cross_attention_patterns.py    # Architecture comparison table
β”‚
β”œβ”€β”€ outputs/                            # Saved model checkpoints
β”‚   β”œβ”€β”€ blip/{best,latest}/
β”‚   └── custom_vlm/{best,latest}/
β”‚
β”œβ”€β”€ detailed_technical_report_cross_attention_vlm_image_captioning.md
β”œβ”€β”€ simplified_overview_vlm_image_captioning_project.md
└── README.md                           # This file
```

---

## Tech Stack

| Component | Technology |
|---|---|
| Training Framework | PyTorch + HuggingFace Transformers |
| Dataset | MS-COCO Captions (via HuggingFace Datasets) |
| Evaluation Metric | CIDEr (via pycocoevalcap) |
| Safety Filter | detoxify (toxicity detection) |
| Web Demo | Streamlit |
| Hardware | Apple Silicon Mac with MPS acceleration |

---

## Author

**Manoj Kumar** β€” March 2026