---
title: VLM Caption Lab
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---
# 🔬 VLM Caption Lab
**Compare how different Vision-Language Models look at images while writing captions: four architectures, one dataset, one evaluation metric.**
VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to **image captioning** (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.
---
## Architecture Comparison
| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| **BLIP** | Selective gated attention – looks at the image only when needed | 224M | **0.6199** (optimized) |
| **ViT-GPT2** | Full attention – looks at the entire image for every word | 239M | ~0.55 |
| **GIT** | Memory-based – memorizes the image first, writes from memory | 177M | ~0.54 |
| **Custom VLM** | Built from scratch – Shakespeare decoder + visual bridge | 103M (16.2M trainable) | **0.2863** |
> **What is CIDEr?** CIDEr (Consensus-based Image Description Evaluation) scores a generated caption by its TF-IDF-weighted n-gram overlap with several human-written references for the same image (MS-COCO provides five per image). Higher is better; the score is not capped at 1.0, and strong COCO captioners commonly exceed it.
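As a rough illustration of the consensus idea, here is a simplified sketch. It is *not* the `pycocoevalcap` implementation (which adds TF-IDF weighting and averages n-gram lengths 1 through 4); it only shows the core "cosine similarity of n-gram count vectors against each reference" step:

```python
from collections import Counter
import math

def ngrams(caption, n):
    """All n-grams of a whitespace-tokenized caption, with counts."""
    tokens = caption.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def consensus_score(candidate, references, n=2):
    """Average cosine similarity between candidate and reference n-gram
    count vectors -- a toy stand-in for CIDEr, which additionally applies
    TF-IDF weighting and combines n-gram lengths 1..4."""
    cand = ngrams(candidate, n)
    cand_norm = math.sqrt(sum(v * v for v in cand.values()))
    total = 0.0
    for ref in references:
        r = ngrams(ref, n)
        r_norm = math.sqrt(sum(v * v for v in r.values()))
        dot = sum(cand[g] * r[g] for g in cand)
        if cand_norm and r_norm:
            total += dot / (cand_norm * r_norm)
    return total / len(references)

refs = ["a dog runs on the beach", "a brown dog running on the beach"]
print(consensus_score("a dog runs on the beach", refs))   # high overlap
print(consensus_score("a red car on the street", refs))   # little overlap
```

A caption that echoes the references' phrasing scores near 1.0 per reference; an unrelated caption scores near 0.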
---
## 🚀 Live Demo & Deployment
**The easiest way to test this project is via the live web demo.**
> 👉 **[Insert Your Live Hosted Link Here]**
*(If deploying yourself, see the `DEPLOYMENT_GUIDE.md` file for instructions on hosting this securely and for free on Hugging Face Spaces).*
---
## Quick Start (Local Run)
If you prefer to run this locally rather than using the web demo, follow these steps.
> β οΈ **Note on Weights**: You do *not* need to train the models yourself to test the app.
> - Base model weights (BLIP, ViT-GPT2) will download automatically from Hugging Face on the first run.
> - The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
> - **To skip training completely**, you only need to run `streamlit run app.py`!
### Prerequisites
- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB disk space for model checkpoints
### Setup
```bash
# Clone the repository
git clone <repo-url>
cd project_02
# Create a virtual environment
python -m venv venv
source venv/bin/activate
# Install all dependencies
pip install -r requirements.txt
# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```
### Dependencies
| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Load pre-trained BLIP, ViT-GPT2, and GIT models from HuggingFace |
| `datasets` | Download and load MS-COCO caption dataset from HuggingFace |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Compute CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter – checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |
---
## 📦 What to Expect on First Run
When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens:
1. **Automatic Model Downloads**: You do *not* need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them.
2. **Download Time**: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). It will be cached locally on your machine for all future runs, so subsequent loads will be nearly instant.
3. **Custom VLM Weights**: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. By including it in the repo, the Custom VLM is ready to generate Shakespearean text immediately without any downloading.
4. **Fine-Tuned Weights**: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there.
---
## Training
All four models are trained through one unified script:
```bash
# Train individual models
python train.py --model blip # ~1.5 hours on Apple Silicon
python train.py --model vit_gpt2 # ~1 hour
python train.py --model git # ~20 minutes
python train.py --model custom # ~3 hours (15 epochs)
```
### What happens during training
1. **Dataset loading** – Downloads MS-COCO captions from HuggingFace (cached after the first download)
2. **Training** – Images are processed by the vision encoder, captions by the text decoder
3. **Validation** – After each epoch, computes validation loss + CIDEr score on held-out images
4. **Checkpointing** – Saves two checkpoints:
   - `outputs/{model}/best/` – the model with the **highest CIDEr score** (use this for evaluation)
   - `outputs/{model}/latest/` – the most recent epoch (use for debugging or continuing training)
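The best/latest bookkeeping boils down to a few lines. A minimal sketch (illustrative only; the real logic lives in `train.py`, and `save_fn` stands in for the model-specific save call):

```python
from pathlib import Path

def save_checkpoints(model_name, cider, best_cider, save_fn, root="outputs"):
    """Write the 'latest' checkpoint every epoch, and promote it to 'best'
    whenever validation CIDEr improves. Returns the updated best score."""
    latest = Path(root) / model_name / "latest"
    latest.mkdir(parents=True, exist_ok=True)
    save_fn(latest)                      # always overwrite latest
    if cider > best_cider:
        best = Path(root) / model_name / "best"
        best.mkdir(parents=True, exist_ok=True)
        save_fn(best)                    # promote only on improvement
        return cider
    return best_cider
```

Called once per epoch, this guarantees `best/` only ever moves forward while `latest/` always reflects the most recent state.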
### Key hyperparameters
| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|-|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |
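The gap between batch size and effective batch size in the table above is bridged by gradient accumulation: gradients from several micro-batches are averaged before a single optimizer step. A framework-free numeric sketch of the pattern (real training does this via repeated `loss.backward()` calls in PyTorch):

```python
# Effective batch 64 with micro-batch 16 => 4 accumulation steps.
accum_steps = 64 // 16

# Toy per-micro-batch "gradients" (e.g. d(loss)/d(w) for one weight).
micro_grads = [0.8, 1.2, 0.9, 1.1]
assert len(micro_grads) == accum_steps

# Accumulate, then take ONE optimizer step with the averaged gradient,
# as if it had been computed on the full 64-image batch.
avg_grad = sum(micro_grads) / accum_steps
w, lr = 0.0, 1e-5
w -= lr * avg_grad
print(w)
```

This lets a laptop-class GPU train with large-batch dynamics while only ever holding 8–16 images in memory at once.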
---
## Evaluation
### Basic evaluation
```bash
# Evaluate a single model (computes CIDEr score)
python eval.py --model blip --weights best
# Evaluate with pre-trained weights (no fine-tuning)
python eval.py --model blip --weights base
# Compare all models side by side
python eval.py --model all --weights best
```
### Experiments
```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best
# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best
# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```
### Custom decoding settings
```bash
python eval.py --model blip --weights best \
--num_beams 10 \
--max_new_tokens 50 \
--length_penalty 1.2
```
### All command-line options
| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (more = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | > 1.0 favors longer captions, < 1.0 favors shorter |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
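Under beam search, `--length_penalty` acts as an exponent on the sequence length when finished beams are ranked, mirroring the Hugging Face `transformers` convention of `sum_logprob / length**length_penalty`. A small sketch with made-up log-probabilities:

```python
def beam_score(token_logprobs, length_penalty):
    """Rank score of a finished beam: summed log-probability divided by
    length**length_penalty (the Hugging Face beam search convention)."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

short = [-0.5] * 8    # 8-token caption
long = [-0.5] * 16    # 16-token caption, same per-token confidence

# With penalty 1.0 both normalize to the same per-token score...
print(beam_score(short, 1.0), beam_score(long, 1.0))
# ...while penalty > 1.0 divides the (negative) sum by a larger number
# for longer beams, making them rank higher, i.e. favoring longer captions.
print(beam_score(short, 1.2), beam_score(long, 1.2))
```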
---
## Streamlit Demo
```bash
streamlit run app.py
```
The demo provides three tabs:
### 🖼️ Caption Tab
Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode.
### 🔄 Compare All Models Tab
Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption.
### 📊 Experiment Results Tab
Browse pre-computed results from all three experiments.
### Sidebar Controls
- **Weight Source** – Switch between pre-trained models and your fine-tuned checkpoints
- **Architecture** – Select any of the four models (each has an info card explaining its approach)
- **Generation Mode** – Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for the Custom VLM
- **Advanced Controls** – Adjust beam width, temperature, length penalty, top-k, and top-p
> **Safety:** All captions pass through a toxicity filter (`detoxify`) before being displayed.
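The gate can be sketched as a thin wrapper. This is illustrative: the predictor is injected so the sketch stays library-agnostic (in the app it would be something like `Detoxify('original').predict`, which returns a dict of scores including `"toxicity"`), and the 0.5 threshold and fallback text are assumptions, not the repo's actual values:

```python
def safe_caption(caption, predict_fn, threshold=0.5,
                 fallback="[caption withheld by safety filter]"):
    """Return the caption unless its toxicity score crosses the threshold.

    predict_fn maps text -> dict of scores, e.g. {"toxicity": 0.01, ...},
    matching the shape detoxify returns.
    """
    scores = predict_fn(caption)
    if scores.get("toxicity", 0.0) >= threshold:
        return fallback
    return caption
```

Injecting the predictor also makes the filter trivial to unit-test with a stub instead of loading the detoxify model.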
---
## Configuration
Hyperparameters are managed through Python dataclasses in `configs/`:
```
configs/
├── base_config.py        # Shared defaults (batch size, image size, optimizer settings)
├── blip_config.py        # BLIP-specific overrides
├── vit_gpt2_config.py    # ViT-GPT2-specific overrides
├── git_config.py         # GIT-specific overrides
└── custom_vlm_config.py  # Custom VLM overrides (decoder architecture, learning rates)
```
Access any config in code:
```python
from configs import get_config
cfg = get_config("blip") # Returns BlipConfig instance with all settings
```
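The override pattern behind this might look as follows. This is an illustrative sketch under assumptions: the field names, and the `Salesforce/blip-image-captioning-base` checkpoint id, are not taken from the repo's actual config classes (though the values match the hyperparameter table above):

```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    batch_size: int = 16
    image_size: int = 224
    learning_rate: float = 2e-5

@dataclass
class BlipConfig(BaseConfig):
    # BLIP overrides only what differs from the shared defaults.
    learning_rate: float = 1e-5
    model_name: str = "Salesforce/blip-image-captioning-base"

cfg = BlipConfig()
print(cfg.batch_size, cfg.learning_rate)  # inherited default, overridden value
```

Dataclass inheritance keeps each model's file down to just its deviations from `base_config.py`.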
---
## Experiments & Key Results
### 1. Cross-Attention Masking: What Happens When We Hide Image Patches?
| What We Did | CIDEr Score | Change |
|---|---|---|
| Showed the full image | 0.5371 | Baseline |
| Hid 50% of image patches randomly | 0.5371 | **No change** |
| Showed only the center of the image | 0.5371 | **No change** |
| Compressed entire image to 1 token | 0.0008 | **−99.8%** |
**Takeaway:** Half the image patches are redundant, but spatial structure is essential.
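The masking conditions can be sketched as operations on the attention mask over the vision encoder's patch tokens. Illustrative only: a ViT-style 14×14 grid of 196 patches is assumed, and the mask is a plain Python list rather than the tensors used in `experiments/ablation_study.py`:

```python
import random

def random_patch_mask(num_patches=196, keep_fraction=0.5, seed=0):
    """1 = the decoder may attend to this patch, 0 = hidden."""
    rng = random.Random(seed)
    mask = [1] * num_patches
    hidden = rng.sample(range(num_patches), int(num_patches * (1 - keep_fraction)))
    for i in hidden:
        mask[i] = 0
    return mask

def center_crop_mask(grid=14, keep=6):
    """Keep only a keep x keep block of patches at the center of the grid."""
    lo, hi = (grid - keep) // 2, (grid - keep) // 2 + keep
    return [1 if lo <= r < hi and lo <= c < hi else 0
            for r in range(grid) for c in range(grid)]

print(sum(random_patch_mask()))   # 98 of 196 patches visible
print(sum(center_crop_mask()))    # 36 central patches visible
```

Passing such a mask into the decoder's cross-attention hides the zeroed patches, which is exactly what the table varies.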
### 2. Beam Search Settings: What Produces the Best Captions?
**Best configuration found:** beam_size=10, length_penalty=1.2, max_tokens=50 → **CIDEr: 0.6199**
A wider beam together with a length penalty of 1.2 improves caption quality by ~13% over the defaults.
### 3. Caption Filtering: Does Training Data Quality Matter?
| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | **0.6359** |
| Filtered (5–25 words) | 0.5877 |
Raw captions work best on this already-clean dataset; filtering is still recommended for noisier data.
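The 5–25-word filter from this experiment reduces to a word-count check. A sketch of the idea (the repo's `data_prep.py` is the authoritative version):

```python
def keep_caption(caption, min_words=5, max_words=25):
    """Drop captions that are suspiciously short or long."""
    return min_words <= len(caption.split()) <= max_words

captions = [
    "dog",                                       # too short: dropped
    "a small dog chasing a ball in the park",    # kept
    "word " * 30,                                # too long: dropped
]
print([c for c in captions if keep_caption(c)])
```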
---
## Project Structure
```
project_02/
├── app.py                       # Streamlit web demo (3 tabs)
├── config.py                    # Backward-compatible config wrapper
├── data_prep.py                 # Dataset loading + caption filtering
├── eval.py                      # CIDEr evaluator + experiment runner
├── train.py                     # Unified training loop for all 4 models
├── requirements.txt             # Python dependencies
├── input.txt                    # Shakespeare corpus (vocabulary source)
├── shakespeare_transformer.pt   # Pre-trained Shakespeare decoder weights
│
├── configs/                     # Hyperparameter configs
│   ├── base_config.py           # Shared defaults
│   ├── blip_config.py           # BLIP settings
│   ├── vit_gpt2_config.py       # ViT-GPT2 settings
│   ├── git_config.py            # GIT settings
│   └── custom_vlm_config.py     # Custom VLM settings
│
├── models/                      # Model implementations
│   ├── blip_tuner.py            # BLIP (gated cross-attention)
│   ├── vit_gpt2_tuner.py        # ViT-GPT2 (full cross-attention)
│   ├── git_tuner.py             # GIT (no cross-attention)
│   └── custom_vlm.py            # Custom VLM (visual prefix-tuning)
│
├── experiments/                 # Experiment scripts and results
│   ├── ablation_study.py        # Image masking experiment
│   ├── parameter_sweep.py       # Beam search settings sweep
│   ├── data_prep_analysis.py    # Caption filtering comparison
│   └── cross_attention_patterns.py  # Architecture comparison table
│
├── outputs/                     # Saved model checkpoints
│   ├── blip/{best,latest}/
│   └── custom_vlm/{best,latest}/
│
├── detailed_technical_report_cross_attention_vlm_image_captioning.md
├── simplified_overview_vlm_image_captioning_project.md
└── README.md                    # This file
```
---
## Tech Stack
| Component | Technology |
|---|---|
| Training Framework | PyTorch + HuggingFace Transformers |
| Dataset | MS-COCO Captions (via HuggingFace Datasets) |
| Evaluation Metric | CIDEr (via pycocoevalcap) |
| Safety Filter | detoxify (toxicity detection) |
| Web Demo | Streamlit |
| Hardware | Apple Silicon Mac with MPS acceleration |
---
## Author
**Manoj Kumar** – March 2026