---
title: VLM Caption Lab
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---
🔬 VLM Caption Lab
Compare how different Vision-Language Models look at images while writing captions: four architectures, one dataset, one evaluation metric.
VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to image captioning (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.
Architecture Comparison
| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| BLIP | Selective gated attention – looks at the image only when needed | 224M | 0.6199 (optimized) |
| ViT-GPT2 | Full attention – looks at the entire image for every word | 239M | ~0.55 |
| GIT | Memory-based – memorizes the image first, writes from memory | 177M | ~0.54 |
| Custom VLM | Built from scratch – Shakespeare decoder + visual bridge | 103M (16.2M trainable) | 0.2863 |
What is CIDEr? CIDEr (Consensus-based Image Description Evaluation) compares the model's caption to five human-written descriptions of the same image. Higher = better. A score of 1.0 means perfect overlap with human references.
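The idea behind the metric can be sketched in plain Python. This is a deliberately simplified, illustrative score (plain n-gram cosine overlap averaged over n = 1..4, without the TF-IDF weighting and length penalty that the real `pycocoevalcap` implementation applies):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def consensus_score(candidate, references, max_n=4):
    """Toy CIDEr-style score: cosine similarity between candidate and
    reference n-gram count vectors, averaged over n = 1..max_n and over
    all references. Not the real metric: CIDEr also TF-IDF-weights each
    n-gram using statistics of the whole reference corpus."""
    cand_tokens = candidate.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        c = ngrams(cand_tokens, n)
        for ref in references:
            r = ngrams(ref.lower().split(), n)
            dot = sum(c[k] * r[k] for k in c if k in r)
            norm = math.sqrt(sum(v * v for v in c.values())) * \
                   math.sqrt(sum(v * v for v in r.values()))
            total += dot / norm if norm else 0.0
    return total / (max_n * len(references))

refs = ["a dog runs on the beach", "a brown dog running along the sand"]
score = consensus_score("a dog runs on the beach", refs)
```

A caption identical to a reference scores 1.0 against that reference, and anything with zero word overlap scores 0.0, which mirrors how CIDEr rewards consensus with human descriptions.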
🚀 Live Demo & Deployment
The easiest way to test this project is via the live web demo.
🔗 [Insert Your Live Hosted Link Here]
(If deploying yourself, see the DEPLOYMENT_GUIDE.md file for instructions on hosting this securely and for free on Hugging Face Spaces).
Quick Start (Local Run)
If you prefer to run this locally rather than using the web demo, follow these steps.
⚠️ Note on Weights: You do not need to train the models yourself to test the app.
- Base model weights (BLIP, ViT-GPT2) will download automatically from Hugging Face on the first run.
- The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
- To skip training completely, you only need to run `streamlit run app.py`!
Prerequisites
- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB disk space for model checkpoints
Setup
```bash
# Clone the repository
git clone <repo-url>
cd project_02

# Create a virtual environment
python -m venv venv
source venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```
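The same availability checks are typically used in code to pick a device. A minimal sketch (the project's actual device handling may differ):

```python
import torch

# Prefer Apple's MPS backend, then CUDA, then fall back to CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")
```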
Dependencies
| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Load pre-trained BLIP, ViT-GPT2, and GIT models from HuggingFace |
| `datasets` | Download and load the MS-COCO caption dataset from HuggingFace |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Compute CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter – checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |
📦 What to Expect on First Run
When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens:

- Automatic Model Downloads: You do not need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them.
- Download Time: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). The weights are cached locally on your machine, so subsequent loads will be nearly instant.
- Custom VLM Weights: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. Because it ships with the repo, the Custom VLM is ready to generate Shakespearean text immediately, without any downloading.
- Fine-Tuned Weights: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there.
Training
All four models are trained through one unified script:
```bash
# Train individual models
python train.py --model blip      # ~1.5 hours on Apple Silicon
python train.py --model vit_gpt2  # ~1 hour
python train.py --model git       # ~20 minutes
python train.py --model custom    # ~3 hours (15 epochs)
```
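The unified entry point boils down to a single `--model` switch. A sketch of the CLI with `argparse` (illustrative, not the project's exact `train.py`):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Unified VLM captioning trainer")
    parser.add_argument(
        "--model",
        choices=["blip", "vit_gpt2", "git", "custom"],
        default="blip",
        help="Which architecture to fine-tune",
    )
    return parser

args = build_parser().parse_args(["--model", "git"])  # e.g. train GIT
```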
What happens during training
- Dataset loading → Downloads MS-COCO captions from HuggingFace (cached after the first download)
- Training → Images are processed by the vision encoder, captions by the text decoder
- Validation → After each epoch, computes validation loss + CIDEr score on held-out images
- Checkpointing → Saves two checkpoints:
  - `outputs/{model}/best/` → the model with the highest CIDEr score (use this for evaluation)
  - `outputs/{model}/latest/` → the most recent epoch (use for debugging or continuing training)
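The two-checkpoint scheme can be sketched as follows, with `save_checkpoint` standing in for whatever the real loop uses to write model state:

```python
def update_checkpoints(cider, best_cider, save_checkpoint):
    """Always refresh latest/; refresh best/ only on a new best CIDEr."""
    save_checkpoint("latest")       # most recent epoch: debugging / resuming
    if cider > best_cider:
        save_checkpoint("best")     # highest validation CIDEr: use for eval
        best_cider = cider
    return best_cider

saved, best = [], 0.0
for epoch_cider in [0.41, 0.55, 0.52]:  # per-epoch validation CIDEr (made up)
    best = update_checkpoints(epoch_cider, best, saved.append)
```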
Key hyperparameters
| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|---|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |
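The gap between batch size and effective batch size implies gradient accumulation (for BLIP, 64 / 16 = 4 micro-batches per optimizer step). A minimal PyTorch sketch of the pattern, assuming a standard training loop:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
accum_steps = 64 // 16  # effective batch of 64 from micro-batches of 16

optimizer_steps = 0
for step in range(8):  # 8 micro-batches -> 2 optimizer updates
    x = torch.randn(16, 4)
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```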
Evaluation
Basic evaluation
```bash
# Evaluate a single model (computes CIDEr score)
python eval.py --model blip --weights best

# Evaluate with pre-trained weights (no fine-tuning)
python eval.py --model blip --weights base

# Compare all models side by side
python eval.py --model all --weights best
```
Experiments
```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best

# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best

# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```
Custom decoding settings
```bash
python eval.py --model blip --weights best \
    --num_beams 10 \
    --max_new_tokens 50 \
    --length_penalty 1.2
```
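These flags correspond to standard HuggingFace `generate()` keyword arguments. A sketch of the assumed mapping (`early_stopping` is an extra assumption, not listed in the CLI table below):

```python
def generation_kwargs(num_beams=10, max_new_tokens=50, length_penalty=1.2):
    """Translate the eval.py decoding flags into generate() keyword arguments."""
    return {
        "num_beams": num_beams,            # beam search width
        "max_new_tokens": max_new_tokens,  # cap on generated caption length
        "length_penalty": length_penalty,  # biases beam scores by sequence length
        "early_stopping": True,            # assumption: stop when all beams finish
    }

kwargs = generation_kwargs(num_beams=10, length_penalty=1.2)
# later: model.generate(**inputs, **kwargs)
```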
All command-line options
| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (more = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | < 1.0 = longer captions, > 1.0 = shorter |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
Streamlit Demo
```bash
streamlit run app.py
```
The demo provides three tabs:
🖼️ Caption Tab
Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode.
🔄 Compare All Models Tab
Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption.
📊 Experiment Results Tab
Browse pre-computed results from all three experiments.
Sidebar Controls
- Weight Source – Switch between pre-trained models and your fine-tuned checkpoints
- Architecture – Select any of the four models (each has an info card explaining its approach)
- Generation Mode – Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for Custom VLM
- Advanced Controls – Adjust beam width, temperature, length penalty, top-k, and top-p
Safety: All captions pass through a toxicity filter (`detoxify`) before being displayed.
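The filtering step can be sketched as below, with `predict_toxicity` standing in for the Detoxify call (which returns per-category scores between 0 and 1); the 0.5 threshold and the placeholder message are illustrative assumptions:

```python
def filter_caption(caption, predict_toxicity, threshold=0.5):
    """Return the caption only if every toxicity score stays below threshold."""
    scores = predict_toxicity(caption)  # e.g. {"toxicity": 0.01, "insult": 0.0}
    if max(scores.values()) >= threshold:
        return "[caption withheld by safety filter]"
    return caption

# Stub standing in for Detoxify("original").predict:
safe = filter_caption("a dog on a beach", lambda c: {"toxicity": 0.01})
```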
Configuration
Hyperparameters are managed through Python dataclasses in `configs/`:

```
configs/
├── base_config.py        # Shared defaults (batch size, image size, optimizer settings)
├── blip_config.py        # BLIP-specific overrides
├── vit_gpt2_config.py    # ViT-GPT2-specific overrides
├── git_config.py         # GIT-specific overrides
└── custom_vlm_config.py  # Custom VLM overrides (decoder architecture, learning rates)
```
Access any config in code:

```python
from configs import get_config

cfg = get_config("blip")  # Returns a BlipConfig instance with all settings
```
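A minimal sketch of the dataclass-override pattern (the field values come from the hyperparameter table above, but field names beyond those are illustrative; see `configs/` for the real definitions):

```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    batch_size: int = 16
    image_size: int = 224
    learning_rate: float = 2e-5

@dataclass
class BlipConfig(BaseConfig):
    learning_rate: float = 1e-5  # BLIP fine-tunes with a lower learning rate

_REGISTRY = {"blip": BlipConfig}

def get_config(name: str) -> BaseConfig:
    """Look up a model name and return a fresh config instance."""
    return _REGISTRY[name]()
```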
Experiments & Key Results
1. Cross-Attention Masking: What Happens When We Hide Image Patches?
| What We Did | CIDEr Score | Change |
|---|---|---|
| Showed the full image | 0.5371 | ← Baseline |
| Hid 50% of image patches randomly | 0.5371 | No change |
| Showed only the center of the image | 0.5371 | No change |
| Compressed entire image to 1 token | 0.0008 | −99.8% |
Takeaway: Half the image patches are redundant, but spatial structure is essential.
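The masking conditions can be sketched as operations on an attention mask over the encoder's image patches (a simplified sketch; the real experiment manipulates the model's cross-attention inputs):

```python
import random

def random_patch_mask(num_patches, keep_fraction=0.5, seed=0):
    """1 = the decoder may attend to this patch, 0 = patch hidden."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(num_patches), int(num_patches * keep_fraction)))
    return [1 if i in kept else 0 for i in range(num_patches)]

def center_only_mask(grid=14, border=4):
    """Keep only the central (grid - 2*border)^2 patches of a grid x grid layout."""
    return [
        1 if border <= row < grid - border and border <= col < grid - border else 0
        for row in range(grid)
        for col in range(grid)
    ]

mask = random_patch_mask(196)  # ViT-style 14x14 = 196 patch tokens
```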
2. Beam Search Settings: What Produces the Best Captions?
Best configuration found: `beam_size=10, length_penalty=1.2, max_tokens=50` → CIDEr: 0.6199
More beams and a slight preference for conciseness improve caption quality by ~13%.
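A sweep like this is just an exhaustive grid over decoding settings. A sketch of the search loop, with `evaluate` standing in for a function that runs generation plus CIDEr scoring (the grid values are illustrative):

```python
from itertools import product

def sweep(evaluate):
    """Try every combination; return (best_settings, best_score)."""
    grid = {
        "num_beams": [1, 5, 10],
        "length_penalty": [0.8, 1.0, 1.2],
        "max_new_tokens": [30, 50],
    }
    best_settings, best_score = None, float("-inf")
    for values in product(*grid.values()):
        settings = dict(zip(grid.keys(), values))
        score = evaluate(**settings)  # would run captioning + CIDEr here
        if score > best_score:
            best_settings, best_score = settings, score
    return best_settings, best_score
```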
3. Caption Filtering: Does Training Data Quality Matter?
| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | 0.6359 |
| Filtered (5–25 words) | 0.5877 |
Raw works best for this already-clean dataset. Filtering recommended for noisier data.
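The 5–25-word filter itself is a one-liner; a sketch of the rule described above:

```python
def keep_caption(caption: str, min_words: int = 5, max_words: int = 25) -> bool:
    """Keep only captions whose word count falls within [min_words, max_words]."""
    return min_words <= len(caption.split()) <= max_words

captions = ["a dog", "a brown dog running along a sandy beach at sunset"]
filtered = [c for c in captions if keep_caption(c)]
```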
Project Structure
```
project_02/
├── app.py                      # Streamlit web demo (3 tabs)
├── config.py                   # Backward-compatible config wrapper
├── data_prep.py                # Dataset loading + caption filtering
├── eval.py                     # CIDEr evaluator + experiment runner
├── train.py                    # Unified training loop for all 4 models
├── requirements.txt            # Python dependencies
├── input.txt                   # Shakespeare corpus (vocabulary source)
├── shakespeare_transformer.pt  # Pre-trained Shakespeare decoder weights
│
├── configs/                    # Hyperparameter configs
│   ├── base_config.py          # Shared defaults
│   ├── blip_config.py          # BLIP settings
│   ├── vit_gpt2_config.py      # ViT-GPT2 settings
│   ├── git_config.py           # GIT settings
│   └── custom_vlm_config.py    # Custom VLM settings
│
├── models/                     # Model implementations
│   ├── blip_tuner.py           # BLIP (gated cross-attention)
│   ├── vit_gpt2_tuner.py       # ViT-GPT2 (full cross-attention)
│   ├── git_tuner.py            # GIT (no cross-attention)
│   └── custom_vlm.py           # Custom VLM (visual prefix-tuning)
│
├── experiments/                # Experiment scripts and results
│   ├── ablation_study.py       # Image masking experiment
│   ├── parameter_sweep.py      # Beam search settings sweep
│   ├── data_prep_analysis.py   # Caption filtering comparison
│   └── cross_attention_patterns.py  # Architecture comparison table
│
├── outputs/                    # Saved model checkpoints
│   ├── blip/{best,latest}/
│   └── custom_vlm/{best,latest}/
│
├── detailed_technical_report_cross_attention_vlm_image_captioning.md
├── simplified_overview_vlm_image_captioning_project.md
└── README.md                   # This file
```
Tech Stack
| Component | Technology |
|---|---|
| Training Framework | PyTorch + HuggingFace Transformers |
| Dataset | MS-COCO Captions (via HuggingFace Datasets) |
| Evaluation Metric | CIDEr (via pycocoevalcap) |
| Safety Filter | detoxify (toxicity detection) |
| Web Demo | Streamlit |
| Hardware | Apple Silicon Mac with MPS acceleration |
Author
Manoj Kumar – March 2026