---
title: VLM Caption Lab
emoji: πŸ–ΌοΈ
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---

# πŸ”¬ VLM Caption Lab

Compare how different Vision-Language Models look at images while writing captions β€” four architectures, one dataset, one evaluation metric.

VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to image captioning (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.


## Architecture Comparison

| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| BLIP | Selective gated attention β€” looks at the image only when needed | 224M | 0.6199 (optimized) |
| ViT-GPT2 | Full attention β€” looks at the entire image for every word | 239M | ~0.55 |
| GIT | Memory-based β€” memorizes the image first, writes from memory | 177M | ~0.54 |
| Custom VLM | Built from scratch β€” Shakespeare decoder + visual bridge | 103M (16.2M trainable) | 0.2863 |

> **What is CIDEr?** CIDEr (Consensus-based Image Description Evaluation) compares the model's caption against five human-written descriptions of the same image. Higher is better. Unlike simple overlap metrics, CIDEr weights n-grams by TF-IDF, so captions that reuse the same distinctive phrases as the human references score highest.
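To build intuition for what a consensus metric measures, here is a toy n-gram overlap score in pure Python. The real metric (pycocoevalcap's CIDEr) additionally applies TF-IDF weighting and cosine similarity across n-gram orders 1–4; this sketch is illustrative only:

```python
from collections import Counter

def ngrams(text, n):
    """Lowercased whitespace tokens -> multiset of n-grams."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def consensus_overlap(candidate, references, n=2):
    """Toy CIDEr-flavoured score: average fraction of the candidate's
    n-grams matched by each human reference (no TF-IDF, no cosine)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    scores = []
    for ref in references:
        matched = sum((cand & ngrams(ref, n)).values())
        scores.append(matched / sum(cand.values()))
    return sum(scores) / len(scores)

refs = ["a dog runs on the beach", "a brown dog running on a sandy beach"]
print(consensus_overlap("a dog runs on the beach", refs))  # β†’ 0.5
```

The candidate matches the first reference exactly (score 1.0) but shares no bigrams with the second (score 0.0), averaging 0.5 — which is why several references per image give a more robust consensus.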


## 🌐 Live Demo & Deployment

The easiest way to test this project is via the live web demo.

πŸ‘‰ [Insert Your Live Hosted Link Here]

(If deploying yourself, see `DEPLOYMENT_GUIDE.md` for instructions on hosting this securely and for free on Hugging Face Spaces.)


## Quick Start (Local Run)

If you prefer to run this locally rather than using the web demo, follow these steps.

⚠️ **Note on Weights:** You do not need to train the models yourself to test the app.

- Base model weights (BLIP, ViT-GPT2) download automatically from Hugging Face on the first run.
- The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
- To skip training completely, just run `streamlit run app.py`.

### Prerequisites

- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB of disk space for model checkpoints

### Setup

```bash
# Clone the repository
git clone <repo-url>
cd project_02

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```

### Dependencies

| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Loads pre-trained BLIP, ViT-GPT2, and GIT models from HuggingFace |
| `datasets` | Downloads and loads the MS-COCO caption dataset from HuggingFace |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Computes CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter β€” checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |

## πŸš€ What to Expect on First Run

When you clone this repository and run `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens:

1. **Automatic model downloads:** You do not need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library automatically downloads the base weights from HuggingFace the first time you select them.
2. **Download time:** The initial download may take a few minutes depending on your internet connection (BLIP is ~900 MB, ViT-GPT2 is ~1 GB). Weights are cached locally, so subsequent loads are nearly instant.
3. **Custom VLM weights:** The `shakespeare_transformer.pt` file (~71 MB) included in this repository contains the pre-trained text decoder for the Custom VLM, so it can generate Shakespearean text immediately, with no download.
4. **Fine-tuned weights:** To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). They automatically create an `outputs/` directory and save your fine-tuned weights there.
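The weight-resolution behaviour above can be summarised in a small helper. This is a hypothetical sketch (`resolve_weights` is not a function from this repo); the app's actual loading code may differ:

```python
from pathlib import Path

def resolve_weights(model_name: str, source: str, outputs_dir: str = "outputs"):
    """Return the local checkpoint directory for 'best'/'latest', or None
    to signal that base weights should come from the Hugging Face Hub."""
    if source == "base":
        return None  # transformers downloads and caches the base weights
    ckpt = Path(outputs_dir) / model_name / source  # e.g. outputs/blip/best
    if not ckpt.is_dir():
        raise FileNotFoundError(
            f"No fine-tuned checkpoint at {ckpt}; "
            f"run `python train.py --model {model_name}` first.")
    return ckpt
```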

## Training

All four models are trained through one unified script:

```bash
# Train individual models
python train.py --model blip          # ~1.5 hours on Apple Silicon
python train.py --model vit_gpt2      # ~1 hour
python train.py --model git           # ~20 minutes
python train.py --model custom        # ~3 hours (15 epochs)
```

### What happens during training

1. **Dataset loading** β€” downloads MS-COCO captions from HuggingFace (cached after the first download)
2. **Training** β€” images are processed by the vision encoder, captions by the text decoder
3. **Validation** β€” after each epoch, computes validation loss + CIDEr score on held-out images
4. **Checkpointing** β€” saves two checkpoints:
   - `outputs/{model}/best/` β€” the model with the highest CIDEr score (use this for evaluation)
   - `outputs/{model}/latest/` β€” the most recent epoch (use for debugging or resuming training)
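The best/latest checkpointing rule can be sketched like this (an illustrative stdlib-only stand-in: the real `train.py` saves model weights rather than JSON, and the function name is assumed):

```python
import json
from pathlib import Path

def save_checkpoints(out_dir, model_name, epoch, cider, state, best_cider):
    """Always refresh latest/; refresh best/ only when CIDEr improves.
    Returns the updated best CIDEr score."""
    tags = ["latest"] + (["best"] if cider > best_cider else [])
    for tag in tags:
        ckpt = Path(out_dir) / model_name / tag
        ckpt.mkdir(parents=True, exist_ok=True)
        (ckpt / "state.json").write_text(
            json.dumps({"epoch": epoch, "cider": cider, "state": state}))
    return max(best_cider, cider)
```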

### Key hyperparameters

| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|---|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |
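The gap between batch size and effective batch size implies gradient accumulation (e.g. for BLIP, 16 × 4 accumulation steps = 64). A toy sketch of the pattern, with the real backward/step calls shown only as comments (the training scripts' internals are not reproduced here):

```python
def train_with_accumulation(micro_batches, accum_steps):
    """Step the optimizer once per `accum_steps` micro-batches, so the
    effective batch size is micro_batch_size * accum_steps. This toy
    stand-in records how many micro-batches each update averaged over."""
    optimizer_steps, pending = [], 0
    for i, _batch in enumerate(micro_batches, start=1):
        pending += 1                         # loss.backward() would accumulate grads
        if i % accum_steps == 0:
            optimizer_steps.append(pending)  # optimizer.step(); optimizer.zero_grad()
            pending = 0
    return optimizer_steps

print(train_with_accumulation(range(8), accum_steps=4))  # β†’ [4, 4]
```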

## Evaluation

### Basic evaluation

```bash
# Evaluate a single model (computes CIDEr score)
python eval.py --model blip --weights best

# Evaluate with pre-trained weights (no fine-tuning)
python eval.py --model blip --weights base

# Compare all models side by side
python eval.py --model all --weights best
```

### Experiments

```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best

# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best

# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```

### Custom decoding settings

```bash
python eval.py --model blip --weights best \
    --num_beams 10 \
    --max_new_tokens 50 \
    --length_penalty 1.2
```

### All command-line options

| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (more = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | > 1.0 favors longer captions, < 1.0 shorter (HF `generate` semantics) |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
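A minimal `argparse` sketch of how these flags could be wired (hypothetical; the repo's actual parser in `eval.py` may differ):

```python
import argparse

def build_parser():
    """Wire up the evaluation CLI flags from the table above."""
    p = argparse.ArgumentParser(description="Evaluate captioning models")
    p.add_argument("--model", default="blip",
                   choices=["blip", "vit_gpt2", "git", "custom", "all"])
    p.add_argument("--weights", default="base",
                   choices=["base", "finetuned", "best"])
    p.add_argument("--eval_batches", type=int, default=25)
    p.add_argument("--num_beams", type=int, default=10)
    p.add_argument("--max_new_tokens", type=int, default=50)
    p.add_argument("--length_penalty", type=float, default=1.2)
    p.add_argument("--ablation", action="store_true")
    p.add_argument("--sweep", action="store_true")
    p.add_argument("--data-prep-analysis", dest="data_prep_analysis",
                   action="store_true")
    return p

args = build_parser().parse_args(["--model", "all", "--num_beams", "5"])
print(args.model, args.num_beams, args.length_penalty)  # β†’ all 5 1.2
```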

## Streamlit Demo

```bash
streamlit run app.py
```

The demo provides three tabs:

πŸ–ΌοΈ Caption Tab

Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode.

### πŸ“Š Compare All Models Tab

Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption.

### πŸ“ˆ Experiment Results Tab

Browse pre-computed results from all three experiments.

### Sidebar Controls

- **Weight Source** β€” switch between pre-trained models and your fine-tuned checkpoints
- **Architecture** β€” select any of the four models (each has an info card explaining its approach)
- **Generation Mode** β€” choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for the Custom VLM
- **Advanced Controls** β€” adjust beam width, temperature, length penalty, top-k, and top-p

**Safety:** All captions pass through a toxicity filter (`detoxify`) before being displayed.
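The safety gate can be sketched as a simple threshold over detoxify's per-category scores. The wrapper and the 0.5 threshold are assumptions for illustration, not the app's exact logic; in the app the scores would come from `Detoxify("original").predict(caption)`, while here they are passed in so the gating logic stays library-free:

```python
def is_safe(scores: dict, threshold: float = 0.5) -> bool:
    """Allow a caption to be shown only if every toxicity-category score
    (values in [0, 1], e.g. 'toxicity', 'insult') is below the threshold."""
    return all(v < threshold for v in scores.values())

print(is_safe({"toxicity": 0.02, "insult": 0.01}))  # β†’ True
print(is_safe({"toxicity": 0.91, "insult": 0.40}))  # β†’ False
```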


## Configuration

Hyperparameters are managed through Python dataclasses in `configs/`:

```
configs/
β”œβ”€β”€ base_config.py          # Shared defaults (batch size, image size, optimizer settings)
β”œβ”€β”€ blip_config.py          # BLIP-specific overrides
β”œβ”€β”€ vit_gpt2_config.py      # ViT-GPT2-specific overrides
β”œβ”€β”€ git_config.py           # GIT-specific overrides
└── custom_vlm_config.py    # Custom VLM overrides (decoder architecture, learning rates)
```

Access any config in code:

```python
from configs import get_config

cfg = get_config("blip")  # returns a BlipConfig instance with all settings
```
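The base-plus-overrides pattern might look roughly like this (field names and the model id are assumptions for illustration; see `configs/` for the real definitions):

```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    # Shared defaults inherited by every model config
    batch_size: int = 16
    image_size: int = 224
    learning_rate: float = 2e-5
    epochs: int = 3

@dataclass
class BlipConfig(BaseConfig):
    # Model-specific overrides; everything else comes from BaseConfig
    model_name: str = "Salesforce/blip-image-captioning-base"  # assumed id
    learning_rate: float = 1e-5

cfg = BlipConfig()
print(cfg.learning_rate, cfg.batch_size)  # β†’ 1e-05 16
```

Inheritance keeps each model file down to the handful of values that actually differ from the shared defaults.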

## Experiments & Key Results

### 1. Cross-Attention Masking: What Happens When We Hide Image Patches?

| What We Did | CIDEr Score | Change |
|---|---|---|
| Showed the full image | 0.5371 | β€” Baseline |
| Hid 50% of image patches randomly | 0.5371 | No change |
| Showed only the center of the image | 0.5371 | No change |
| Compressed the entire image to 1 token | 0.0008 | βˆ’99.8% |

**Takeaway:** Half the image patches are redundant, but spatial structure is essential.
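The masking conditions can be sketched as boolean masks over a 14×14 ViT patch grid (196 patches, True = visible). These helpers are illustrative stand-ins for `experiments/ablation_study.py`, whose internals may differ:

```python
import random

def random_patch_mask(num_patches=196, keep_frac=0.5, seed=0):
    """Randomly keep a fraction of patch positions (the '50% hidden' row)."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(num_patches), int(num_patches * keep_frac)))
    return [i in keep for i in range(num_patches)]

def center_mask(grid=14, half_width=4):
    """Keep only a central window of the grid x grid layout
    (the 'center only' row)."""
    lo, hi = grid // 2 - half_width, grid // 2 + half_width
    return [lo <= r < hi and lo <= c < hi
            for r in range(grid) for c in range(grid)]
```

Each mask would then be applied to the cross-attention inputs so the decoder can only attend to visible patches; the "1 token" condition corresponds to pooling all 196 patch embeddings into a single vector instead.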

### 2. Beam Search Settings: What Produces the Best Captions?

Best configuration found: `beam_size=10`, `length_penalty=1.2`, `max_tokens=50` β†’ CIDEr 0.6199

Wider beams plus a mild length penalty of 1.2 (which, under HuggingFace `generate` semantics, slightly favors longer captions) improve caption quality by ~13%.

### 3. Caption Filtering: Does Training Data Quality Matter?

| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | 0.6359 |
| Filtered (5–25 words) | 0.5877 |

Raw captions work best for this already-clean dataset; filtering is recommended for noisier data.
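The filtering strategy itself is just a word-count window; a minimal sketch (see `data_prep.py` for the repo's actual implementation):

```python
def filter_captions(captions, min_words=5, max_words=25):
    """Keep captions whose whitespace word count lies in
    [min_words, max_words] -- the 'Filtered (5-25 words)' strategy."""
    return [c for c in captions if min_words <= len(c.split()) <= max_words]

caps = ["a cat",
        "a cat sits on a warm windowsill",
        "word " * 30]
print(filter_captions(caps))  # β†’ ['a cat sits on a warm windowsill']
```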


## Project Structure

```
project_02/
β”œβ”€β”€ app.py                              # Streamlit web demo (3 tabs)
β”œβ”€β”€ config.py                           # Backward-compatible config wrapper
β”œβ”€β”€ data_prep.py                        # Dataset loading + caption filtering
β”œβ”€β”€ eval.py                             # CIDEr evaluator + experiment runner
β”œβ”€β”€ train.py                            # Unified training loop for all 4 models
β”œβ”€β”€ requirements.txt                    # Python dependencies
β”œβ”€β”€ input.txt                           # Shakespeare corpus (vocabulary source)
β”œβ”€β”€ shakespeare_transformer.pt          # Pre-trained Shakespeare decoder weights
β”‚
β”œβ”€β”€ configs/                            # Hyperparameter configs
β”‚   β”œβ”€β”€ base_config.py                  # Shared defaults
β”‚   β”œβ”€β”€ blip_config.py                  # BLIP settings
β”‚   β”œβ”€β”€ vit_gpt2_config.py              # ViT-GPT2 settings
β”‚   β”œβ”€β”€ git_config.py                   # GIT settings
β”‚   └── custom_vlm_config.py            # Custom VLM settings
β”‚
β”œβ”€β”€ models/                             # Model implementations
β”‚   β”œβ”€β”€ blip_tuner.py                   # BLIP (gated cross-attention)
β”‚   β”œβ”€β”€ vit_gpt2_tuner.py               # ViT-GPT2 (full cross-attention)
β”‚   β”œβ”€β”€ git_tuner.py                    # GIT (no cross-attention)
β”‚   └── custom_vlm.py                   # Custom VLM (visual prefix-tuning)
β”‚
β”œβ”€β”€ experiments/                        # Experiment scripts and results
β”‚   β”œβ”€β”€ ablation_study.py               # Image masking experiment
β”‚   β”œβ”€β”€ parameter_sweep.py              # Beam search settings sweep
β”‚   β”œβ”€β”€ data_prep_analysis.py           # Caption filtering comparison
β”‚   └── cross_attention_patterns.py     # Architecture comparison table
β”‚
β”œβ”€β”€ outputs/                            # Saved model checkpoints
β”‚   β”œβ”€β”€ blip/{best,latest}/
β”‚   └── custom_vlm/{best,latest}/
β”‚
β”œβ”€β”€ detailed_technical_report_cross_attention_vlm_image_captioning.md
β”œβ”€β”€ simplified_overview_vlm_image_captioning_project.md
└── README.md                           # This file
```

## Tech Stack

| Component | Technology |
|---|---|
| Training Framework | PyTorch + HuggingFace Transformers |
| Dataset | MS-COCO Captions (via HuggingFace Datasets) |
| Evaluation Metric | CIDEr (via pycocoevalcap) |
| Safety Filter | detoxify (toxicity detection) |
| Web Demo | Streamlit |
| Hardware | Apple Silicon Mac with MPS acceleration |

## Author

Manoj Kumar β€” March 2026