--- title: VLM Caption Lab emoji: 🖼️ colorFrom: blue colorTo: indigo sdk: streamlit sdk_version: 1.40.0 app_file: app.py pinned: false --- # 🔬 VLM Caption Lab **Compare how different Vision-Language Models look at images while writing captions — four architectures, one dataset, one evaluation metric.** VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to **image captioning** (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo. --- ## Architecture Comparison | Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score | |---|---|---|---| | **BLIP** | Selective gated attention — looks at image only when needed | 224M | **0.6199** (optimized) | | **ViT-GPT2** | Full attention — looks at entire image for every word | 239M | ~0.55 | | **GIT** | Memory-based — memorizes image first, writes from memory | 177M | ~0.54 | | **Custom VLM** | Built from scratch — Shakespeare decoder + visual bridge | 103M (16.2M trainable) | **0.2863** | > **What is CIDEr?** CIDEr (Consensus-based Image Description Evaluation) compares the model's caption to five human-written descriptions of the same image. Higher = better. A score of 1.0 means perfect overlap with human references. --- ## 🌐 Live Demo & Deployment **The easiest way to test this project is via the live web demo.** > 👉 **[Insert Your Live Hosted Link Here]** *(If deploying yourself, see the `DEPLOYMENT_GUIDE.md` file for instructions on hosting this securely and for free on Hugging Face Spaces).* --- ## Quick Start (Local Run) If you prefer to run this locally rather than using the web demo, follow these steps. > ⚠️ **Note on Weights**: You do *not* need to train the models yourself to test the app. > - Base model weights (BLIP, ViT-GPT2) will download automatically from Hugging Face on the first run. > - The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo. > - **To skip training completely**, you only need to run `streamlit run app.py`! ### Prerequisites - Python 3.9 or newer - macOS with Apple Silicon (MPS) or Linux with a CUDA GPU - ~8 GB disk space for model checkpoints ### Setup ```bash # Clone the repository git clone cd project_02 # Create a virtual environment python -m venv venv source venv/bin/activate # Install all dependencies pip install -r requirements.txt # Verify that GPU acceleration is available python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())" ``` ### Dependencies | Package | What It Does | |---|---| | `torch` | Deep learning framework (training and inference) | | `transformers` | Load pre-trained BLIP, ViT-GPT2, and GIT models from HuggingFace | | `datasets` | Download and load MS-COCO caption dataset from HuggingFace | | `streamlit` | Interactive web demo interface | | `pycocoevalcap` | Compute CIDEr scores (caption quality metric) | | `detoxify` | Safety filter — checks captions for toxic or offensive content | | `Pillow` | Image loading and processing | | `accelerate` | Training efficiency utilities | --- ## 🚀 What to Expect on First Run When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens: 1. **Automatic Model Downloads**: You do *not* need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them. 2. **Download Time**: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). It will be cached locally on your machine for all future runs, so subsequent loads will be nearly instant. 3. **Custom VLM Weights**: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. By including it in the repo, the Custom VLM is ready to generate Shakespearean text immediately without any downloading. 4. **Fine-Tuned Weights**: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there. --- ## Training All four models are trained through one unified script: ```bash # Train individual models python train.py --model blip # ~1.5 hours on Apple Silicon python train.py --model vit_gpt2 # ~1 hour python train.py --model git # ~20 minutes python train.py --model custom # ~3 hours (15 epochs) ``` ### What happens during training 1. **Dataset loading** — Downloads MS-COCO captions from HuggingFace (cached after first download) 2. **Training** — Images are processed by the vision encoder, captions by the text decoder 3. **Validation** — After each epoch, computes validation loss + CIDEr score on held-out images 4. **Checkpointing** — Saves two checkpoints: - `outputs/{model}/best/` — The model with the **highest CIDEr score** (use this for evaluation) - `outputs/{model}/latest/` — The most recent epoch (use for debugging or continuing training) ### Key hyperparameters | | BLIP | ViT-GPT2 | GIT | Custom VLM | |-|---|---|---|---| | Training epochs | 3 | 3 | 3 | 15 | | Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 | | Batch size | 16 | 8 | 8 | 16 | | Effective batch size | 64 | 32 | 32 | 64 | | Training images | 30,000 | 15,000 | 15,000 | 15,000 | --- ## Evaluation ### Basic evaluation ```bash # Evaluate a single model (computes CIDEr score) python eval.py --model blip --weights best # Evaluate with pre-trained weights (no fine-tuning) python eval.py --model blip --weights base # Compare all models side by side python eval.py --model all --weights best ``` ### Experiments ```bash # Cross-attention masking experiment: what happens when we hide parts of the image? python eval.py --model blip --ablation --weights best # Decoding parameter sweep: find the best beam search settings python eval.py --model blip --sweep --weights best # Caption filtering analysis: does training data quality matter? python eval.py --model blip --data-prep-analysis --weights best ``` ### Custom decoding settings ```bash python eval.py --model blip --weights best \ --num_beams 10 \ --max_new_tokens 50 \ --length_penalty 1.2 ``` ### All command-line options | Flag | Values | Default | What It Controls | |---|---|---|---| | `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate | | `--weights` | base, finetuned, best | base | Which checkpoint to load | | `--eval_batches` | any integer | 25 | How many validation batches to evaluate | | `--num_beams` | 1–10+ | 10 | Beam search width (more = better but slower) | | `--max_new_tokens` | 10–100 | 50 | Maximum caption length | | `--length_penalty` | 0.5–2.0 | 1.2 | < 1.0 = longer captions, > 1.0 = shorter | | `--ablation` | flag | off | Run the cross-attention masking experiment | | `--sweep` | flag | off | Run the decoding parameter sweep | | `--data-prep-analysis` | flag | off | Run the caption filtering comparison | --- ## Streamlit Demo ```bash streamlit run app.py ``` The demo provides three tabs: ### 🖼️ Caption Tab Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode. ### 📊 Compare All Models Tab Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption. ### 📈 Experiment Results Tab Browse pre-computed results from all three experiments. ### Sidebar Controls - **Weight Source** — Switch between pre-trained models and your fine-tuned checkpoints - **Architecture** — Select any of the four models (each has an info card explaining its approach) - **Generation Mode** — Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for Custom VLM - **Advanced Controls** — Adjust beam width, temperature, length penalty, top-k, and top-p > **Safety:** All captions pass through a toxicity filter (`detoxify`) before being displayed. --- ## Configuration Hyperparameters are managed through Python dataclasses in `configs/`: ``` configs/ ├── base_config.py # Shared defaults (batch size, image size, optimizer settings) ├── blip_config.py # BLIP-specific overrides ├── vit_gpt2_config.py # ViT-GPT2-specific overrides ├── git_config.py # GIT-specific overrides └── custom_vlm_config.py # Custom VLM overrides (decoder architecture, learning rates) ``` Access any config in code: ```python from configs import get_config cfg = get_config("blip") # Returns BlipConfig instance with all settings ``` --- ## Experiments & Key Results ### 1. Cross-Attention Masking: What Happens When We Hide Image Patches? | What We Did | CIDEr Score | Change | |---|---|---| | Showed the full image | 0.5371 | — Baseline | | Hid 50% of image patches randomly | 0.5371 | **No change** | | Showed only the center of the image | 0.5371 | **No change** | | Compressed entire image to 1 token | 0.0008 | **−99.8%** | **Takeaway:** Half the image patches are redundant, but spatial structure is essential. ### 2. Beam Search Settings: What Produces the Best Captions? **Best configuration found:** beam_size=10, length_penalty=1.2, max_tokens=50 → **CIDEr: 0.6199** More beams and slight preference for conciseness improve caption quality by ~13%. ### 3. Caption Filtering: Does Training Data Quality Matter? | Strategy | CIDEr | |---|---| | Raw (no filtering) | **0.6359** | | Filtered (5–25 words) | 0.5877 | Raw works best for this already-clean dataset. Filtering recommended for noisier data. --- ## Project Structure ``` project_02/ ├── app.py # Streamlit web demo (3 tabs) ├── config.py # Backward-compatible config wrapper ├── data_prep.py # Dataset loading + caption filtering ├── eval.py # CIDEr evaluator + experiment runner ├── train.py # Unified training loop for all 4 models ├── requirements.txt # Python dependencies ├── input.txt # Shakespeare corpus (vocabulary source) ├── shakespeare_transformer.pt # Pre-trained Shakespeare decoder weights │ ├── configs/ # Hyperparameter configs │ ├── base_config.py # Shared defaults │ ├── blip_config.py # BLIP settings │ ├── vit_gpt2_config.py # ViT-GPT2 settings │ ├── git_config.py # GIT settings │ └── custom_vlm_config.py # Custom VLM settings │ ├── models/ # Model implementations │ ├── blip_tuner.py # BLIP (gated cross-attention) │ ├── vit_gpt2_tuner.py # ViT-GPT2 (full cross-attention) │ ├── git_tuner.py # GIT (no cross-attention) │ └── custom_vlm.py # Custom VLM (visual prefix-tuning) │ ├── experiments/ # Experiment scripts and results │ ├── ablation_study.py # Image masking experiment │ ├── parameter_sweep.py # Beam search settings sweep │ ├── data_prep_analysis.py # Caption filtering comparison │ └── cross_attention_patterns.py # Architecture comparison table │ ├── outputs/ # Saved model checkpoints │ ├── blip/{best,latest}/ │ └── custom_vlm/{best,latest}/ │ ├── detailed_technical_report_cross_attention_vlm_image_captioning.md ├── simplified_overview_vlm_image_captioning_project.md └── README.md # This file ``` --- ## Tech Stack | Component | Technology | |---|---| | Training Framework | PyTorch + HuggingFace Transformers | | Dataset | MS-COCO Captions (via HuggingFace Datasets) | | Evaluation Metric | CIDEr (via pycocoevalcap) | | Safety Filter | detoxify (toxicity detection) | | Web Demo | Streamlit | | Hardware | Apple Silicon Mac with MPS acceleration | --- ## Author **Manoj Kumar** — March 2026