---
title: VLM Caption Lab
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---

# 🔬 VLM Caption Lab

**Compare how different Vision-Language Models look at images while writing captions: four architectures, one dataset, one evaluation metric.**

VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to **image captioning** (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.

---
## Architecture Comparison

| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| **BLIP** | Selective gated attention: looks at the image only when needed | 224M | **0.6199** (optimized) |
| **ViT-GPT2** | Full attention: looks at the entire image for every word | 239M | ~0.55 |
| **GIT** | Memory-based: memorizes the image first, then writes from memory | 177M | ~0.54 |
| **Custom VLM** | Built from scratch: Shakespeare decoder + visual bridge | 103M (16.2M trainable) | **0.2863** |

> **What is CIDEr?** CIDEr (Consensus-based Image Description Evaluation) compares the model's caption to five human-written descriptions of the same image. Higher is better; a score of 1.0 means perfect overlap with the human references.
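This repo computes CIDEr with `pycocoevalcap`, but the core idea can be sketched in a few lines: compare the candidate caption's n-gram count vectors against each reference by cosine similarity and average over n = 1..4. The toy version below (hypothetical helper names, not the repo's code) skips the corpus-level TF-IDF weighting that real CIDEr applies, so its numbers are only illustrative:

```python
from collections import Counter
from math import sqrt

def ngrams(caption, n):
    """Whitespace-tokenized n-gram counts of a caption."""
    toks = caption.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def cider_like(candidate, references, max_n=4):
    """Average cosine similarity between candidate and reference n-gram
    vectors for n = 1..4 (no TF-IDF weighting, unlike real CIDEr)."""
    total = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        sims = []
        for ref in references:
            r = ngrams(ref, n)
            dot = sum(cand[g] * r[g] for g in cand)
            norm = (sqrt(sum(v * v for v in cand.values()))
                    * sqrt(sum(v * v for v in r.values())))
            sims.append(dot / norm if norm else 0.0)
        total += sum(sims) / len(sims)
    return total / max_n

refs = ["a dog runs on the beach", "a brown dog running along the shore"]
print(cider_like("a dog runs on the beach", refs))  # high: matches a reference
print(cider_like("a plate of pasta", refs))         # near zero: no shared n-grams
```

A caption identical to its single reference scores exactly 1.0 here, mirroring the "perfect overlap" intuition above.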
| --- | |
| ## π Live Demo & Deployment | |
| **The easiest way to test this project is via the live web demo.** | |
| > π **[Insert Your Live Hosted Link Here]** | |
| *(If deploying yourself, see the `DEPLOYMENT_GUIDE.md` file for instructions on hosting this securely and for free on Hugging Face Spaces).* | |
| --- | |
## Quick Start (Local Run)

If you prefer to run this locally rather than using the web demo, follow these steps.

> ⚠️ **Note on weights**: You do *not* need to train the models yourself to test the app.
> - Base model weights (BLIP, ViT-GPT2) download automatically from Hugging Face on the first run.
> - The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
> - **To skip training completely**, just run `streamlit run app.py`.

### Prerequisites

- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB of disk space for model checkpoints
### Setup

```bash
# Clone the repository
git clone <repo-url>
cd project_02

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```
### Dependencies

| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Loads pre-trained BLIP, ViT-GPT2, and GIT models from Hugging Face |
| `datasets` | Downloads and loads the MS-COCO caption dataset from Hugging Face |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Computes CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter: checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |
| --- | |
| ## π What to Expect on First Run | |
| When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens: | |
| 1. **Automatic Model Downloads**: You do *not* need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them. | |
| 2. **Download Time**: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). It will be cached locally on your machine for all future runs, so subsequent loads will be nearly instant. | |
| 3. **Custom VLM Weights**: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. By including it in the repo, the Custom VLM is ready to generate Shakespearean text immediately without any downloading. | |
| 4. **Fine-Tuned Weights**: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there. | |
| --- | |
| ## Training | |
| All four models are trained through one unified script: | |
| ```bash | |
| # Train individual models | |
| python train.py --model blip # ~1.5 hours on Apple Silicon | |
| python train.py --model vit_gpt2 # ~1 hour | |
| python train.py --model git # ~20 minutes | |
| python train.py --model custom # ~3 hours (15 epochs) | |
| ``` | |
### What happens during training

1. **Dataset loading** – downloads MS-COCO captions from Hugging Face (cached after the first download)
2. **Training** – images are processed by the vision encoder, captions by the text decoder
3. **Validation** – after each epoch, computes validation loss and CIDEr score on held-out images
4. **Checkpointing** – saves two checkpoints:
   - `outputs/{model}/best/` – the model with the **highest CIDEr score** (use this for evaluation)
   - `outputs/{model}/latest/` – the most recent epoch (use for debugging or resuming training)
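The best/latest checkpointing policy described above can be sketched as a small tracker. This is an illustrative helper, not the repo's actual `train.py` code; `write_checkpoint` stands in for whatever saving call the model uses (e.g. `save_pretrained`):

```python
import os
import shutil

class CheckpointTracker:
    """Overwrite `latest/` every epoch; mirror it to `best/` whenever
    validation CIDEr improves. (Sketch only, not the repo's code.)"""

    def __init__(self, root):
        self.root = root
        self.best_cider = float("-inf")

    def save(self, cider, write_checkpoint):
        latest = os.path.join(self.root, "latest")
        os.makedirs(latest, exist_ok=True)
        write_checkpoint(latest)          # e.g. model.save_pretrained(latest)
        if cider > self.best_cider:       # new best: copy latest -> best
            self.best_cider = cider
            best = os.path.join(self.root, "best")
            shutil.rmtree(best, ignore_errors=True)
            shutil.copytree(latest, best)
```

Keeping `best/` as a copy rather than a symlink means it survives even if a later, worse epoch overwrites `latest/`.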
### Key hyperparameters

| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|---|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |
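The gap between "Batch size" and "Effective batch size" usually comes from gradient accumulation: gradients from several mini-batches are summed before each optimizer step. Assuming that is how these numbers relate (the table implies 4 accumulation steps in every column), a small hypothetical helper makes the arithmetic explicit:

```python
def accumulation_steps(effective_batch, per_device_batch):
    """Forward/backward passes whose gradients are summed per optimizer step."""
    assert effective_batch % per_device_batch == 0
    return effective_batch // per_device_batch

# Values from the table above -- every model accumulates over 4 steps:
for name, (bs, eff) in {"blip": (16, 64), "vit_gpt2": (8, 32),
                        "git": (8, 32), "custom": (16, 64)}.items():
    print(f"{name}: {accumulation_steps(eff, bs)} accumulation steps")
```

This is why the larger effective batches fit on a single Apple Silicon GPU: only 8-16 images are ever in memory at once.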
| --- | |
| ## Evaluation | |
| ### Basic evaluation | |
| ```bash | |
| # Evaluate a single model (computes CIDEr score) | |
| python eval.py --model blip --weights best | |
| # Evaluate with pre-trained weights (no fine-tuning) | |
| python eval.py --model blip --weights base | |
| # Compare all models side by side | |
| python eval.py --model all --weights best | |
| ``` | |
### Experiments

```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best

# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best

# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```
### Custom decoding settings

```bash
python eval.py --model blip --weights best \
    --num_beams 10 \
    --max_new_tokens 50 \
    --length_penalty 1.2
```
### All command-line options

| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (wider = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | > 1.0 favors longer captions, < 1.0 favors shorter |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
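In Hugging Face `transformers`, finished beams are ranked by the length-normalized score `sum_logprob / length ** length_penalty`; since log-probabilities are negative, a penalty above 1.0 pulls longer captions toward zero and makes them win more often. A small sketch of that ranking rule (the candidate numbers are made up for illustration):

```python
def beam_score(sum_logprob, length, length_penalty=1.0):
    """Length-normalized ranking score for a finished beam
    (the normalization Hugging Face transformers applies)."""
    return sum_logprob / (length ** length_penalty)

# Two finished candidates: log-probs are negative, so dividing by a
# larger length ** penalty pulls long captions toward zero (better).
short = (-4.0, 5)     # (sum of token log-probs, caption length)
long_ = (-7.0, 10)

for lp in (0.8, 1.0, 1.2):
    winner = max((short, long_), key=lambda c: beam_score(*c, length_penalty=lp))
    print(f"length_penalty={lp}: {'short' if winner is short else 'long'} caption wins")
```

Sweeping `lp` across the three values flips the winner from the short caption to the long one, which is the effect the `--length_penalty` flag tunes.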
| --- | |
| ## Streamlit Demo | |
| ```bash | |
| streamlit run app.py | |
| ``` | |
| The demo provides three tabs: | |
| ### πΌοΈ Caption Tab | |
| Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode. | |
| ### π Compare All Models Tab | |
| Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption. | |
| ### π Experiment Results Tab | |
| Browse pre-computed results from all three experiments. | |
| ### Sidebar Controls | |
| - **Weight Source** β Switch between pre-trained models and your fine-tuned checkpoints | |
| - **Architecture** β Select any of the four models (each has an info card explaining its approach) | |
| - **Generation Mode** β Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for Custom VLM | |
| - **Advanced Controls** β Adjust beam width, temperature, length penalty, top-k, and top-p | |
| > **Safety:** All captions pass through a toxicity filter (`detoxify`) before being displayed. | |
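The safety gate can be sketched as a thin wrapper around a scoring function. In the real app that function would be detoxify's `Detoxify("original").predict`, which returns a dict of scores including a `"toxicity"` key; the stub scorer, the 0.5 threshold, and the placeholder text below are assumptions so the sketch runs standalone:

```python
def safe_caption(caption, score_fn, threshold=0.5):
    """Suppress a caption whose toxicity score crosses the threshold.
    `score_fn` maps text -> dict of scores, the same shape as detoxify's
    `Detoxify("original").predict`. The 0.5 threshold and placeholder
    text are assumptions, not the repo's actual values."""
    scores = score_fn(caption)
    if scores.get("toxicity", 0.0) >= threshold:
        return "[caption withheld by safety filter]"
    return caption

# Stub scorer so the sketch runs without detoxify installed;
# the real app would pass Detoxify("original").predict instead.
stub = lambda text: {"toxicity": 0.9 if "awful" in text else 0.02}
print(safe_caption("a dog runs on the beach", stub))
```

Passing the scorer in as a parameter keeps the filter testable without loading the detoxify model.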
| --- | |
| ## Configuration | |
| Hyperparameters are managed through Python dataclasses in `configs/`: | |
| ``` | |
| configs/ | |
| βββ base_config.py # Shared defaults (batch size, image size, optimizer settings) | |
| βββ blip_config.py # BLIP-specific overrides | |
| βββ vit_gpt2_config.py # ViT-GPT2-specific overrides | |
| βββ git_config.py # GIT-specific overrides | |
| βββ custom_vlm_config.py # Custom VLM overrides (decoder architecture, learning rates) | |
| ``` | |
| Access any config in code: | |
| ```python | |
| from configs import get_config | |
| cfg = get_config("blip") # Returns BlipConfig instance with all settings | |
| ``` | |
| --- | |
| ## Experiments & Key Results | |
| ### 1. Cross-Attention Masking: What Happens When We Hide Image Patches? | |
| | What We Did | CIDEr Score | Change | | |
| |---|---|---| | |
| | Showed the full image | 0.5371 | β Baseline | | |
| | Hid 50% of image patches randomly | 0.5371 | **No change** | | |
| | Showed only the center of the image | 0.5371 | **No change** | | |
| | Compressed entire image to 1 token | 0.0008 | **β99.8%** | | |
| **Takeaway:** Half the image patches are redundant, but spatial structure is essential. | |
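The first two masking conditions can be sketched as boolean masks over a ViT patch grid. The 14×14 grid (196 patches, i.e. a 224×224 image with 16×16 patches) and the border width are illustrative choices, not the repo's exact values:

```python
import random

def random_mask(num_patches=196, keep_frac=0.5, seed=0):
    """Randomly keep keep_frac of the patch tokens (True = patch visible)."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(num_patches), int(num_patches * keep_frac)))
    return [i in keep for i in range(num_patches)]

def center_mask(grid=14, border=4):
    """Keep only the central (grid - 2*border)^2 patches of a grid x grid layout."""
    return [border <= r < grid - border and border <= c < grid - border
            for r in range(grid) for c in range(grid)]

print(sum(random_mask()))  # 98 of 196 patches stay visible
print(sum(center_mask()))  # 36 central patches stay visible
```

Applying such a mask to the cross-attention weights hides the False patches from the decoder, which is the manipulation the CIDEr numbers above measure.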
### 2. Beam Search Settings: What Produces the Best Captions?

**Best configuration found:** beam_size=10, length_penalty=1.2, max_tokens=50 → **CIDEr: 0.6199**

More beams plus mild length normalization improve caption quality by ~13%.

### 3. Caption Filtering: Does Training Data Quality Matter?

| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | **0.6359** |
| Filtered (5–25 words) | 0.5877 |

Raw data works best for this already-clean dataset; filtering is recommended for noisier data.
| --- | |
| ## Project Structure | |
| ``` | |
| project_02/ | |
| βββ app.py # Streamlit web demo (3 tabs) | |
| βββ config.py # Backward-compatible config wrapper | |
| βββ data_prep.py # Dataset loading + caption filtering | |
| βββ eval.py # CIDEr evaluator + experiment runner | |
| βββ train.py # Unified training loop for all 4 models | |
| βββ requirements.txt # Python dependencies | |
| βββ input.txt # Shakespeare corpus (vocabulary source) | |
| βββ shakespeare_transformer.pt # Pre-trained Shakespeare decoder weights | |
| β | |
| βββ configs/ # Hyperparameter configs | |
| β βββ base_config.py # Shared defaults | |
| β βββ blip_config.py # BLIP settings | |
| β βββ vit_gpt2_config.py # ViT-GPT2 settings | |
| β βββ git_config.py # GIT settings | |
| β βββ custom_vlm_config.py # Custom VLM settings | |
| β | |
| βββ models/ # Model implementations | |
| β βββ blip_tuner.py # BLIP (gated cross-attention) | |
| β βββ vit_gpt2_tuner.py # ViT-GPT2 (full cross-attention) | |
| β βββ git_tuner.py # GIT (no cross-attention) | |
| β βββ custom_vlm.py # Custom VLM (visual prefix-tuning) | |
| β | |
| βββ experiments/ # Experiment scripts and results | |
| β βββ ablation_study.py # Image masking experiment | |
| β βββ parameter_sweep.py # Beam search settings sweep | |
| β βββ data_prep_analysis.py # Caption filtering comparison | |
| β βββ cross_attention_patterns.py # Architecture comparison table | |
| β | |
| βββ outputs/ # Saved model checkpoints | |
| β βββ blip/{best,latest}/ | |
| β βββ custom_vlm/{best,latest}/ | |
| β | |
| βββ detailed_technical_report_cross_attention_vlm_image_captioning.md | |
| βββ simplified_overview_vlm_image_captioning_project.md | |
| βββ README.md # This file | |
| ``` | |
| --- | |
| ## Tech Stack | |
| | Component | Technology | | |
| |---|---| | |
| | Training Framework | PyTorch + HuggingFace Transformers | | |
| | Dataset | MS-COCO Captions (via HuggingFace Datasets) | | |
| | Evaluation Metric | CIDEr (via pycocoevalcap) | | |
| | Safety Filter | detoxify (toxicity detection) | | |
| | Web Demo | Streamlit | | |
| | Hardware | Apple Silicon Mac with MPS acceleration | | |
| --- | |
| ## Author | |
| **Manoj Kumar** β March 2026 | |