---
title: VLM Caption Lab
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---
🔬 VLM Caption Lab
Compare how different Vision-Language Models look at images while writing captions: four architectures, one dataset, one evaluation metric.
VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to image captioning (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.
Architecture Comparison
| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| BLIP | Selective gated attention – looks at the image only when needed | 224M | 0.6199 (optimized) |
| ViT-GPT2 | Full attention – looks at the entire image for every word | 239M | ~0.55 |
| GIT | Memory-based – memorizes the image first, writes from memory | 177M | ~0.54 |
| Custom VLM | Built from scratch – Shakespeare decoder + visual bridge | 103M (16.2M trainable) | 0.2863 |
What is CIDEr? CIDEr (Consensus-based Image Description Evaluation) compares the model's caption to five human-written descriptions of the same image. Higher = better. A score of 1.0 means perfect overlap with human references.
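The idea behind the metric can be sketched in plain Python. This is a deliberately simplified, illustrative score (plain n-gram cosine overlap averaged over n = 1..4, without the TF-IDF weighting and length penalty that the real `pycocoevalcap` implementation applies):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Counts of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def consensus_score(candidate, references, max_n=4):
    """Toy CIDEr-style score: cosine similarity between candidate and
    reference n-gram count vectors, averaged over n = 1..max_n and over
    all references. Not the real metric: CIDEr also TF-IDF-weights each
    n-gram using statistics of the whole reference corpus."""
    cand_tokens = candidate.lower().split()
    total = 0.0
    for n in range(1, max_n + 1):
        c = ngrams(cand_tokens, n)
        for ref in references:
            r = ngrams(ref.lower().split(), n)
            dot = sum(c[k] * r[k] for k in c if k in r)
            norm = math.sqrt(sum(v * v for v in c.values())) * \
                   math.sqrt(sum(v * v for v in r.values()))
            total += dot / norm if norm else 0.0
    return total / (max_n * len(references))

refs = ["a dog runs on the beach", "a brown dog running along the sand"]
score = consensus_score("a dog runs on the beach", refs)
```

A caption identical to a reference scores 1.0 against that reference, and anything with zero word overlap scores 0.0, which mirrors how CIDEr rewards consensus with human descriptions.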
🚀 Live Demo & Deployment
The easiest way to test this project is via the live web demo.
🔗 [Insert Your Live Hosted Link Here]
(If deploying yourself, see the DEPLOYMENT_GUIDE.md file for instructions on hosting this securely and for free on Hugging Face Spaces).
Quick Start (Local Run)
If you prefer to run this locally rather than using the web demo, follow these steps.
⚠️ Note on Weights: You do not need to train the models yourself to test the app.
- Base model weights (BLIP, ViT-GPT2) will download automatically from Hugging Face on the first run.
- The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
- To skip training completely, you only need to run `streamlit run app.py`!
Prerequisites
- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB disk space for model checkpoints
Setup
```bash
# Clone the repository
git clone <repo-url>
cd project_02

# Create a virtual environment
python -m venv venv
source venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```
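The same availability checks are typically used in code to pick a device. A minimal sketch (the project's actual device handling may differ):

```python
import torch

# Prefer Apple's MPS backend, then CUDA, then fall back to CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")
```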
Dependencies
| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Load pre-trained BLIP, ViT-GPT2, and GIT models from HuggingFace |
| `datasets` | Download and load the MS-COCO caption dataset from HuggingFace |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Compute CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter – checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |
📦 What to Expect on First Run
When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens:

- Automatic Model Downloads: You do not need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them.
- Download Time: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). The weights are cached locally on your machine, so subsequent loads will be nearly instant.
- Custom VLM Weights: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. Because it ships with the repo, the Custom VLM is ready to generate Shakespearean text immediately, without any downloading.
- Fine-Tuned Weights: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there.
Training
All four models are trained through one unified script:
```bash
# Train individual models
python train.py --model blip      # ~1.5 hours on Apple Silicon
python train.py --model vit_gpt2  # ~1 hour
python train.py --model git       # ~20 minutes
python train.py --model custom    # ~3 hours (15 epochs)
```
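The unified entry point boils down to a single `--model` switch. A sketch of the CLI with `argparse` (illustrative, not the project's exact `train.py`):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Unified VLM captioning trainer")
    parser.add_argument(
        "--model",
        choices=["blip", "vit_gpt2", "git", "custom"],
        default="blip",
        help="Which architecture to fine-tune",
    )
    return parser

args = build_parser().parse_args(["--model", "git"])  # e.g. train GIT
```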
What happens during training
- Dataset loading → Downloads MS-COCO captions from HuggingFace (cached after the first download)
- Training → Images are processed by the vision encoder, captions by the text decoder
- Validation → After each epoch, computes validation loss + CIDEr score on held-out images
- Checkpointing → Saves two checkpoints:
  - `outputs/{model}/best/` → the model with the highest CIDEr score (use this for evaluation)
  - `outputs/{model}/latest/` → the most recent epoch (use for debugging or continuing training)
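The two-checkpoint scheme can be sketched as follows, with `save_checkpoint` standing in for whatever the real loop uses to write model state:

```python
def update_checkpoints(cider, best_cider, save_checkpoint):
    """Always refresh latest/; refresh best/ only on a new best CIDEr."""
    save_checkpoint("latest")       # most recent epoch: debugging / resuming
    if cider > best_cider:
        save_checkpoint("best")     # highest validation CIDEr: use for eval
        best_cider = cider
    return best_cider

saved, best = [], 0.0
for epoch_cider in [0.41, 0.55, 0.52]:  # per-epoch validation CIDEr (made up)
    best = update_checkpoints(epoch_cider, best, saved.append)
```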
Key hyperparameters
| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|---|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |
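The gap between batch size and effective batch size implies gradient accumulation (for BLIP, 64 / 16 = 4 micro-batches per optimizer step). A minimal PyTorch sketch of the pattern, assuming a standard training loop:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
accum_steps = 64 // 16  # effective batch of 64 from micro-batches of 16

optimizer_steps = 0
for step in range(8):  # 8 micro-batches -> 2 optimizer updates
    x = torch.randn(16, 4)
    loss = model(x).pow(2).mean() / accum_steps  # scale so gradients average
    loss.backward()                              # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```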
Evaluation
Basic evaluation
```bash
# Evaluate a single model (computes CIDEr score)
python eval.py --model blip --weights best

# Evaluate with pre-trained weights (no fine-tuning)
python eval.py --model blip --weights base

# Compare all models side by side
python eval.py --model all --weights best
```
Experiments
```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best

# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best

# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```
Custom decoding settings
```bash
python eval.py --model blip --weights best \
    --num_beams 10 \
    --max_new_tokens 50 \
    --length_penalty 1.2
```
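These flags correspond to standard HuggingFace `generate()` keyword arguments. A sketch of the assumed mapping (`early_stopping` is an extra assumption, not listed in the CLI table below):

```python
def generation_kwargs(num_beams=10, max_new_tokens=50, length_penalty=1.2):
    """Translate the eval.py decoding flags into generate() keyword arguments."""
    return {
        "num_beams": num_beams,            # beam search width
        "max_new_tokens": max_new_tokens,  # cap on generated caption length
        "length_penalty": length_penalty,  # biases beam scores by sequence length
        "early_stopping": True,            # assumption: stop when all beams finish
    }

kwargs = generation_kwargs(num_beams=10, length_penalty=1.2)
# later: model.generate(**inputs, **kwargs)
```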
All command-line options
| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (more = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | < 1.0 = longer captions, > 1.0 = shorter |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
Streamlit Demo
```bash
streamlit run app.py
```
The demo provides three tabs:
🖼️ Caption Tab
Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode.
🔄 Compare All Models Tab
Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption.
📊 Experiment Results Tab
Browse pre-computed results from all three experiments.
Sidebar Controls
- Weight Source – Switch between pre-trained models and your fine-tuned checkpoints
- Architecture – Select any of the four models (each has an info card explaining its approach)
- Generation Mode – Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for Custom VLM
- Advanced Controls – Adjust beam width, temperature, length penalty, top-k, and top-p
Safety: All captions pass through a toxicity filter (`detoxify`) before being displayed.
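The filtering step can be sketched as below, with `predict_toxicity` standing in for the Detoxify call (which returns per-category scores between 0 and 1); the 0.5 threshold and the placeholder message are illustrative assumptions:

```python
def filter_caption(caption, predict_toxicity, threshold=0.5):
    """Return the caption only if every toxicity score stays below threshold."""
    scores = predict_toxicity(caption)  # e.g. {"toxicity": 0.01, "insult": 0.0}
    if max(scores.values()) >= threshold:
        return "[caption withheld by safety filter]"
    return caption

# Stub standing in for Detoxify("original").predict:
safe = filter_caption("a dog on a beach", lambda c: {"toxicity": 0.01})
```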
Configuration
Hyperparameters are managed through Python dataclasses in `configs/`:

```
configs/
├── base_config.py        # Shared defaults (batch size, image size, optimizer settings)
├── blip_config.py        # BLIP-specific overrides
├── vit_gpt2_config.py    # ViT-GPT2-specific overrides
├── git_config.py         # GIT-specific overrides
└── custom_vlm_config.py  # Custom VLM overrides (decoder architecture, learning rates)
```
Access any config in code:

```python
from configs import get_config

cfg = get_config("blip")  # Returns a BlipConfig instance with all settings
```
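A minimal sketch of the dataclass-override pattern (the field values come from the hyperparameter table above, but field names beyond those are illustrative; see `configs/` for the real definitions):

```python
from dataclasses import dataclass

@dataclass
class BaseConfig:
    batch_size: int = 16
    image_size: int = 224
    learning_rate: float = 2e-5

@dataclass
class BlipConfig(BaseConfig):
    learning_rate: float = 1e-5  # BLIP fine-tunes with a lower learning rate

_REGISTRY = {"blip": BlipConfig}

def get_config(name: str) -> BaseConfig:
    """Look up a model name and return a fresh config instance."""
    return _REGISTRY[name]()
```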
Experiments & Key Results
1. Cross-Attention Masking: What Happens When We Hide Image Patches?
| What We Did | CIDEr Score | Change |
|---|---|---|
| Showed the full image | 0.5371 | ← Baseline |
| Hid 50% of image patches randomly | 0.5371 | No change |
| Showed only the center of the image | 0.5371 | No change |
| Compressed entire image to 1 token | 0.0008 | −99.8% |
Takeaway: Half the image patches are redundant, but spatial structure is essential.
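The masking conditions can be sketched as operations on an attention mask over the encoder's image patches (a simplified sketch; the real experiment manipulates the model's cross-attention inputs):

```python
import random

def random_patch_mask(num_patches, keep_fraction=0.5, seed=0):
    """1 = the decoder may attend to this patch, 0 = patch hidden."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(num_patches), int(num_patches * keep_fraction)))
    return [1 if i in kept else 0 for i in range(num_patches)]

def center_only_mask(grid=14, border=4):
    """Keep only the central (grid - 2*border)^2 patches of a grid x grid layout."""
    return [
        1 if border <= row < grid - border and border <= col < grid - border else 0
        for row in range(grid)
        for col in range(grid)
    ]

mask = random_patch_mask(196)  # ViT-style 14x14 = 196 patch tokens
```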
2. Beam Search Settings: What Produces the Best Captions?
Best configuration found: `beam_size=10, length_penalty=1.2, max_tokens=50` → CIDEr: 0.6199
More beams and a slight preference for conciseness improve caption quality by ~13%.
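A sweep like this is just an exhaustive grid over decoding settings. A sketch of the search loop, with `evaluate` standing in for a function that runs generation plus CIDEr scoring (the grid values are illustrative):

```python
from itertools import product

def sweep(evaluate):
    """Try every combination; return (best_settings, best_score)."""
    grid = {
        "num_beams": [1, 5, 10],
        "length_penalty": [0.8, 1.0, 1.2],
        "max_new_tokens": [30, 50],
    }
    best_settings, best_score = None, float("-inf")
    for values in product(*grid.values()):
        settings = dict(zip(grid.keys(), values))
        score = evaluate(**settings)  # would run captioning + CIDEr here
        if score > best_score:
            best_settings, best_score = settings, score
    return best_settings, best_score
```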
3. Caption Filtering: Does Training Data Quality Matter?
| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | 0.6359 |
| Filtered (5–25 words) | 0.5877 |
Raw works best for this already-clean dataset. Filtering recommended for noisier data.
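The 5–25-word filter itself is a one-liner; a sketch of the rule described above:

```python
def keep_caption(caption: str, min_words: int = 5, max_words: int = 25) -> bool:
    """Keep only captions whose word count falls within [min_words, max_words]."""
    return min_words <= len(caption.split()) <= max_words

captions = ["a dog", "a brown dog running along a sandy beach at sunset"]
filtered = [c for c in captions if keep_caption(c)]
```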
Project Structure
```
project_02/
├── app.py                      # Streamlit web demo (3 tabs)
├── config.py                   # Backward-compatible config wrapper
├── data_prep.py                # Dataset loading + caption filtering
├── eval.py                     # CIDEr evaluator + experiment runner
├── train.py                    # Unified training loop for all 4 models
├── requirements.txt            # Python dependencies
├── input.txt                   # Shakespeare corpus (vocabulary source)
├── shakespeare_transformer.pt  # Pre-trained Shakespeare decoder weights
│
├── configs/                    # Hyperparameter configs
│   ├── base_config.py          # Shared defaults
│   ├── blip_config.py          # BLIP settings
│   ├── vit_gpt2_config.py      # ViT-GPT2 settings
│   ├── git_config.py           # GIT settings
│   └── custom_vlm_config.py    # Custom VLM settings
│
├── models/                     # Model implementations
│   ├── blip_tuner.py           # BLIP (gated cross-attention)
│   ├── vit_gpt2_tuner.py       # ViT-GPT2 (full cross-attention)
│   ├── git_tuner.py            # GIT (no cross-attention)
│   └── custom_vlm.py           # Custom VLM (visual prefix-tuning)
│
├── experiments/                # Experiment scripts and results
│   ├── ablation_study.py       # Image masking experiment
│   ├── parameter_sweep.py      # Beam search settings sweep
│   ├── data_prep_analysis.py   # Caption filtering comparison
│   └── cross_attention_patterns.py  # Architecture comparison table
│
├── outputs/                    # Saved model checkpoints
│   ├── blip/{best,latest}/
│   └── custom_vlm/{best,latest}/
│
├── detailed_technical_report_cross_attention_vlm_image_captioning.md
├── simplified_overview_vlm_image_captioning_project.md
└── README.md                   # This file
```
Tech Stack
| Component | Technology |
|---|---|
| Training Framework | PyTorch + HuggingFace Transformers |
| Dataset | MS-COCO Captions (via HuggingFace Datasets) |
| Evaluation Metric | CIDEr (via pycocoevalcap) |
| Safety Filter | detoxify (toxicity detection) |
| Web Demo | Streamlit |
| Hardware | Apple Silicon Mac with MPS acceleration |
Author
Manoj Kumar – March 2026