---
title: VLM Caption Lab
emoji: 🖼️
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.40.0
app_file: app.py
pinned: false
---

# 🔬 VLM Caption Lab

**Compare how different Vision-Language Models look at images while writing captions: four architectures, one dataset, one evaluation metric.**

VLM Caption Lab is a complete Python toolkit for training, evaluating, and interactively comparing four fundamentally different approaches to **image captioning** (the task of generating a text description of a photograph). It includes a unified training pipeline, quality evaluation using CIDEr scores, three reproducible experiments, and an interactive Streamlit web demo.

---
## Architecture Comparison

| Architecture | How It Looks at the Image | Total Parameters | Best CIDEr Score |
|---|---|---|---|
| **BLIP** | Selective gated attention: looks at the image only when needed | 224M | **0.6199** (optimized) |
| **ViT-GPT2** | Full attention: looks at the entire image for every word | 239M | ~0.55 |
| **GIT** | Memory-based: memorizes the image first, then writes from memory | 177M | ~0.54 |
| **Custom VLM** | Built from scratch: Shakespeare decoder + visual bridge | 103M (16.2M trainable) | **0.2863** |

> **What is CIDEr?** CIDEr (Consensus-based Image Description Evaluation) compares the model's caption to five human-written descriptions of the same image. Higher is better; a score of 1.0 means perfect overlap with the human references.
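This repo computes CIDEr with `pycocoevalcap`, but the core idea can be sketched in a few lines: compare the candidate caption's n-gram count vectors against each reference by cosine similarity and average over n = 1..4. The toy version below (hypothetical helper names, not the repo's code) skips the corpus-level TF-IDF weighting that real CIDEr applies, so its numbers are only illustrative:

```python
from collections import Counter
from math import sqrt

def ngrams(caption, n):
    """Whitespace-tokenized n-gram counts of a caption."""
    toks = caption.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def cider_like(candidate, references, max_n=4):
    """Average cosine similarity between candidate and reference n-gram
    vectors for n = 1..4 (no TF-IDF weighting, unlike real CIDEr)."""
    total = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        sims = []
        for ref in references:
            r = ngrams(ref, n)
            dot = sum(cand[g] * r[g] for g in cand)
            norm = (sqrt(sum(v * v for v in cand.values()))
                    * sqrt(sum(v * v for v in r.values())))
            sims.append(dot / norm if norm else 0.0)
        total += sum(sims) / len(sims)
    return total / max_n

refs = ["a dog runs on the beach", "a brown dog running along the shore"]
print(cider_like("a dog runs on the beach", refs))  # high: matches a reference
print(cider_like("a plate of pasta", refs))         # near zero: no shared n-grams
```

A caption identical to its single reference scores exactly 1.0 here, mirroring the "perfect overlap" intuition above.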
| --- | |
| ## π Live Demo & Deployment | |
| **The easiest way to test this project is via the live web demo.** | |
| > π **[Insert Your Live Hosted Link Here]** | |
| *(If deploying yourself, see the `DEPLOYMENT_GUIDE.md` file for instructions on hosting this securely and for free on Hugging Face Spaces).* | |
| --- | |
## Quick Start (Local Run)

If you prefer to run this locally rather than using the web demo, follow these steps.

> ⚠️ **Note on weights**: You do *not* need to train the models yourself to test the app.
> - Base model weights (BLIP, ViT-GPT2) download automatically from Hugging Face on the first run.
> - The Custom VLM text-decoder weights (`shakespeare_transformer.pt`) are included in this repo.
> - **To skip training completely**, just run `streamlit run app.py`.

### Prerequisites

- Python 3.9 or newer
- macOS with Apple Silicon (MPS) or Linux with a CUDA GPU
- ~8 GB of disk space for model checkpoints
### Setup

```bash
# Clone the repository
git clone <repo-url>
cd project_02

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install all dependencies
pip install -r requirements.txt

# Verify that GPU acceleration is available
python -c "import torch; print('MPS:', torch.backends.mps.is_available()); print('CUDA:', torch.cuda.is_available())"
```
### Dependencies

| Package | What It Does |
|---|---|
| `torch` | Deep learning framework (training and inference) |
| `transformers` | Loads pre-trained BLIP, ViT-GPT2, and GIT models from Hugging Face |
| `datasets` | Downloads and loads the MS-COCO caption dataset from Hugging Face |
| `streamlit` | Interactive web demo interface |
| `pycocoevalcap` | Computes CIDEr scores (caption quality metric) |
| `detoxify` | Safety filter: checks captions for toxic or offensive content |
| `Pillow` | Image loading and processing |
| `accelerate` | Training efficiency utilities |
| --- | |
| ## π What to Expect on First Run | |
| When someone clones this repository and runs `streamlit run app.py` (or `train.py`) for the very first time, here is exactly what happens: | |
| 1. **Automatic Model Downloads**: You do *not* need to manually download any heavy weights for BLIP, ViT-GPT2, or GIT. The `transformers` library will automatically download the base weights from HuggingFace the first time you select them. | |
| 2. **Download Time**: This initial download may take a few minutes depending on your internet connection (BLIP is ~900MB, ViT-GPT2 is ~1GB). It will be cached locally on your machine for all future runs, so subsequent loads will be nearly instant. | |
| 3. **Custom VLM Weights**: The `shakespeare_transformer.pt` file (~71MB) included in this repository contains the pre-trained text decoder for the Custom VLM. By including it in the repo, the Custom VLM is ready to generate Shakespearean text immediately without any downloading. | |
| 4. **Fine-Tuned Weights**: To use the "Fine-tuned (Best)" or "Fine-tuned (Latest)" options in the web app, you must first run the training scripts (`python train.py --model [name]`). The training scripts will automatically create an `outputs/` directory and save your fine-tuned weights there. | |
| --- | |
| ## Training | |
| All four models are trained through one unified script: | |
| ```bash | |
| # Train individual models | |
| python train.py --model blip # ~1.5 hours on Apple Silicon | |
| python train.py --model vit_gpt2 # ~1 hour | |
| python train.py --model git # ~20 minutes | |
| python train.py --model custom # ~3 hours (15 epochs) | |
| ``` | |
### What happens during training

1. **Dataset loading** – downloads MS-COCO captions from Hugging Face (cached after the first download)
2. **Training** – images are processed by the vision encoder, captions by the text decoder
3. **Validation** – after each epoch, computes validation loss and CIDEr score on held-out images
4. **Checkpointing** – saves two checkpoints:
   - `outputs/{model}/best/` – the model with the **highest CIDEr score** (use this for evaluation)
   - `outputs/{model}/latest/` – the most recent epoch (use for debugging or resuming training)
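The best/latest checkpointing policy described above can be sketched as a small tracker. This is an illustrative helper, not the repo's actual `train.py` code; `write_checkpoint` stands in for whatever saving call the model uses (e.g. `save_pretrained`):

```python
import os
import shutil

class CheckpointTracker:
    """Overwrite `latest/` every epoch; mirror it to `best/` whenever
    validation CIDEr improves. (Sketch only, not the repo's code.)"""

    def __init__(self, root):
        self.root = root
        self.best_cider = float("-inf")

    def save(self, cider, write_checkpoint):
        latest = os.path.join(self.root, "latest")
        os.makedirs(latest, exist_ok=True)
        write_checkpoint(latest)          # e.g. model.save_pretrained(latest)
        if cider > self.best_cider:       # new best: copy latest -> best
            self.best_cider = cider
            best = os.path.join(self.root, "best")
            shutil.rmtree(best, ignore_errors=True)
            shutil.copytree(latest, best)
```

Keeping `best/` as a copy rather than a symlink means it survives even if a later, worse epoch overwrites `latest/`.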
### Key hyperparameters

| | BLIP | ViT-GPT2 | GIT | Custom VLM |
|---|---|---|---|---|
| Training epochs | 3 | 3 | 3 | 15 |
| Learning rate | 1e-5 | 2e-5 | 2e-5 | 1e-4 / 5e-5 |
| Batch size | 16 | 8 | 8 | 16 |
| Effective batch size | 64 | 32 | 32 | 64 |
| Training images | 30,000 | 15,000 | 15,000 | 15,000 |
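The gap between "Batch size" and "Effective batch size" usually comes from gradient accumulation: gradients from several mini-batches are summed before each optimizer step. Assuming that is how these numbers relate (the table implies 4 accumulation steps in every column), a small hypothetical helper makes the arithmetic explicit:

```python
def accumulation_steps(effective_batch, per_device_batch):
    """Forward/backward passes whose gradients are summed per optimizer step."""
    assert effective_batch % per_device_batch == 0
    return effective_batch // per_device_batch

# Values from the table above -- every model accumulates over 4 steps:
for name, (bs, eff) in {"blip": (16, 64), "vit_gpt2": (8, 32),
                        "git": (8, 32), "custom": (16, 64)}.items():
    print(f"{name}: {accumulation_steps(eff, bs)} accumulation steps")
```

This is why the larger effective batches fit on a single Apple Silicon GPU: only 8-16 images are ever in memory at once.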
| --- | |
| ## Evaluation | |
| ### Basic evaluation | |
| ```bash | |
| # Evaluate a single model (computes CIDEr score) | |
| python eval.py --model blip --weights best | |
| # Evaluate with pre-trained weights (no fine-tuning) | |
| python eval.py --model blip --weights base | |
| # Compare all models side by side | |
| python eval.py --model all --weights best | |
| ``` | |
### Experiments

```bash
# Cross-attention masking experiment: what happens when we hide parts of the image?
python eval.py --model blip --ablation --weights best

# Decoding parameter sweep: find the best beam search settings
python eval.py --model blip --sweep --weights best

# Caption filtering analysis: does training data quality matter?
python eval.py --model blip --data-prep-analysis --weights best
```
### Custom decoding settings

```bash
python eval.py --model blip --weights best \
    --num_beams 10 \
    --max_new_tokens 50 \
    --length_penalty 1.2
```
### All command-line options

| Flag | Values | Default | What It Controls |
|---|---|---|---|
| `--model` | blip, vit_gpt2, git, custom, all | blip | Which model(s) to evaluate |
| `--weights` | base, finetuned, best | base | Which checkpoint to load |
| `--eval_batches` | any integer | 25 | How many validation batches to evaluate |
| `--num_beams` | 1–10+ | 10 | Beam search width (wider = better but slower) |
| `--max_new_tokens` | 10–100 | 50 | Maximum caption length |
| `--length_penalty` | 0.5–2.0 | 1.2 | > 1.0 favors longer captions, < 1.0 favors shorter |
| `--ablation` | flag | off | Run the cross-attention masking experiment |
| `--sweep` | flag | off | Run the decoding parameter sweep |
| `--data-prep-analysis` | flag | off | Run the caption filtering comparison |
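In Hugging Face `transformers`, finished beams are ranked by the length-normalized score `sum_logprob / length ** length_penalty`; since log-probabilities are negative, a penalty above 1.0 pulls longer captions toward zero and makes them win more often. A small sketch of that ranking rule (the candidate numbers are made up for illustration):

```python
def beam_score(sum_logprob, length, length_penalty=1.0):
    """Length-normalized ranking score for a finished beam
    (the normalization Hugging Face transformers applies)."""
    return sum_logprob / (length ** length_penalty)

# Two finished candidates: log-probs are negative, so dividing by a
# larger length ** penalty pulls long captions toward zero (better).
short = (-4.0, 5)     # (sum of token log-probs, caption length)
long_ = (-7.0, 10)

for lp in (0.8, 1.0, 1.2):
    winner = max((short, long_), key=lambda c: beam_score(*c, length_penalty=lp))
    print(f"length_penalty={lp}: {'short' if winner is short else 'long'} caption wins")
```

Sweeping `lp` across the three values flips the winner from the short caption to the long one, which is the effect the `--length_penalty` flag tunes.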
| --- | |
| ## Streamlit Demo | |
| ```bash | |
| streamlit run app.py | |
| ``` | |
| The demo provides three tabs: | |
| ### πΌοΈ Caption Tab | |
| Upload any image and generate a caption. Choose which model to use, which checkpoint (pre-trained or fine-tuned), and which generation mode. | |
| ### π Compare All Models Tab | |
| Run all four architectures simultaneously on the same image. Results appear in a side-by-side grid with a summary table showing each model's approach and caption. | |
| ### π Experiment Results Tab | |
| Browse pre-computed results from all three experiments. | |
| ### Sidebar Controls | |
| - **Weight Source** β Switch between pre-trained models and your fine-tuned checkpoints | |
| - **Architecture** β Select any of the four models (each has an info card explaining its approach) | |
| - **Generation Mode** β Choose masking modes for BLIP/ViT-GPT2 or Shakespeare Prefix for Custom VLM | |
| - **Advanced Controls** β Adjust beam width, temperature, length penalty, top-k, and top-p | |
| > **Safety:** All captions pass through a toxicity filter (`detoxify`) before being displayed. | |
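The safety gate can be sketched as a thin wrapper around a scoring function. In the real app that function would be detoxify's `Detoxify("original").predict`, which returns a dict of scores including a `"toxicity"` key; the stub scorer, the 0.5 threshold, and the placeholder text below are assumptions so the sketch runs standalone:

```python
def safe_caption(caption, score_fn, threshold=0.5):
    """Suppress a caption whose toxicity score crosses the threshold.
    `score_fn` maps text -> dict of scores, the same shape as detoxify's
    `Detoxify("original").predict`. The 0.5 threshold and placeholder
    text are assumptions, not the repo's actual values."""
    scores = score_fn(caption)
    if scores.get("toxicity", 0.0) >= threshold:
        return "[caption withheld by safety filter]"
    return caption

# Stub scorer so the sketch runs without detoxify installed;
# the real app would pass Detoxify("original").predict instead.
stub = lambda text: {"toxicity": 0.9 if "awful" in text else 0.02}
print(safe_caption("a dog runs on the beach", stub))
```

Passing the scorer in as a parameter keeps the filter testable without loading the detoxify model.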
| --- | |
| ## Configuration | |
| Hyperparameters are managed through Python dataclasses in `configs/`: | |
| ``` | |
| configs/ | |
| βββ base_config.py # Shared defaults (batch size, image size, optimizer settings) | |
| βββ blip_config.py # BLIP-specific overrides | |
| βββ vit_gpt2_config.py # ViT-GPT2-specific overrides | |
| βββ git_config.py # GIT-specific overrides | |
| βββ custom_vlm_config.py # Custom VLM overrides (decoder architecture, learning rates) | |
| ``` | |
| Access any config in code: | |
| ```python | |
| from configs import get_config | |
| cfg = get_config("blip") # Returns BlipConfig instance with all settings | |
| ``` | |
| --- | |
| ## Experiments & Key Results | |
| ### 1. Cross-Attention Masking: What Happens When We Hide Image Patches? | |
| | What We Did | CIDEr Score | Change | | |
| |---|---|---| | |
| | Showed the full image | 0.5371 | β Baseline | | |
| | Hid 50% of image patches randomly | 0.5371 | **No change** | | |
| | Showed only the center of the image | 0.5371 | **No change** | | |
| | Compressed entire image to 1 token | 0.0008 | **β99.8%** | | |
| **Takeaway:** Half the image patches are redundant, but spatial structure is essential. | |
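The first two masking conditions can be sketched as boolean masks over a ViT patch grid. The 14×14 grid (196 patches, i.e. a 224×224 image with 16×16 patches) and the border width are illustrative choices, not the repo's exact values:

```python
import random

def random_mask(num_patches=196, keep_frac=0.5, seed=0):
    """Randomly keep keep_frac of the patch tokens (True = patch visible)."""
    rng = random.Random(seed)
    keep = set(rng.sample(range(num_patches), int(num_patches * keep_frac)))
    return [i in keep for i in range(num_patches)]

def center_mask(grid=14, border=4):
    """Keep only the central (grid - 2*border)^2 patches of a grid x grid layout."""
    return [border <= r < grid - border and border <= c < grid - border
            for r in range(grid) for c in range(grid)]

print(sum(random_mask()))  # 98 of 196 patches stay visible
print(sum(center_mask()))  # 36 central patches stay visible
```

Applying such a mask to the cross-attention weights hides the False patches from the decoder, which is the manipulation the CIDEr numbers above measure.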
### 2. Beam Search Settings: What Produces the Best Captions?

**Best configuration found:** beam_size=10, length_penalty=1.2, max_tokens=50 → **CIDEr: 0.6199**

More beams plus mild length normalization improve caption quality by ~13%.

### 3. Caption Filtering: Does Training Data Quality Matter?

| Strategy | CIDEr |
|---|---|
| Raw (no filtering) | **0.6359** |
| Filtered (5–25 words) | 0.5877 |

Raw data works best for this already-clean dataset; filtering is recommended for noisier data.
| --- | |
| ## Project Structure | |
| ``` | |
| project_02/ | |
| βββ app.py # Streamlit web demo (3 tabs) | |
| βββ config.py # Backward-compatible config wrapper | |
| βββ data_prep.py # Dataset loading + caption filtering | |
| βββ eval.py # CIDEr evaluator + experiment runner | |
| βββ train.py # Unified training loop for all 4 models | |
| βββ requirements.txt # Python dependencies | |
| βββ input.txt # Shakespeare corpus (vocabulary source) | |
| βββ shakespeare_transformer.pt # Pre-trained Shakespeare decoder weights | |
| β | |
| βββ configs/ # Hyperparameter configs | |
| β βββ base_config.py # Shared defaults | |
| β βββ blip_config.py # BLIP settings | |
| β βββ vit_gpt2_config.py # ViT-GPT2 settings | |
| β βββ git_config.py # GIT settings | |
| β βββ custom_vlm_config.py # Custom VLM settings | |
| β | |
| βββ models/ # Model implementations | |
| β βββ blip_tuner.py # BLIP (gated cross-attention) | |
| β βββ vit_gpt2_tuner.py # ViT-GPT2 (full cross-attention) | |
| β βββ git_tuner.py # GIT (no cross-attention) | |
| β βββ custom_vlm.py # Custom VLM (visual prefix-tuning) | |
| β | |
| βββ experiments/ # Experiment scripts and results | |
| β βββ ablation_study.py # Image masking experiment | |
| β βββ parameter_sweep.py # Beam search settings sweep | |
| β βββ data_prep_analysis.py # Caption filtering comparison | |
| β βββ cross_attention_patterns.py # Architecture comparison table | |
| β | |
| βββ outputs/ # Saved model checkpoints | |
| β βββ blip/{best,latest}/ | |
| β βββ custom_vlm/{best,latest}/ | |
| β | |
| βββ detailed_technical_report_cross_attention_vlm_image_captioning.md | |
| βββ simplified_overview_vlm_image_captioning_project.md | |
| βββ README.md # This file | |
| ``` | |
| --- | |
| ## Tech Stack | |
| | Component | Technology | | |
| |---|---| | |
| | Training Framework | PyTorch + HuggingFace Transformers | | |
| | Dataset | MS-COCO Captions (via HuggingFace Datasets) | | |
| | Evaluation Metric | CIDEr (via pycocoevalcap) | | |
| | Safety Filter | detoxify (toxicity detection) | | |
| | Web Demo | Streamlit | | |
| | Hardware | Apple Silicon Mac with MPS acceleration | | |
| --- | |
| ## Author | |
| **Manoj Kumar** β March 2026 | |