---
title: Image Captioning
emoji: 🖼️
colorFrom: indigo
colorTo: pink
sdk: streamlit
python_version: "3.10"
app_file: app.py
pinned: false
---
# Image Captioning (Streamlit)
This repo hosts a Streamlit app (`app.py`) that compares multiple image-captioning models.
## Why your models should NOT be inside the app repo
Fine-tuned checkpoints are large. Public hosting (Hugging Face Spaces / Streamlit Cloud) works best when:
- the app repo stays small
- models live on the Hugging Face Hub (or S3/GCS)
- the app downloads models at startup (cached by `transformers`)
## 1) Upload your saved models to Hugging Face Hub
Example for BLIP (you already have `uploadtohf.py`):
```bash
pip install -U transformers huggingface_hub
huggingface-cli login
python uploadtohf.py
```
Do the same for your other local folders (`saved_vit_gpt2`, `saved_git_model`) by pushing them to separate Hub repos.
## 2) Configure the app to load from Hub
`app.py` loads **local folders if present**, otherwise it falls back to Hub repo IDs set via environment variables. In this repo the defaults are:
- `BLIP_MODEL_ID` (default: `pchandragrid/blip-caption-model`)
- `VITGPT2_MODEL_ID` (default: `pchandragrid/vit-gpt2-caption-model`)
- `GIT_MODEL_ID` (default: `pchandragrid/git-caption-model`)
You can also override local folder names:
- `BLIP_LOCAL_DIR` (default: `saved_model_phase2`)
- `VITGPT2_LOCAL_DIR` (default: `saved_vit_gpt2`)
- `GIT_LOCAL_DIR` (default: `saved_git_model`)
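The local-folder-first fallback described above can be sketched as follows (the folder and variable names match the defaults listed; the exact logic inside `app.py` may differ):

```python
import os

def resolve_model_source(local_dir: str, env_var: str, default_repo: str) -> str:
    """Prefer a local checkpoint folder if it exists; otherwise use the
    Hub repo ID from the environment variable, falling back to a default."""
    if os.path.isdir(local_dir):
        return local_dir
    return os.environ.get(env_var, default_repo)

# Example: resolve the BLIP checkpoint source.
blip_source = resolve_model_source(
    "saved_model_phase2", "BLIP_MODEL_ID", "pchandragrid/blip-caption-model"
)
```

The returned string can be passed straight to `from_pretrained`, which accepts both local paths and Hub IDs.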
## 3) Deploy options
### Option A: Hugging Face Spaces (recommended)
- Create a new Space and choose the **Streamlit** SDK
- Push this repo (must include `app.py` + `requirements.txt`)
- In Space “Variables”, set `BLIP_MODEL_ID`, `VITGPT2_MODEL_ID`, `GIT_MODEL_ID` to your Hub repos
- If any model repo is private, add `HF_TOKEN` as a Space **Secret**
### Option B: Streamlit Community Cloud
- Point it to this repo
- Set the same env vars in the app settings
## Local run
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
# 🖼️ Image Captioning with BLIP (COCO Subset)
## 📌 Problem
Generate natural language descriptions for images using transformer-based vision-language models.
Goals:
- Improve CIDEr score by 10%+
- Compare architectures (BLIP vs ViT-GPT2)
- Analyze resolution impact (224 vs 320 vs 384)
- Optimize decoding parameters
- Deploy minimal inference UI
---
## 📂 Dataset
- MS COCO Captions (subset: 10k & 20k)
- Random caption selection (5 captions per image)
- Experiments:
- Short captions
- Mixed captions
- Filtered captions
Train/Validation split: 90/10
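The caption sampling and 90/10 split could look like this (a stdlib-only sketch; the actual preprocessing code is not shown in this README):

```python
import random

def pick_caption(captions):
    """Select one of an image's (typically 5) COCO reference captions at random."""
    return random.choice(captions)

def split_dataset(items, val_frac=0.1, seed=42):
    """Shuffle and split into train/validation (90/10 by default)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_val = int(len(items) * val_frac)
    return items[n_val:], items[:n_val]
```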
---
## 🧠 Models
### 1️⃣ BLIP (Primary Model)
- Salesforce/blip-image-captioning-base
- Vision encoder frozen (for efficiency)
- Gradient checkpointing enabled
- Mixed precision on MPS
### 2️⃣ ViT-GPT2 (Comparison)
- ViT base encoder
- GPT2 decoder with cross-attention
---
## 🧪 Experiments
### Resolution Comparison
| Resolution | Dataset | CIDEr |
|------------|---------|--------|
| 224px | 10k | ~1.28 |
| 320px | 20k | ~1.33–1.38 |
| 384px | 20k | ~1.40+ |
### Beam Search Tuning
Tested:
- Beams: 3, 5, 8
- Length penalty: 0.8, 1.0, 1.2
- Max length: 20, 30, 40
Best config:
Beams=5, MaxLen=20, LengthPenalty=1.0
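With `transformers`, the winning configuration maps onto standard `generate()` keyword arguments roughly as follows (how the app actually passes them is an assumption):

```python
# Best decoding settings from the sweep above, as generate() kwargs.
BEST_GEN_KWARGS = {
    "num_beams": 5,        # beam width
    "max_length": 20,      # cap on generated tokens
    "length_penalty": 1.0  # neutral length preference
}
# Hypothetical usage: output_ids = model.generate(**inputs, **BEST_GEN_KWARGS)
```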
---
## 📊 Evaluation Metric
- CIDEr (via pycocoevalcap)
- Validation loss
- Confidence estimation
---
## 🖥️ Demo
Streamlit app includes:
- Image uploader
- Beam controls
- Toxicity filtering
- Confidence display
- Attention heatmap
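For the confidence display, one common estimate is the exponentiated mean token log-probability of the generated caption; a minimal sketch (the exact formula used in `app.py` is not shown here):

```python
import math

def caption_confidence(token_logprobs):
    """Exponentiated average log-probability: 1.0 means the model was
    certain of every token; values near 0 indicate low confidence."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```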
Run:
```bash
streamlit run app.py
```