---
title: Image Captioning
emoji: 🖼️
colorFrom: indigo
colorTo: pink
sdk: streamlit
python_version: "3.10"
app_file: app.py
pinned: false
---

# Image Captioning (Streamlit)

This repo hosts a Streamlit app (`app.py`) that compares multiple image-captioning models.

## Why your models should NOT be inside the app repo

Fine-tuned checkpoints are large. Public hosting (Hugging Face Spaces / Streamlit Cloud) works best when:

- the app repo stays small
- models live on the Hugging Face Hub (or S3/GCS)
- the app downloads models at startup (cached by `transformers`)

## 1) Upload your saved models to Hugging Face Hub

Example for BLIP (you already have `uploadtohf.py`):

```bash
pip install -U transformers huggingface_hub
huggingface-cli login
python uploadtohf.py
```

Do the same for your other local folders (`saved_vit_gpt2`, `saved_git_model`) by pushing them to separate Hub repos.
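
A minimal sketch of pushing the remaining folders, assuming the `huggingface_hub` Python API (`HfApi.create_repo` and `HfApi.upload_folder`). The contents of `uploadtohf.py` are not shown in this README, so this is not necessarily how that script works. The `your-username` namespace and the helper names `build_upload_plan` and `upload_all` are hypothetical; the repo names follow the default Hub IDs listed in the next section.

```python
from pathlib import Path

# Local folder -> Hub repo name. Folder names come from this README;
# the repo names mirror the default Hub IDs and can be changed freely.
FOLDER_TO_REPO = {
    "saved_vit_gpt2": "vit-gpt2-caption-model",
    "saved_git_model": "git-caption-model",
}


def build_upload_plan(namespace, root="."):
    """Return (folder_path, repo_id) pairs for the folders that exist under root."""
    plan = []
    for folder, repo_name in FOLDER_TO_REPO.items():
        path = Path(root) / folder
        if path.is_dir():
            plan.append((str(path), f"{namespace}/{repo_name}"))
    return plan


def upload_all(namespace, root="."):
    """Create each model repo if needed, then upload the folder contents."""
    # Imported lazily so the planning helper works without the package.
    from huggingface_hub import HfApi

    api = HfApi()
    for folder_path, repo_id in build_upload_plan(namespace, root):
        api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
        api.upload_folder(folder_path=folder_path, repo_id=repo_id, repo_type="model")
```

After `huggingface-cli login`, calling `upload_all("your-username")` from the repo root would push each folder that exists to its own Hub repo.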

## 2) Configure the app to load from Hub

`app.py` loads **local folders if present**; otherwise it falls back to Hub IDs read from environment variables. In this repo, the defaults are:

- `BLIP_MODEL_ID` (default: `pchandragrid/blip-caption-model`)
- `VITGPT2_MODEL_ID` (default: `pchandragrid/vit-gpt2-caption-model`)
- `GIT_MODEL_ID` (default: `pchandragrid/git-caption-model`)

You can also override local folder names:

- `BLIP_LOCAL_DIR` (default: `saved_model_phase2`)
- `VITGPT2_LOCAL_DIR` (default: `saved_vit_gpt2`)
- `GIT_LOCAL_DIR` (default: `saved_git_model`)
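
The local-first, Hub-fallback behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actual code in `app.py`: the function name `resolve_model_source` and the `MODEL_SOURCES` table are hypothetical, while the environment variable names and defaults are the ones this repo lists.

```python
import os

# name -> (local-dir env var, default folder, model-ID env var, default Hub ID)
MODEL_SOURCES = {
    "blip": ("BLIP_LOCAL_DIR", "saved_model_phase2",
             "BLIP_MODEL_ID", "pchandragrid/blip-caption-model"),
    "vitgpt2": ("VITGPT2_LOCAL_DIR", "saved_vit_gpt2",
                "VITGPT2_MODEL_ID", "pchandragrid/vit-gpt2-caption-model"),
    "git": ("GIT_LOCAL_DIR", "saved_git_model",
            "GIT_MODEL_ID", "pchandragrid/git-caption-model"),
}


def resolve_model_source(name, env=None):
    """Return the local folder if it exists, otherwise the Hub model ID."""
    env = os.environ if env is None else env
    dir_var, default_dir, id_var, default_id = MODEL_SOURCES[name]

    local_dir = env.get(dir_var, default_dir)
    if os.path.isdir(local_dir):
        return local_dir  # a local checkpoint takes priority

    return env.get(id_var, default_id)  # fall back to the Hub
```

The returned string can be passed directly to a `from_pretrained(...)` call, which accepts either a local path or a Hub ID.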

## 3) Deploy options

### Option A: Hugging Face Spaces (recommended)

- Create a new Space and select **Streamlit** as the SDK
- Push this repo (must include `app.py` + `requirements.txt`)
- In Space “Variables”, set `BLIP_MODEL_ID`, `VITGPT2_MODEL_ID`, `GIT_MODEL_ID` to your Hub repos
- If any model repo is private, add `HF_TOKEN` as a Space **Secret**

### Option B: Streamlit Community Cloud

- Point it to this repo
- Set the same env vars in the app settings

## Local run

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

# 🖼️ Image Captioning with BLIP (COCO Subset)

## 📌 Problem

Generate natural language descriptions for images using transformer-based vision-language models.

Goals:
- Improve CIDEr score by 10%+
- Compare architectures (BLIP vs ViT-GPT2)
- Analyze resolution impact (224 vs 320 vs 384)
- Optimize decoding parameters
- Deploy a minimal inference UI

---

## 📂 Dataset

- MS COCO Captions (subset: 10k & 20k)
- Random caption selection (from the 5 captions per image)
- Experiments:
  - Short captions
  - Mixed captions
  - Filtered captions

Train/Validation split: 90/10
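
A rough sketch of how such a subset and split could be built. It assumes COCO-style annotation dicts with `image_id` and `caption` keys and reads "random caption selection" as choosing one caption per image at random. The function name `build_subset`, the seed, and the exact sampling order are illustrative assumptions, not the project's actual preprocessing code.

```python
import random
from collections import defaultdict


def build_subset(annotations, subset_size, val_fraction=0.1, seed=42):
    """Sample images, pick one random caption each, and split train/val."""
    rng = random.Random(seed)

    # Group the captions by image.
    captions_by_image = defaultdict(list)
    for ann in annotations:
        captions_by_image[ann["image_id"]].append(ann["caption"])

    # Draw the image subset (10k or 20k in the experiments above).
    image_ids = sorted(captions_by_image)
    rng.shuffle(image_ids)
    image_ids = image_ids[:subset_size]

    # One randomly chosen caption per image (assumed interpretation).
    pairs = [(img, rng.choice(captions_by_image[img])) for img in image_ids]

    # Hold out val_fraction of the pairs for validation (90/10 by default).
    n_val = int(len(pairs) * val_fraction)
    return pairs[n_val:], pairs[:n_val]
```

Fixing the seed keeps the split reproducible across runs, which matters when CIDEr scores from different experiments are compared.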

---

## 🧠 Models

### 1️⃣ BLIP (Primary Model)
- Salesforce/blip-image-captioning-base
- Vision encoder frozen (for efficiency)
- Gradient checkpointing enabled
- Mixed precision on MPS
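
A minimal sketch of the frozen-encoder setup, assuming the `transformers` class `BlipForConditionalGeneration` and its `vision_model` submodule. The helper names `freeze` and `load_blip_for_finetuning` are hypothetical, the training script itself is not shown in this README, and mixed precision is left out.

```python
def freeze(module):
    """Disable gradients for every parameter of `module`; return how many were frozen."""
    count = 0
    for param in module.parameters():
        param.requires_grad = False
        count += 1
    return count


def load_blip_for_finetuning(model_id="Salesforce/blip-image-captioning-base"):
    """Load BLIP, freeze its vision encoder, and enable gradient checkpointing."""
    # Imported lazily so `freeze` can be used without transformers installed.
    from transformers import BlipForConditionalGeneration

    model = BlipForConditionalGeneration.from_pretrained(model_id)
    freeze(model.vision_model)             # the vision encoder is no longer updated
    model.gradient_checkpointing_enable()  # recompute activations to save memory
    return model
```

With the vision encoder frozen, an optimizer built from `[p for p in model.parameters() if p.requires_grad]` would update only the remaining weights.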

### 2️⃣ ViT-GPT2 (Comparison)
- ViT base encoder
- GPT2 decoder with cross-attention

---

## 🧪 Experiments

### Resolution Comparison
| Resolution | Dataset | CIDEr |
|------------|---------|--------|
| 224px | 10k | ~1.28 |
| 320px | 20k | ~1.33–1.38 |
| 384px | 20k | ~1.40+ |

### Beam Search Tuning
Tested:
- Beams: 3, 5, 8
- Length penalty: 0.8, 1.0, 1.2
- Max length: 20, 30, 40

Best config:
Beams=5, MaxLen=20, LengthPenalty=1.0
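
The search space above can be enumerated as below. The values are the ones listed in this section, and `num_beams`, `length_penalty`, and `max_length` are the corresponding `generate()` keyword arguments in `transformers`. Whether the original sweep was a full grid is an assumption, and `generation_grid` and `pick_best` are hypothetical helper names.

```python
from itertools import product

BEAMS = (3, 5, 8)
LENGTH_PENALTIES = (0.8, 1.0, 1.2)
MAX_LENGTHS = (20, 30, 40)


def generation_grid():
    """Return one dict of generate() keyword arguments per combination."""
    return [
        {"num_beams": beams, "length_penalty": penalty, "max_length": max_len}
        for beams, penalty, max_len in product(BEAMS, LENGTH_PENALTIES, MAX_LENGTHS)
    ]


def pick_best(configs, score_fn):
    """Return the config with the highest score under score_fn (e.g. validation CIDEr)."""
    return max(configs, key=score_fn)
```

Each config can be unpacked into a call such as `model.generate(**inputs, **config)`, with `score_fn` computing validation CIDEr for the resulting captions.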

---

## 📊 Evaluation Metrics

- CIDEr (via pycocoevalcap)
- Validation loss
- Confidence estimation
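
A hedged sketch of scoring captions with `pycocoevalcap`, assuming its `Cider().compute_score(gts, res)` interface, where both arguments map an image ID to a list of caption strings. The lowercase-and-whitespace normalization is a simplification of the tokenization used in the standard COCO pipeline, and `to_cider_inputs` and `cider_score` are hypothetical helper names.

```python
def to_cider_inputs(references, predictions):
    """Build the (gts, res) dicts expected by pycocoevalcap scorers.

    references:  {image_id: [caption, ...]}
    predictions: {image_id: caption}
    """
    def normalize(text):
        # Simplified stand-in for the tokenizer used in the COCO pipeline.
        return " ".join(text.lower().split())

    missing = set(predictions) - set(references)
    if missing:
        raise KeyError(f"no reference captions for image ids: {sorted(missing)}")

    gts = {i: [normalize(c) for c in references[i]] for i in predictions}
    res = {i: [normalize(predictions[i])] for i in predictions}  # one caption per image
    return gts, res


def cider_score(references, predictions):
    """Return the corpus-level CIDEr score for the predicted captions."""
    # Imported lazily so the formatting helper works without pycocoevalcap.
    from pycocoevalcap.cider.cider import Cider

    gts, res = to_cider_inputs(references, predictions)
    score, _per_image_scores = Cider().compute_score(gts, res)
    return score
```

CIDEr weights n-grams by their frequency across the evaluated corpus, so scores on very small validation sets can be unstable.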

---

## 🖥️ Demo

Streamlit app includes:
- Image uploader
- Beam controls
- Toxicity filtering
- Confidence display
- Attention heatmap

Run:
```bash
streamlit run app.py
```