---
title: Image Captioning
emoji: 🖼️
colorFrom: indigo
colorTo: pink
sdk: streamlit
python_version: "3.10"
app_file: app.py
pinned: false
---
# Image Captioning (Streamlit)
This repo hosts a Streamlit app (`app.py`) that compares multiple image-captioning models.
## Why your models should NOT be inside the app repo
Fine-tuned checkpoints are large. Public hosting (Hugging Face Spaces / Streamlit Cloud) works best when:
- the app repo stays small
- models live on the Hugging Face Hub (or S3/GCS)
- the app downloads models at startup (cached by `transformers`)
## 1) Upload your saved models to Hugging Face Hub
Example for BLIP (you already have `uploadtohf.py`):
```bash
pip install -U transformers huggingface_hub
huggingface-cli login
python uploadtohf.py
```
Do the same for your other local folders (`saved_vit_gpt2`, `saved_git_model`) by pushing them to separate Hub repos.
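For the other folders, a minimal upload sketch using `huggingface_hub` (the Hub repo names here are placeholders, not the repo's actual defaults):

```python
# Local checkpoint folders mapped to hypothetical Hub repo names.
FOLDER_TO_REPO = {
    "saved_vit_gpt2": "your-username/vit-gpt2-caption-model",
    "saved_git_model": "your-username/git-caption-model",
}

def push_folders(mapping=FOLDER_TO_REPO):
    """Upload each local model folder to its own Hub repo."""
    from huggingface_hub import HfApi  # imported lazily; only needed for the push

    api = HfApi()
    for folder, repo_id in mapping.items():
        api.create_repo(repo_id, exist_ok=True)
        api.upload_folder(folder_path=folder, repo_id=repo_id)

# push_folders()  # requires `huggingface-cli login` first
```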
## 2) Configure the app to load from Hub
`app.py` loads **local folders if present**, otherwise falls back to Hub IDs via environment variables. In this repo, the defaults are:
- `BLIP_MODEL_ID` (default: `pchandragrid/blip-caption-model`)
- `VITGPT2_MODEL_ID` (default: `pchandragrid/vit-gpt2-caption-model`)
- `GIT_MODEL_ID` (default: `pchandragrid/git-caption-model`)
You can also override local folder names:
- `BLIP_LOCAL_DIR` (default: `saved_model_phase2`)
- `VITGPT2_LOCAL_DIR` (default: `saved_vit_gpt2`)
- `GIT_LOCAL_DIR` (default: `saved_git_model`)
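The local-first, Hub-fallback resolution described above can be sketched as follows (the helper name is illustrative; the folder, env-var, and repo names match the defaults listed):

```python
import os

def resolve_model_source(local_dir: str, env_var: str, default_repo: str) -> str:
    """Return the local folder if it exists; otherwise the Hub repo id
    taken from the environment, falling back to the built-in default."""
    if os.path.isdir(local_dir):
        return local_dir
    return os.environ.get(env_var, default_repo)

# e.g. blip_source = resolve_model_source(
#     "saved_model_phase2", "BLIP_MODEL_ID", "pchandragrid/blip-caption-model")
```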
## 3) Deploy options
### Option A: Hugging Face Spaces (recommended)
- Create a new Space: **Streamlit**
- Push this repo (must include `app.py` + `requirements.txt`)
- In Space “Variables”, set `BLIP_MODEL_ID`, `VITGPT2_MODEL_ID`, `GIT_MODEL_ID` to your Hub repos
- If any model repo is private, add `HF_TOKEN` as a Space **Secret**
### Option B: Streamlit Community Cloud
- Point it to this repo
- Set the same env vars in the app settings
## Local run
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
# 🖼️ Image Captioning with BLIP (COCO Subset)
## 📌 Problem
Generate natural language descriptions for images using transformer-based vision-language models.
Goals:
- Improve CIDEr score by 10%+
- Compare architectures (BLIP vs ViT-GPT2)
- Analyze resolution impact (224 vs 320 vs 384)
- Optimize decoding parameters
- Deploy minimal inference UI
---
## 📂 Dataset
- MS COCO Captions (subset: 10k & 20k)
- Random caption selection (5 captions per image)
- Experiments:
- Short captions
- Mixed captions
- Filtered captions
Train/Validation split: 90/10
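The random caption selection and 90/10 split can be sketched in pure Python (the record format and function name are illustrative, not the repo's actual pipeline):

```python
import random

def make_split(records, seed=42, val_frac=0.1):
    """records: list of (image_id, [captions]) pairs with ~5 captions per image.
    Picks one caption per image at random, then splits 90/10 into train/val."""
    rng = random.Random(seed)
    pairs = [(img, rng.choice(caps)) for img, caps in records]
    rng.shuffle(pairs)
    n_val = int(len(pairs) * val_frac)
    return pairs[n_val:], pairs[:n_val]
```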
---
## 🧠 Models
### 1️⃣ BLIP (Primary Model)
- Salesforce/blip-image-captioning-base
- Vision encoder frozen (for efficiency)
- Gradient checkpointing enabled
- Mixed precision on MPS
### 2️⃣ ViT-GPT2 (Comparison)
- ViT base encoder
- GPT2 decoder with cross-attention
---
## 🧪 Experiments
### Resolution Comparison
| Resolution | Dataset | CIDEr |
|------------|---------|--------|
| 224px | 10k | ~1.28 |
| 320px | 20k | ~1.33–1.38 |
| 384px | 20k | ~1.40+ |
### Beam Search Tuning
Tested:
- Beams: 3, 5, 8
- Length penalty: 0.8, 1.0, 1.2
- Max length: 20, 30, 40
Best config:
Beams=5, MaxLen=20, LengthPenalty=1.0
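Expressed as `generate()` keyword arguments, the best configuration would look like this (a sketch; the variable names are illustrative):

```python
# Best decoding configuration from the beam-search sweep.
BEST_GEN_KWARGS = dict(
    num_beams=5,        # Beams=5
    max_length=20,      # MaxLen=20
    length_penalty=1.0, # LengthPenalty=1.0
)

# Usage with a loaded captioning model (not run here):
# output_ids = model.generate(**inputs, **BEST_GEN_KWARGS)
```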
---
## 📊 Evaluation Metric
- CIDEr (via pycocoevalcap)
- Validation loss
- Confidence estimation
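The confidence estimate is not specified in detail; one common approach, sketched here as an assumption, maps the mean per-token log-probability of the generated caption to a 0-1 score:

```python
import math

def caption_confidence(token_logprobs):
    """Geometric-mean-style confidence: exp of the mean token log-probability.
    Returns 0.0 for an empty caption."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```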
---
## 🖥️ Demo
Streamlit app includes:
- Image uploader
- Beam controls
- Toxicity filtering
- Confidence display
- Attention heatmap
Run:
```bash
streamlit run app.py
```