---
title: Image Captioning
emoji: 🖼️
colorFrom: indigo
colorTo: pink
sdk: streamlit
python_version: "3.10"
app_file: app.py
pinned: false
---
# Image Captioning (Streamlit)
This repo hosts a Streamlit app (`app.py`) that compares multiple image-captioning models.
## Why your models should NOT be inside the app repo
Fine-tuned checkpoints are large. Public hosting (Hugging Face Spaces / Streamlit Cloud) works best when:
- the app repo stays small
- models live on the Hugging Face Hub (or S3/GCS)
- the app downloads models at startup (cached by `transformers`)
## 1) Upload your saved models to Hugging Face Hub
Example for BLIP (you already have `uploadtohf.py`):
```bash
pip install -U transformers huggingface_hub
huggingface-cli login
python uploadtohf.py
```
Do the same for your other local folders (`saved_vit_gpt2`, `saved_git_model`) by pushing them to separate Hub repos.
## 2) Configure the app to load from Hub
`app.py` loads **local folders if present**, otherwise it falls back to Hub repo IDs set via environment variables. In this repo the defaults are:
- `BLIP_MODEL_ID` (default: `pchandragrid/blip-caption-model`)
- `VITGPT2_MODEL_ID` (default: `pchandragrid/vit-gpt2-caption-model`)
- `GIT_MODEL_ID` (default: `pchandragrid/git-caption-model`)
You can also override local folder names:
- `BLIP_LOCAL_DIR` (default: `saved_model_phase2`)
- `VITGPT2_LOCAL_DIR` (default: `saved_vit_gpt2`)
- `GIT_LOCAL_DIR` (default: `saved_git_model`)
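The local-folder-first fallback described above can be sketched as follows (the folder and variable names match the defaults listed; the exact logic inside `app.py` may differ):

```python
import os

def resolve_model_source(local_dir: str, env_var: str, default_repo: str) -> str:
    """Prefer a local checkpoint folder if it exists; otherwise use the
    Hub repo ID from the environment variable, falling back to a default."""
    if os.path.isdir(local_dir):
        return local_dir
    return os.environ.get(env_var, default_repo)

# Example: resolve the BLIP checkpoint source.
blip_source = resolve_model_source(
    "saved_model_phase2", "BLIP_MODEL_ID", "pchandragrid/blip-caption-model"
)
```

The returned string can be passed straight to `from_pretrained`, which accepts both local paths and Hub IDs.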
## 3) Deploy options
### Option A: Hugging Face Spaces (recommended)
- Create a new Space and choose the **Streamlit** SDK
- Push this repo (must include `app.py` + `requirements.txt`)
- In Space “Variables”, set `BLIP_MODEL_ID`, `VITGPT2_MODEL_ID`, `GIT_MODEL_ID` to your Hub repos
- If any model repo is private, add `HF_TOKEN` as a Space **Secret**
### Option B: Streamlit Community Cloud
- Point it to this repo
- Set the same env vars in the app settings
## Local run
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
# 🖼️ Image Captioning with BLIP (COCO Subset)
## 📌 Problem
Generate natural language descriptions for images using transformer-based vision-language models.
Goals:
- Improve CIDEr score by 10%+
- Compare architectures (BLIP vs ViT-GPT2)
- Analyze resolution impact (224 vs 320 vs 384)
- Optimize decoding parameters
- Deploy minimal inference UI
---
## 📂 Dataset
- MS COCO Captions (subset: 10k & 20k)
- Random caption selection (5 captions per image)
- Experiments:
- Short captions
- Mixed captions
- Filtered captions
Train/Validation split: 90/10
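The caption sampling and 90/10 split could look like this (a stdlib-only sketch; the actual preprocessing code is not shown in this README):

```python
import random

def pick_caption(captions):
    """Select one of an image's (typically 5) COCO reference captions at random."""
    return random.choice(captions)

def split_dataset(items, val_frac=0.1, seed=42):
    """Shuffle and split into train/validation (90/10 by default)."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_val = int(len(items) * val_frac)
    return items[n_val:], items[:n_val]
```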
---
## 🧠 Models
### 1️⃣ BLIP (Primary Model)
- Salesforce/blip-image-captioning-base
- Vision encoder frozen (for efficiency)
- Gradient checkpointing enabled
- Mixed precision on MPS
### 2️⃣ ViT-GPT2 (Comparison)
- ViT base encoder
- GPT2 decoder with cross-attention
---
## 🧪 Experiments
### Resolution Comparison
| Resolution | Dataset | CIDEr |
|------------|---------|--------|
| 224px | 10k | ~1.28 |
| 320px | 20k | ~1.33–1.38 |
| 384px | 20k | ~1.40+ |
### Beam Search Tuning
Tested:
- Beams: 3, 5, 8
- Length penalty: 0.8, 1.0, 1.2
- Max length: 20, 30, 40
Best config:
Beams=5, MaxLen=20, LengthPenalty=1.0
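With `transformers`, the winning configuration maps onto standard `generate()` keyword arguments roughly as follows (how the app actually passes them is an assumption):

```python
# Best decoding settings from the sweep above, as generate() kwargs.
BEST_GEN_KWARGS = {
    "num_beams": 5,        # beam width
    "max_length": 20,      # cap on generated tokens
    "length_penalty": 1.0  # neutral length preference
}
# Hypothetical usage: output_ids = model.generate(**inputs, **BEST_GEN_KWARGS)
```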
---
## 📊 Evaluation Metric
- CIDEr (via pycocoevalcap)
- Validation loss
- Confidence estimation
---
## 🖥️ Demo
Streamlit app includes:
- Image uploader
- Beam controls
- Toxicity filtering
- Confidence display
- Attention heatmap
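For the confidence display, one common estimate is the exponentiated mean token log-probability of the generated caption; a minimal sketch (the exact formula used in `app.py` is not shown here):

```python
import math

def caption_confidence(token_logprobs):
    """Exponentiated average log-probability: 1.0 means the model was
    certain of every token; values near 0 indicate low confidence."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))
```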
Run:
```bash
streamlit run app.py
```