---
title: Image Captioning
emoji: 🖼️
colorFrom: indigo
colorTo: pink
sdk: streamlit
python_version: "3.10"
app_file: app.py
pinned: false
---
# Image Captioning (Streamlit)

This repo hosts a Streamlit app (`app.py`) that compares multiple image-captioning models.

## Why your models should NOT be inside the app repo

Fine-tuned checkpoints are large. Public hosting (Hugging Face Spaces / Streamlit Cloud) works best when:

- the app repo stays small
- models live on the Hugging Face Hub (or S3/GCS)
- the app downloads models at startup (cached by `transformers`)
## 1) Upload your saved models to Hugging Face Hub

Example for BLIP (you already have `uploadtohf.py`):

```bash
pip install -U transformers huggingface_hub
huggingface-cli login
python uploadtohf.py
```

Do the same for your other local folders (`saved_vit_gpt2`, `saved_git_model`) by pushing them to separate Hub repos.
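If you need to write such an upload script from scratch, a minimal sketch using `huggingface_hub` might look like the following. The function name and the example repo ID are placeholders, not taken from `uploadtohf.py`:

```python
def push_folder_to_hub(local_dir: str, repo_id: str) -> None:
    """Push a saved model folder (config, weights, tokenizer files) to a Hub repo.

    Assumes you have already authenticated via `huggingface-cli login`.
    """
    from huggingface_hub import HfApi  # imported lazily so the sketch loads without the package

    api = HfApi()
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model")

# e.g. push_folder_to_hub("saved_vit_gpt2", "your-username/vit-gpt2-caption-model")
```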
## 2) Configure the app to load from Hub

`app.py` loads **local folders if present**, otherwise falls back to Hub IDs via environment variables. In this repo, the defaults are:

- `BLIP_MODEL_ID` (default: `pchandragrid/blip-caption-model`)
- `VITGPT2_MODEL_ID` (default: `pchandragrid/vit-gpt2-caption-model`)
- `GIT_MODEL_ID` (default: `pchandragrid/git-caption-model`)

You can also override the local folder names:

- `BLIP_LOCAL_DIR` (default: `saved_model_phase2`)
- `VITGPT2_LOCAL_DIR` (default: `saved_vit_gpt2`)
- `GIT_LOCAL_DIR` (default: `saved_git_model`)
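The resolution order above can be sketched as a small helper. The function name is illustrative, not necessarily what `app.py` defines:

```python
import os

def resolve_model_source(local_dir: str, env_var: str, default_repo: str) -> str:
    """Prefer a local checkpoint folder; otherwise fall back to a Hub repo ID."""
    if os.path.isdir(local_dir):
        return local_dir
    return os.environ.get(env_var, default_repo)

blip_source = resolve_model_source(
    "saved_model_phase2", "BLIP_MODEL_ID", "pchandragrid/blip-caption-model"
)
```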
## 3) Deploy options

### Option A: Hugging Face Spaces (recommended)

- Create a new Space with the **Streamlit** SDK
- Push this repo (must include `app.py` + `requirements.txt`)
- In the Space's "Variables" settings, set `BLIP_MODEL_ID`, `VITGPT2_MODEL_ID`, and `GIT_MODEL_ID` to your Hub repos
- If any model repo is private, add `HF_TOKEN` as a Space **Secret**

### Option B: Streamlit Community Cloud

- Point it to this repo
- Set the same environment variables in the app settings
## Local run

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```
| # 🖼️ Image Captioning with BLIP (COCO Subset) | |
## 📌 Problem

Generate natural-language descriptions for images using transformer-based vision-language models.

Goals:

- Improve CIDEr score by 10%+
- Compare architectures (BLIP vs ViT-GPT2)
- Analyze resolution impact (224 vs 320 vs 384)
- Optimize decoding parameters
- Deploy a minimal inference UI
| --- | |
| ## 📂 Dataset | |
| - MS COCO Captions (subset: 10k & 20k) | |
| - Random caption selection (5 captions per image) | |
| - Experiments: | |
| - Short captions | |
| - Mixed captions | |
| - Filtered captions | |
| Train/Validation split: 90/10 | |
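The 90/10 split can be reproduced with a simple seeded shuffle. This is a sketch; the actual training code may split differently:

```python
import random

def train_val_split(ids, val_frac=0.1, seed=42):
    """Shuffle image ids deterministically and carve off a validation slice."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]

train_ids, val_ids = train_val_split(range(10_000))  # 9000 train / 1000 val
```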
| --- | |
| ## 🧠 Models | |
| ### 1️⃣ BLIP (Primary Model) | |
| - Salesforce/blip-image-captioning-base | |
| - Vision encoder frozen (for efficiency) | |
| - Gradient checkpointing enabled | |
| - Mixed precision on MPS | |
| ### 2️⃣ ViT-GPT2 (Comparison) | |
| - ViT base encoder | |
| - GPT2 decoder with cross-attention | |
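Freezing the vision encoder follows the usual PyTorch pattern: set `requires_grad = False` on the encoder's parameters so they receive no gradients. A sketch with a toy stand-in module (real code would load `Salesforce/blip-image-captioning-base` and freeze its `vision_model` submodule):

```python
import torch.nn as nn

class ToyCaptioner(nn.Module):
    """Stand-in with the attribute layout assumed here for BLIP."""
    def __init__(self):
        super().__init__()
        self.vision_model = nn.Linear(8, 8)   # plays the role of the ViT encoder
        self.text_decoder = nn.Linear(8, 8)   # plays the role of the caption decoder

def freeze_vision_encoder(model: nn.Module) -> None:
    # Frozen parameters get no gradients, cutting memory and compute per step.
    for p in model.vision_model.parameters():
        p.requires_grad = False

model = ToyCaptioner()
freeze_vision_encoder(model)
# With a real transformers model, gradient checkpointing is enabled separately:
# model.gradient_checkpointing_enable()
```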
| --- | |
| ## 🧪 Experiments | |
| ### Resolution Comparison | |
| | Resolution | Dataset | CIDEr | | |
| |------------|---------|--------| | |
| | 224px | 10k | ~1.28 | | |
| | 320px | 20k | ~1.33–1.38 | | |
| | 384px | 20k | ~1.40+ | | |
| ### Beam Search Tuning | |
| Tested: | |
| - Beams: 3, 5, 8 | |
| - Length penalty: 0.8, 1.0, 1.2 | |
| - Max length: 20, 30, 40 | |
| Best config: | |
| Beams=5, MaxLen=20, LengthPenalty=1.0 | |
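The winning configuration maps directly onto `generate()` keyword arguments. A hedged sketch follows; the `caption` helper is an illustrative addition, not taken from `app.py`:

```python
# Best decoding configuration from the sweep above, as generate() kwargs.
BEST_GEN_KWARGS = dict(num_beams=5, max_length=20, length_penalty=1.0)

def caption(model, processor, image):
    """Encode one image and decode it with the best beam-search configuration."""
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, **BEST_GEN_KWARGS)
    return processor.decode(output_ids[0], skip_special_tokens=True)
```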
| --- | |
## 📊 Evaluation Metrics
| - CIDEr (via pycocoevalcap) | |
| - Validation loss | |
| - Confidence estimation | |
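pycocoevalcap's CIDEr scorer takes two dicts keyed by image id: references map each id to all of its ground-truth captions, hypotheses map each id to a one-element list holding the model's caption. A sketch (the captions are made up, and the lazy import keeps the snippet loadable without pycocoevalcap installed):

```python
gts = {  # references: every ground-truth caption for each image id
    "img1": ["a dog runs on the grass", "a brown dog running outside"],
    "img2": ["a plate of food on a wooden table"],
}
res = {  # hypotheses: exactly one generated caption per image id
    "img1": ["a dog running in a field"],
    "img2": ["a plate with food on a table"],
}

def cider(gts, res):
    from pycocoevalcap.cider.cider import Cider  # imported lazily
    score, per_image = Cider().compute_score(gts, res)
    return score
```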
| --- | |
| ## 🖥️ Demo | |
| Streamlit app includes: | |
| - Image uploader | |
| - Beam controls | |
| - Toxicity filtering | |
| - Confidence display | |
| - Attention heatmap | |
Run:

```bash
streamlit run app.py
```