---
title: Image Captioning
emoji: 🖼️
colorFrom: indigo
colorTo: pink
sdk: streamlit
python_version: "3.10"
app_file: app.py
pinned: false
---

# Image Captioning (Streamlit)

This repo hosts a Streamlit app (`app.py`) that compares multiple image-captioning models.

## Why your models should NOT be inside the app repo

Fine-tuned checkpoints are large. Public hosting (Hugging Face Spaces / Streamlit Cloud) works best when:

- the app repo stays small
- models live on the Hugging Face Hub (or S3/GCS)
- the app downloads models at startup (cached by `transformers`)
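The download-once-then-reuse behavior can be sketched with a small memoized loader (a minimal sketch; in the real app the body would call `from_pretrained`, which downloads weights once and caches them under `~/.cache/huggingface`):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_model(model_id: str):
    # In the real app this would be, e.g.,
    #   transformers.BlipForConditionalGeneration.from_pretrained(model_id)
    # which fetches the weights from the Hub on first call and reads
    # them from the local cache on every later call.
    return f"model:{model_id}"  # stand-in for the loaded model object
```

The first call per `model_id` does the work; repeat calls return the same object from the in-process cache.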

## 1) Upload your saved models to Hugging Face Hub

Example for BLIP (you already have `uploadtohf.py`):

```bash
pip install -U transformers huggingface_hub
huggingface-cli login
python uploadtohf.py
```

Do the same for your other local folders (`saved_vit_gpt2`, `saved_git_model`) by pushing them to separate Hub repos.

## 2) Configure the app to load from Hub

`app.py` loads **local folders if present**, otherwise falls back to Hub IDs via environment variables. In this repo the defaults are:

- `BLIP_MODEL_ID` (default: `pchandragrid/blip-caption-model`)
- `VITGPT2_MODEL_ID` (default: `pchandragrid/vit-gpt2-caption-model`)
- `GIT_MODEL_ID` (default: `pchandragrid/git-caption-model`)

You can also override local folder names:

- `BLIP_LOCAL_DIR` (default: `saved_model_phase2`)
- `VITGPT2_LOCAL_DIR` (default: `saved_vit_gpt2`)
- `GIT_LOCAL_DIR` (default: `saved_git_model`)
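The local-folder-first fallback described above can be sketched like this (a minimal sketch; the function name is illustrative, not the actual code in `app.py`):

```python
import os

def resolve_model_source(env_var: str, default_hub_id: str, local_dir: str) -> str:
    """Prefer a local checkpoint folder; otherwise use the Hub repo ID
    from the environment variable, falling back to the built-in default."""
    if os.path.isdir(local_dir):
        return local_dir
    return os.environ.get(env_var, default_hub_id)

# e.g. resolve_model_source("BLIP_MODEL_ID",
#                           "pchandragrid/blip-caption-model",
#                           "saved_model_phase2")
```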

## 3) Deploy options

### Option A: Hugging Face Spaces (recommended)

- Create a new Space: **Streamlit**
- Push this repo (must include `app.py` + `requirements.txt`)
- In Space “Variables”, set `BLIP_MODEL_ID`, `VITGPT2_MODEL_ID`, `GIT_MODEL_ID` to your Hub repos
- If any model repo is private, add `HF_TOKEN` as a Space **Secret**

### Option B: Streamlit Community Cloud

- Point it to this repo
- Set the same env vars in the app settings

## Local run

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

# 🖼️ Image Captioning with BLIP (COCO Subset)

## 📌 Problem

Generate natural language descriptions for images using transformer-based vision-language models.

Goal:
- Improve CIDEr score by 10%+
- Compare architectures (BLIP vs ViT-GPT2)
- Analyze resolution impact (224 vs 320 vs 384)
- Optimize decoding parameters
- Deploy minimal inference UI

---

## 📂 Dataset

- MS COCO Captions (subset: 10k & 20k)
- One caption selected at random from the 5 available per image
- Experiments:
  - Short captions
  - Mixed captions
  - Filtered captions

Train/Validation split: 90/10

---

## 🧠 Models

### 1️⃣ BLIP (Primary Model)
- Salesforce/blip-image-captioning-base
- Vision encoder frozen (for efficiency)
- Gradient checkpointing enabled
- Mixed precision on MPS

### 2️⃣ ViT-GPT2 (Comparison)
- ViT base encoder
- GPT2 decoder with cross-attention
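Freezing the vision encoder can be sketched with a small helper (a sketch assuming a `torch.nn.Module`-style API; the BLIP attribute names in the comments are illustrative):

```python
def freeze(module) -> None:
    """Disable gradients for every parameter of a (sub)module.
    Works for any object exposing .parameters(), e.g. torch.nn.Module."""
    for p in module.parameters():
        p.requires_grad = False

# With the real model (assumes transformers + torch are installed):
#   freeze(model.vision_model)              # keep the ViT encoder fixed
#   model.gradient_checkpointing_enable()   # trade compute for memory
```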

---

## 🧪 Experiments

### Resolution Comparison
| Resolution | Dataset | CIDEr |
|------------|---------|--------|
| 224px | 10k | ~1.28 |
| 320px | 20k | ~1.33–1.38 |
| 384px | 20k | ~1.40+ |

### Beam Search Tuning
Tested:
- Beams: 3, 5, 8
- Length penalty: 0.8, 1.0, 1.2
- Max length: 20, 30, 40

Best config:
Beams=5, MaxLen=20, LengthPenalty=1.0
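The sweep above is a 3×3×3 grid over `model.generate()` keyword arguments; a sketch of enumerating it:

```python
from itertools import product

beams = [3, 5, 8]
length_penalties = [0.8, 1.0, 1.2]
max_lengths = [20, 30, 40]

# 27 candidate decoding configurations, each usable as
# model.generate(**config) kwargs.
grid = [
    {"num_beams": b, "length_penalty": lp, "max_length": ml}
    for b, lp, ml in product(beams, length_penalties, max_lengths)
]

# The best-scoring configuration reported above:
best = {"num_beams": 5, "length_penalty": 1.0, "max_length": 20}
```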

---

## 📊 Evaluation Metric

- CIDEr (via pycocoevalcap)
- Validation loss
- Confidence estimation
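pycocoevalcap's CIDEr scorer takes two dicts keyed by image ID; a sketch of the expected input shape (the captions are made-up, and the `Cider` call is left commented since it assumes `pycocoevalcap` is installed):

```python
# References: image_id -> list of ground-truth captions (COCO has 5 per image)
gts = {
    1: ["a dog runs on the beach", "a brown dog running near the ocean"],
    2: ["a plate of food on a wooden table"],
}
# Hypotheses: image_id -> single-element list with the generated caption
res = {
    1: ["a dog running on a beach"],
    2: ["a table with a plate of food"],
}

# Scoring (assumes pycocoevalcap is installed):
#   from pycocoevalcap.cider.cider import Cider
#   corpus_score, per_image_scores = Cider().compute_score(gts, res)
```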

---

## 🖥️ Demo

Streamlit app includes:
- Image uploader
- Beam controls
- Toxicity filtering
- Confidence display
- Attention heatmap

Run:
```bash
streamlit run app.py
```