---
title: CaptionIQ
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: streamlit
sdk_version: 1.42.0
python_version: "3.10"
app_file: app.py
pinned: false
---
# 🧠 CaptionIQ: AI Image Captioning

> Generate natural-language captions for images with a VGG16/VGG19 encoder and a Bahdanau-attention LSTM decoder trained on the Flickr8K dataset.

---

## ✨ Features

- **Dual CNN Backbones**: VGG16 and VGG19 for spatial feature extraction (7×7×512)
- **Bahdanau Attention LSTM**: Attends to specific image regions per word
- **Ensemble Mode (BLIP)**: High-quality captions from the Salesforce BLIP model
- **Beam Search**: Top-5 diverse captions with confidence bars
- **🔥 Attention Heatmap**: Interactive word-by-word gradient saliency overlay
- **☁️ Word Cloud**: Live word distribution from beam candidates
- **🔄 Model Comparison**: VGG16 vs VGG19 vs Ensemble side-by-side with 🏆 winner
- **📋 Session History**: Track all generated captions, export as JSON/CSV
- **🎲 Surprise Me**: Random Flickr8K image with one click
- **BLEU Evaluation**: Per-image BLEU-1 through BLEU-4 scoring

---

## πŸ—οΈ Architecture

```
Image → VGG16/19 block5_pool → (49 × 512) spatial map
                                      ↓
                          Bahdanau Attention
                                      ↓
Caption tokens → Embedding(256) → LSTM(512) → Softmax(vocab)
```
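
The attention step in the diagram can be illustrated with a minimal NumPy sketch of additive (Bahdanau) attention over the 49 × 512 spatial map. This is not the code in `src/model.py`: the function name, the weight matrices `W1`, `W2`, `v`, and the attention width of 256 are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention: score_i = v . tanh(W1^T f_i + W2^T h).

    features: (49, 512) image regions, hidden: (units,) decoder state.
    Returns the context vector (512,) and the attention weights (49,).
    """
    scores = np.tanh(features @ W1 + hidden @ W2) @ v  # (49,)
    weights = softmax(scores)                          # sums to 1 over regions
    context = weights @ features                       # weighted sum of regions
    return context, weights

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 512))   # stand-in for a block5_pool output
features = feature_map.reshape(49, 512)      # flatten the 7x7 grid into 49 regions
hidden = rng.normal(size=(512,))             # stand-in for the LSTM(512) state

attn_units = 256                             # hypothetical attention width
W1 = rng.normal(scale=0.01, size=(512, attn_units))
W2 = rng.normal(scale=0.01, size=(512, attn_units))
v = rng.normal(scale=0.01, size=(attn_units,))

context, weights = bahdanau_attention(features, hidden, W1, W2, v)
print(context.shape, weights.shape)
```

Reshaping `weights` back to 7 × 7 gives one coarse attention map per generated word.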


---

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Preprocess Dataset

```bash
python src/preprocess.py
```

Downloads Flickr8K, cleans captions, builds vocabulary, creates train/val/test splits.
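
As a rough idea of what caption cleaning and vocabulary building involve, here is a small standard-library sketch. The function names, the `startseq`/`endseq` markers, and the `min_count` threshold are assumptions for illustration; `src/preprocess.py` may use different rules.

```python
import re
from collections import Counter

def clean_caption(text):
    # Lowercase, replace anything that is not a letter or whitespace,
    # collapse runs of whitespace, and wrap with (assumed) boundary tokens.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return "startseq " + " ".join(text.split()) + " endseq"

def build_vocab(captions, min_count=1):
    # Map each sufficiently frequent word to an integer id.
    # Id 0 is left free so it can serve as the padding value.
    counts = Counter(word for caption in captions for word in caption.split())
    words = sorted(w for w, n in counts.items() if n >= min_count)
    return {word: idx + 1 for idx, word in enumerate(words)}

raw = ["A dog runs, fast!", "Two dogs are running."]
cleaned = [clean_caption(c) for c in raw]
vocab = build_vocab(cleaned)
print(cleaned[0])
print(len(vocab))
```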

### 3. Extract Features

```bash
python src/extract_features.py --backbone both
```

Extracts 7×7×512 spatial feature maps (the `block5_pool` output) from VGG16 and VGG19 and saves them as `.pkl` files.

### 4. Train Models

```bash
python src/train.py --backbone both --epochs 20
```

Trains both VGG16 and VGG19 captioning models. Saves checkpoints and loss plots.
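
A data generator for this kind of model typically expands each caption into (prefix, next word) pairs. The sketch below shows that expansion under assumptions: the function name is hypothetical, and left-padding with 0 mirrors the default of Keras `pad_sequences`, which may or may not match what `src/train.py` does.

```python
def make_training_pairs(token_ids, max_len):
    """Expand one tokenized caption into (padded prefix, next token) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i][-max_len:]               # keep at most max_len tokens
        padded = [0] * (max_len - len(prefix)) + prefix  # left-pad with 0
        pairs.append((padded, token_ids[i]))
    return pairs

# Hypothetical ids: 1 = startseq, 5 = dog, 9 = runs, 2 = endseq
pairs = make_training_pairs([1, 5, 9, 2], max_len=5)
for prefix, target in pairs:
    print(prefix, "->", target)
```

Each pair is fed to the decoder together with the image features of the captioned image.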

### 5. Evaluate

```bash
python src/evaluate.py --backbone both
```

Computes BLEU-1 to BLEU-4 on the test set. Prints VGG16 vs VGG19 comparison table.
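
To show what BLEU-n measures, here is a from-scratch sketch of sentence-level BLEU with uniform weights and no smoothing. It is an illustration only and does not reproduce the NLTK-based scoring used by the project; the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Count every contiguous n-gram in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, candidate, n):
    # Clip each candidate n-gram count by its maximum count in any reference.
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(references, candidate, max_n):
    # Geometric mean of the 1..max_n precisions times the brevity penalty.
    precisions = [modified_precision(references, candidate, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is zero if any precision is zero
    c = len(candidate)
    # Reference length closest to the candidate length (shorter wins ties).
    r = min((len(ref) for ref in references), key=lambda n: (abs(n - c), n))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

references = [["a", "dog", "runs", "on", "grass"],
              ["the", "dog", "is", "running"]]
candidate = ["a", "dog", "runs", "fast"]
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(references, candidate, n):.4f}")
```

Without smoothing, short captions often score 0 on BLEU-3 and BLEU-4 because a single missing n-gram order zeroes the product.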

### 6. Generate Captions

```bash
python src/inference.py --image path/to/image.jpg --backbone vgg16
```
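
Caption decoding with beam search can be sketched as follows. The toy transition table stands in for the trained decoder, and all names here (`beam_search`, `toy_model`, `TABLE`) are hypothetical; the actual `src/inference.py` interface may differ.

```python
import math

def beam_search(next_log_probs, start, end, beam_width=5, max_len=20):
    """Keep the beam_width highest-scoring partial captions at every step.

    next_log_probs(seq) must return {token: log_probability} for the next word.
    Returns a list of (sequence, total_log_probability), best first.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = [(seq + [tok], score + logp)
                      for seq, score in beams
                      for tok, logp in next_log_probs(seq).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Captions that emit the end token leave the active beam.
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # captions cut off at max_len
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

# Toy next-word probabilities keyed by the previous word (illustrative only).
TABLE = {
    "startseq": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"endseq": 1.0},
    "cat": {"endseq": 1.0},
}

def toy_model(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

results = beam_search(toy_model, "startseq", "endseq", beam_width=3)
for seq, score in results:
    print(" ".join(seq[1:-1]), round(math.exp(score), 3))
```

Exponentiating the summed log-probability gives a per-caption probability, which is one possible basis for a confidence display.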

### 7. Launch Web App

```bash
streamlit run app.py
```

---

## πŸ“ Project Structure

```
├── data/                    # Dataset & preprocessed files
├── models/                  # Trained model checkpoints (.h5)
├── outputs/                 # Loss plots, BLEU results
├── src/
│   ├── config.py            # Paths & hyperparameters
│   ├── preprocess.py        # Caption cleaning & tokenization
│   ├── extract_features.py  # VGG feature extraction
│   ├── model.py             # CNN-LSTM architecture
│   ├── train.py             # Training with data generator
│   ├── inference.py         # Greedy & beam search
│   ├── evaluate.py          # BLEU score evaluation
│   └── utils.py             # Shared utilities
├── app.py                   # Streamlit web app
├── requirements.txt         # Dependencies
└── README.md
```

---

## 📊 Results

| Metric  | VGG16  | VGG19  |
|---------|--------|--------|
| BLEU-1  | -      | -      |
| BLEU-2  | -      | -      |
| BLEU-3  | -      | -      |
| BLEU-4  | -      | -      |

> Results will be populated after training and evaluation.

---

## πŸ› οΈ Tech Stack

- **Deep Learning**: TensorFlow / Keras
- **Feature Extraction**: VGG16, VGG19 (ImageNet pretrained)
- **Text Processing**: NLTK, Keras Tokenizer
- **Evaluation**: NLTK BLEU
- **Web App**: Streamlit
- **Dataset**: Flickr8K (about 8,000 images, 5 captions each)

---

## 📄 License

This project is for educational purposes.