---
title: CaptionIQ
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: streamlit
sdk_version: 1.42.0
python_version: "3.10"
app_file: app.py
pinned: false
---
# 🧠 CaptionIQ — AI Image Captioning

> Generate natural language captions for images using VGG16/VGG19 + Bahdanau Attention LSTM on the Flickr8K dataset.

---

## ✨ Features

- **Dual CNN Backbones** — VGG16 and VGG19 for spatial feature extraction (7×7×512)
- **Bahdanau Attention LSTM** — Attends to specific image regions per word
- **Ensemble Mode (BLIP)** — High-quality captions from Salesforce BLIP model
- **Beam Search** — Top-5 diverse captions with confidence bars
- **🔥 Attention Heatmap** — Interactive word-by-word gradient saliency overlay
- **☁️ Word Cloud** — Live word distribution from beam candidates
- **🔄 Model Comparison** — VGG16 vs VGG19 vs Ensemble side-by-side with 🏆 winner
- **📋 Session History** — Track all generated captions, export as JSON/CSV
- **🎲 Surprise Me** — Random Flickr8K image with one click
- **BLEU Evaluation** — Per-image BLEU-1 through BLEU-4 scoring
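
The beam search feature can be illustrated with a small, model-free sketch. It assumes a hypothetical `step_fn(tokens)` callback that returns next-token probabilities; in the app this role would be played by the trained captioning model, whose actual interface is not shown here. The toy probability table exists only to make the example runnable.

```python
import math


def beam_search(step_fn, start_token, end_token, beam_width=5, max_len=20):
    """Return up to `beam_width` (tokens, log_prob) pairs, best first.

    `step_fn(tokens)` is a hypothetical callback that maps the tokens decoded
    so far to a dict {next_token: probability}.
    """
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in step_fn(tokens).items():
                if p > 0:
                    candidates.append((tokens + [tok], score + math.log(p)))
        if not candidates:
            break
        # Keep only the highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that hit max_len without ending
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


def toy_step(tokens):
    """Toy stand-in for the captioning model (illustrative probabilities)."""
    table = {
        "<start>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.7, "cat": 0.3},
        "the": {"dog": 0.5, "cat": 0.5},
        "dog": {"<end>": 1.0},
        "cat": {"<end>": 1.0},
    }
    return table[tokens[-1]]


results = beam_search(toy_step, "<start>", "<end>", beam_width=3)
for tokens, log_prob in results:
    print(" ".join(tokens), round(math.exp(log_prob), 3))
```

Exponentiating each log-probability gives a per-caption score that could drive a confidence bar, although how the app computes its bars is not specified here.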

---

## 🏗️ Architecture

```
Image → VGG16/19 block5_pool → (49 × 512) spatial map
                                      ↓
                          Bahdanau Attention
                                      ↓
Caption tokens → Embedding(256) → LSTM(512) → Softmax(vocab)
```
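
The attention step in the diagram can be sketched in plain NumPy. This is a minimal illustration of additive (Bahdanau) attention over the 49 regions, not the code in `src/model.py`: the weight matrices `W1`, `W2`, `v` and the attention width of 256 are assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np


def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention over image regions.

    features: (49, 512) spatial map, hidden: (512,) LSTM state.
    Returns the context vector (512,) and attention weights (49,).
    """
    # score_i = v^T tanh(W1^T f_i + W2^T h)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v
    scores = scores - scores.max()           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ features              # weighted sum of regions
    return context, weights


rng = np.random.default_rng(0)
features = rng.normal(size=(49, 512))         # stand-in for VGG features
hidden = rng.normal(size=(512,))              # stand-in for the LSTM state
attn_units = 256                              # assumed attention width
W1 = rng.normal(scale=0.01, size=(512, attn_units))
W2 = rng.normal(scale=0.01, size=(512, attn_units))
v = rng.normal(scale=0.01, size=(attn_units,))

context, weights = bahdanau_attention(features, hidden, W1, W2, v)
print(context.shape, weights.shape)
```

Reshaping `weights` to 7×7 yields one coarse map per generated word, which is the kind of grid a region-level heatmap overlay can be built from.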


---

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Preprocess Dataset

```bash
python src/preprocess.py
```

Downloads Flickr8K, cleans the captions, builds the vocabulary, and creates train/val/test splits.
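
As a rough illustration of the caption cleaning and vocabulary step, the sketch below uses only the standard library. The function names, the `startseq`/`endseq` markers, and the `min_count` threshold are assumptions for this example and may differ from what `src/preprocess.py` actually does.

```python
import re
from collections import Counter


def clean_caption(text):
    """Lowercase, strip non-letters, and wrap with sequence markers."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    words = text.split()
    # startseq/endseq are assumed marker tokens, not confirmed by the project
    return "startseq " + " ".join(words) + " endseq"


def build_vocab(captions, min_count=1):
    """Map each sufficiently frequent word to an integer id (0 = padding)."""
    counts = Counter(w for c in captions for w in c.split())
    words = sorted(w for w, n in counts.items() if n >= min_count)
    return {w: i for i, w in enumerate(words, start=1)}


raw = ["A dog runs, fast!", "A black dog jumps."]
cleaned = [clean_caption(c) for c in raw]
vocab = build_vocab(cleaned)
print(cleaned[0])
print(len(vocab))
```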

### 3. Extract Features

```bash
python src/extract_features.py --backbone both
```

Extracts `block5_pool` feature maps (7×7×512, flattened to 49 × 512) from VGG16 and VGG19 and saves them as `.pkl` files.
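
The Architecture section describes the attention input as a (49 × 512) spatial map. The sketch below shows how a 7×7×512 `block5_pool` output could be flattened into that shape and pickled. It uses a zero array as a stand-in for real VGG output, and the dictionary layout keyed by file name is an assumption, not the confirmed format of the project's `.pkl` files.

```python
import pickle

import numpy as np


def to_region_sequence(feature_map):
    """Flatten an (H, W, C) feature map into (H*W, C) region vectors."""
    h, w, c = feature_map.shape
    return feature_map.reshape(h * w, c)


# Stand-in for one image's block5_pool output (real values come from VGG).
fake_map = np.zeros((7, 7, 512), dtype=np.float32)

# Assumed storage layout: {image file name: (49, 512) array}.
features = {"example.jpg": to_region_sequence(fake_map)}

blob = pickle.dumps(features)        # what would be written to the .pkl file
restored = pickle.loads(blob)
print(restored["example.jpg"].shape)
```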

### 4. Train Models

```bash
python src/train.py --backbone both --epochs 20
```

Trains one captioning model per backbone (VGG16 and VGG19) and saves checkpoints and loss plots.
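
One common way for a data generator to feed a captioning model is to expand each caption into (prefix, next word) pairs. The sketch below shows that expansion under stated assumptions: the function name, the left padding, and `pad_id=0` are illustrative, and `src/train.py` may batch its data differently.

```python
def teacher_forcing_pairs(token_ids, max_len, pad_id=0):
    """Expand one tokenized caption into (padded prefix, next token) pairs."""
    pairs = []
    for i in range(1, len(token_ids)):
        prefix = token_ids[:i][-max_len:]
        padded = [pad_id] * (max_len - len(prefix)) + prefix  # left-pad
        pairs.append((padded, token_ids[i]))
    return pairs


# Hypothetical ids: 1 = start marker, 2 = end marker, 5 and 7 = words.
caption = [1, 5, 7, 2]
pairs = teacher_forcing_pairs(caption, max_len=4)
for prefix, target in pairs:
    print(prefix, "->", target)
```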

### 5. Evaluate

```bash
python src/evaluate.py --backbone both
```

Computes BLEU-1 through BLEU-4 on the test set and prints a VGG16 vs VGG19 comparison table.
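
To make the metric concrete, here is a small pure-Python sketch of cumulative BLEU-N with uniform weights and no smoothing. The project lists NLTK for BLEU, so this is not the code used by `src/evaluate.py`, and scores may differ from NLTK's where smoothing applies. The example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(references, hypothesis, max_n=4):
    """Cumulative BLEU-N for one hypothesis against several references."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis, n)
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip each hypothesis n-gram count by its best reference count.
        clipped = sum(min(c, max_ref[g]) for g, c in hyp.items())
        total = sum(hyp.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_sum += math.log(clipped / total) / max_n
    # Brevity penalty against the reference closest in length.
    hyp_len = len(hypothesis)
    ref_len = min((abs(len(r) - hyp_len), len(r)) for r in references)[1]
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_sum)


refs = [["a", "dog", "runs", "on", "grass"]]
hyp = ["a", "dog", "sits", "on", "grass"]
print(round(bleu(refs, hyp, max_n=1), 3))
print(round(bleu(refs, hyp, max_n=2), 3))
```

Four of the five hypothesis words appear in the reference, and two of its four bigrams do, which is why the score drops as `max_n` increases.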

### 6. Generate Captions

```bash
python src/inference.py --image path/to/image.jpg --backbone vgg16
```

### 7. Launch Web App

```bash
streamlit run app.py
```

---

## 📁 Project Structure

```
├── data/                    # Dataset & preprocessed files
├── models/                  # Trained model checkpoints (.h5)
├── outputs/                 # Loss plots, BLEU results
├── src/
│   ├── config.py            # Paths & hyperparameters
│   ├── preprocess.py        # Caption cleaning & tokenization
│   ├── extract_features.py  # VGG feature extraction
│   ├── model.py             # CNN-LSTM architecture
│   ├── train.py             # Training with data generator
│   ├── inference.py         # Greedy & beam search
│   ├── evaluate.py          # BLEU score evaluation
│   └── utils.py             # Shared utilities
├── app.py                   # Streamlit web app
├── requirements.txt         # Dependencies
└── README.md
```

---

## 📊 Results

| Metric  | VGG16  | VGG19  |
|---------|--------|--------|
| BLEU-1  | —      | —      |
| BLEU-2  | —      | —      |
| BLEU-3  | —      | —      |
| BLEU-4  | —      | —      |

> Results will be populated after training and evaluation.

---

## 🛠️ Tech Stack

- **Deep Learning**: TensorFlow / Keras
- **Feature Extraction**: VGG16, VGG19 (ImageNet pretrained)
- **Text Processing**: NLTK, Keras Tokenizer
- **Evaluation**: NLTK BLEU
- **Web App**: Streamlit
- **Dataset**: Flickr8K (roughly 8,000 images, 5 captions each)

---

## 📄 License

This project is for educational purposes.