OmarGamal48812/flickr-captioning

@@ -6,8 +6,10 @@ tags:
   - pytorch
   - resnet
   - attention
-  - lstm
   - flickr8k
   - show-attend-and-tell
 datasets:
   - nlphuji/flickr8k
@@ -20,78 +22,74 @@ library_name: pytorch
 pipeline_tag: image-to-text
 ---
-# Flickr8k Image Captioning — ResNet50 + Bahdanau Attention + LSTM Decoder
 This model generates a natural-language description of an image. It uses a
 **ResNet50** spatial-feature encoder, a **Bahdanau (additive)** attention
-module, and an **LSTM decoder**, trained with teacher forcing and doubly
-stochastic regularization on the **Flickr8k** dataset (8,091 images × 5
-captions). It is the reference architecture from
-[*Show, Attend and Tell* (Xu et al., 2015)](https://arxiv.org/abs/1502.03044).
 ## Test-set performance (beam search, k = 5)
 | Metric | Value |
 |---|---|
-| BLEU-1 | 0.6488 |
-| BLEU-2 | 0.4714 |
-| BLEU-3 | 0.3378 |
-| **BLEU-4** | **0.2403** |
-| METEOR | 0.4270 |
-| CIDEr | 0.6002 |
-| ROUGE-L | 0.4788 |
-Greedy decoding scores: BLEU-4 = 0.2073, METEOR = 0.4119, CIDEr = 0.5322.
-Evaluated on the held-out 1,091-image test split (image-level split — no
-captions cross train/val/test). Beam search uses length-normalized log-probs
-(`alpha = 0.7`) and a repetition penalty of `1.2`.
 ## Architecture
 ```
 Image (3, 224, 224)
-  └─ ResNet50 (pretrained, frozen first 15 epochs, last 2 blocks fine-tuned)
        output: (B, 2048, 7, 7)  → reshape to (B, 49, 2048)
   └─ Bahdanau attention  V·tanh(W_enc(features) + W_dec(h_prev))
        output: context vector (B, 2048), attention weights (B, 49)
-  └─ LSTMCell  (per timestep — re-queries attention each step)
-       hidden state size: 512, embedding size: 256
-  └─ Linear → vocab logits (V = 2,557)
 ```
-Total parameters: **~36 M** (28 M frozen ResNet, 8 M trainable decoder/projection).
 ## Training details
-- **Loss** — `CrossEntropyLoss(ignore_index=0)` plus doubly-stochastic
-  regularization `α_c · ((1 − Σ_t α_t)²).mean()` with `α_c = 1.0`
-- **Optimizer** — Adam, decoder LR `4e-4`, encoder LR `1e-5` (Phase B)
-- **Schedule** — `ReduceLROnPlateau` on val BLEU-4, `factor=0.5`,
-  `patience=3`
-- **Two-phase training** — Phase A (15 epochs): freeze CNN, train decoder
-  only. Phase B (10 epochs): unfreeze last 2 ResNet blocks.
-- **Vocabulary** — 2,557 tokens (frequency threshold 5), built from train
-  captions only. Special tokens: `<pad>=0, <start>=1, <end>=2, <unk>=3`.
-- **Batch size** — 32, gradient clip 5.0
-- **Seed** — 42
 ## Files in this repo
-- `attention_lstm.pth` — PyTorch checkpoint (encoder + decoder state
-  dicts, optimizer state, training config)
 - `vocab.pkl` — pickled `Vocabulary` object built from the train split
 - `config.json` — JSON copy of the training hyperparameters
-- `metrics_beam5.json`, `metrics_greedy.json` — full test-set metrics
 ## Usage
-The cleanest way to use this model is to clone the source repo so the
-`Vocabulary`, encoder, and decoder classes are importable:
 ```bash
-git clone https://github.com/OmarGamal488/flickr8k-image-captioning.git
-cd flickr8k-image-captioning
 uv sync
 ```
@@ -104,7 +102,7 @@ from src.inference import load_attention_model, caption_image
 from src.utils import get_device
 repo_id = "OmarGamal48812/flickr8k-attention-lstm"
-ckpt_path = hf_hub_download(repo_id=repo_id, filename="attention_lstm.pth")
 vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.pkl")
 device = get_device()
@@ -118,36 +116,25 @@ caption, beams = caption_image(
     method="beam", beam_width=5,
 )
 print(caption)
 ```
-For interactive use, the same repo ships a Gradio demo (`app.py`) and a
-FastAPI service (`api/main.py`).
 ## Limitations
-- **Small training set.** Flickr8k has only 6,000 training images, so the
-  model often falls back to "safe" generic captions (e.g. *a dog runs through
-  the grass*) for unfamiliar scenes.
-- **Vocabulary cap.** Words seen fewer than 5 times in the train split
-  collapse to `<unk>`. Rare nouns and proper names are systematically lost.
-- **Domain.** Trained exclusively on Flickr8k photos (mostly people, dogs,
-  outdoor scenes). Performance degrades on cartoons, screenshots, abstract
-  imagery, and any scene type not represented in Flickr8k.
-- **Hallucinations.** Like all autoregressive captioners, the decoder can
-  insert objects that aren't in the image when attention drifts.
-- **English only.** Vocabulary and grammar are entirely English Flickr8k
-  captions.
-## Intended use
-Educational demonstrations of the Show-Attend-Tell architecture and
-research baselines. Not appropriate as the only data source for
-accessibility tooling (alt-text generation should ideally use a model
-trained on a much larger dataset).
 ## Citation
-If you use this checkpoint, please credit the underlying paper:
 ```bibtex
 @inproceedings{xu2015show,
@@ -157,15 +144,19 @@ If you use this checkpoint, please credit the underlying paper:
   booktitle = {ICML},
   year      = {2015}
 }
-```
-and the dataset:
-```bibtex
-@article{hodosh2013framing,
-  title   = {Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics},
-  author  = {Hodosh, Micah and Young, Peter and Hockenmaier, Julia},
-  journal = {Journal of Artificial Intelligence Research},
-  year    = {2013}
 }
 ```

   - pytorch
   - resnet
   - attention
+  - gru
+  - glove
   - flickr8k
+  - flickr30k
   - show-attend-and-tell
 datasets:
   - nlphuji/flickr8k
 pipeline_tag: image-to-text
 ---
+# Flickr Image Captioning — ResNet50 + Bahdanau Attention + GRU + GloVe
 This model generates a natural-language description of an image. It uses a
 **ResNet50** spatial-feature encoder, a **Bahdanau (additive)** attention
+module, and a **GRU decoder** initialized with **GloVe 6B 300d** embeddings,
+trained on the merged **Flickr8k + Flickr30k** dataset (39,874 images × 5
+captions). It follows the architecture from
+[*Show, Attend and Tell* (Xu et al., 2015)](https://arxiv.org/abs/1502.03044)
+with label smoothing, scheduled sampling, and two-phase CNN fine-tuning.
 ## Test-set performance (beam search, k = 5)
+Evaluated on the held-out 1,873-image test split (image-level split — no
+captions cross train/val/test).
 | Metric | Value |
 |---|---|
+| BLEU-1 | 0.6859 |
+| BLEU-2 | 0.5289 |
+| BLEU-3 | 0.4041 |
+| **BLEU-4** | **0.3093** |
+| METEOR | 0.4709 |
+| CIDEr | 0.7961 |
+| ROUGE-L | 0.5257 |
+Beam search uses length-normalized log-probs (`alpha = 0.7`) and a
+repetition penalty of `1.2`.
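
The card gives the beam-search settings (`alpha = 0.7`, repetition penalty `1.2`) but not the exact formulas. The sketch below shows one common way such settings are applied. It is an illustration under stated assumptions, not the repository's implementation: the score is assumed to be the summed log-probability divided by `length ** alpha`, and the penalty is assumed to scale the log-probability of tokens already emitted. The function names are hypothetical.

```python
ALPHA = 0.7               # length-normalization exponent quoted in the card
REPETITION_PENALTY = 1.2  # repetition penalty quoted in the card


def normalized_score(sum_logprob: float, length: int, alpha: float = ALPHA) -> float:
    """Assumed form: summed token log-probs divided by length**alpha."""
    return sum_logprob / (length ** alpha)


def penalize_repeats(logprobs: dict, history: list,
                     penalty: float = REPETITION_PENALTY) -> dict:
    """Assumed form: log-probs are <= 0, so multiplying by penalty > 1
    makes tokens that were already emitted less likely."""
    seen = set(history)
    return {tok: lp * penalty if tok in seen else lp
            for tok, lp in logprobs.items()}


# A longer caption with a lower raw score can outrank a shorter one
# once the score is normalized by length.
short = normalized_score(-4.0, length=4)
longer = normalized_score(-6.0, length=8)
print(short, longer)
```

Without normalization the 4-token hypothesis would win on raw log-probability alone, which is the usual reason beam search drifts toward short captions.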
 ## Architecture
 ```
 Image (3, 224, 224)
+  └─ ResNet50 (pretrained, frozen first 10 epochs, last 2 blocks fine-tuned)
        output: (B, 2048, 7, 7)  → reshape to (B, 49, 2048)
   └─ Bahdanau attention  V·tanh(W_enc(features) + W_dec(h_prev))
        output: context vector (B, 2048), attention weights (B, 49)
+  └─ GRUCell  (per timestep — re-queries attention each step)
+       hidden state size: 1024, embedding size: 300 (GloVe 6B 300d)
+  └─ Linear → vocab logits (V = 10,111)
 ```
+Total parameters: **~37 M** (25 M frozen ResNet, 12 M trainable decoder/projection).
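
Written out, the attention line in the diagram corresponds to the standard additive (Bahdanau) formulation. Here $a_i$ is the $i$-th of the 49 encoder feature vectors and $h_{t-1}$ is the previous GRU hidden state. The symbols $v$, $W_{\text{enc}}$ and $W_{\text{dec}}$ mirror the diagram. The softmax normalization and the weighted sum are assumed from the standard formulation and from the output shapes listed above, not taken from the repository code.

```latex
\begin{aligned}
e_{t,i} &= v^{\top} \tanh\!\left(W_{\text{enc}}\, a_i + W_{\text{dec}}\, h_{t-1}\right), \quad i = 1, \dots, 49 \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{49} \exp(e_{t,j})} \\
c_t &= \sum_{i=1}^{49} \alpha_{t,i}\, a_i \;\in\; \mathbb{R}^{2048}
\end{aligned}
```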
 ## Training details
+- **Dataset** — Flickr8k + Flickr30k merged (37,000 train / 1,000 val / 1,873 test)
+- **Vocabulary** — 10,111 tokens (frequency threshold 3), built from train
+  captions only. Special tokens: `<pad>=0, <start>=1, <end>=2, <unk>=3`.
+- **Loss** — `CrossEntropyLoss(ignore_index=0, label_smoothing=0.1)` plus
+  doubly-stochastic regularization `α_c · ((1 − Σ_t α_t)²).mean()` with `α_c = 1.0`
+- **Optimizer** — Adam, decoder LR `3.2e-3`, encoder LR `8e-5` (Phase B)
+- **Schedule** — `ReduceLROnPlateau` on val BLEU-4, `factor=0.5`, `patience=3`
+- **Two-phase training** — Phase A (epochs 1–10): freeze CNN. Phase B (epochs 11–35): unfreeze last 2 ResNet blocks.
+- **Scheduled sampling** — sampling probability ramps linearly from 0 to a maximum of 0.25 over the training epochs
+- **Batch size** — 256, gradient clip 5.0, seed 42
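
Two of the items above reduce to short formulas, sketched here in plain Python for a single caption. This is a hedged illustration: the function names are hypothetical, and the ramp endpoints (0 at epoch 1, the maximum at epoch 35) are an assumption, since the card only says the ramp is linear. In PyTorch, assuming attention weights `alphas` of shape `(B, T, 49)`, the penalty would typically be `alpha_c * ((1.0 - alphas.sum(dim=1)) ** 2).mean()`.

```python
def doubly_stochastic_penalty(alphas, alpha_c=1.0):
    """alphas[t][i] is the attention weight on region i at decoding step t.

    Returns alpha_c * mean over regions of (1 - sum_t alphas[t][i]) ** 2,
    which nudges every region to receive about one unit of total attention.
    """
    num_regions = len(alphas[0])
    totals = [sum(step[i] for step in alphas) for i in range(num_regions)]
    return alpha_c * sum((1.0 - s) ** 2 for s in totals) / num_regions


def sampling_prob(epoch, total_epochs=35, max_prob=0.25):
    """Linear ramp of the scheduled-sampling probability (assumed endpoints)."""
    if total_epochs <= 1:
        return max_prob
    frac = (epoch - 1) / (total_epochs - 1)
    return max_prob * min(max(frac, 0.0), 1.0)


# Attention spread evenly over two steps and two regions incurs no penalty;
# attention stuck on one region does.
print(doubly_stochastic_penalty([[0.5, 0.5], [0.5, 0.5]]))
print(doubly_stochastic_penalty([[1.0, 0.0], [1.0, 0.0]]))
print(sampling_prob(18))
```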
 ## Files in this repo
+- `attention_gru_glove.pth` — PyTorch checkpoint (encoder + decoder state dicts, config)
 - `vocab.pkl` — pickled `Vocabulary` object built from the train split
 - `config.json` — JSON copy of the training hyperparameters
+- `metrics_beam5.json` — full test-set metrics (beam search k=5)
 ## Usage
 ```bash
+git clone https://github.com/OmarGamal488/flickr-image-captioning.git
+cd flickr-image-captioning
 uv sync
 ```
 from src.utils import get_device
 repo_id = "OmarGamal48812/flickr8k-attention-lstm"
+ckpt_path = hf_hub_download(repo_id=repo_id, filename="attention_gru_glove.pth")
 vocab_path = hf_hub_download(repo_id=repo_id, filename="vocab.pkl")
 device = get_device()
     method="beam", beam_width=5,
 )
 print(caption)
+for b in beams[:3]:
+    print(f"  {b.score:+.3f}  {b.caption}")
 ```
 ## Limitations
+- **Domain.** Trained on Flickr8k + Flickr30k photos (mostly people, dogs,
+  outdoor scenes). Performance degrades on cartoons, screenshots, and abstract imagery.
+- **Safe-word bias.** Only 8.8% of the 10,111-word vocabulary is used at inference —
+  the decoder converges on template phrases like *"a man in a white shirt is standing"*.
+- **No object counting.** The attention context vector collapses object count —
+  the model often says "a dog" when the image shows two dogs.
+- **Hallucinations.** The decoder can insert objects not in the image when visual
+  evidence is weak and the language-model prior takes over.
+- **English only.** Vocabulary and grammar are entirely from English Flickr captions.
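
The safe-word bias figure can be checked with a simple count of distinct words in the generated captions. The sketch below is one way to do that and makes several assumptions: captions are whitespace-tokenized, special tokens are excluded, and the helper name `vocab_usage` is hypothetical. The card does not say how its 8.8% figure was computed.

```python
def vocab_usage(captions, vocab_size,
                special=("<pad>", "<start>", "<end>", "<unk>")):
    """Fraction of the vocabulary that appears in a list of generated captions."""
    used = set()
    for caption in captions:
        used.update(tok for tok in caption.lower().split() if tok not in special)
    return len(used) / vocab_size


# Toy example: four distinct words out of a ten-word vocabulary.
print(vocab_usage(["a dog runs", "a man runs"], vocab_size=10))
```

For scale, 8.8% of 10,111 tokens is roughly 890 distinct words.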
 ## Citation
+If you use this checkpoint, please cite the three papers this work builds on:
 ```bibtex
 @inproceedings{xu2015show,
   booktitle = {ICML},
   year      = {2015}
 }
+@article{bahdanau2014neural,
+  title   = {Neural Machine Translation by Jointly Learning to Align and Translate},
+  author  = {Bahdanau, Dzmitry and Cho, Kyunghyun and Bengio, Yoshua},
+  journal = {arXiv preprint arXiv:1409.0473},
+  year    = {2014}
+}
+@inproceedings{selvaraju2017gradcam,
+  title     = {Grad-{CAM}: Visual Explanations from Deep Networks via Gradient-based Localization},
+  author    = {Selvaraju, Ramprasaath R. and Cogswell, Michael and Das, Abhishek and
+               Vedantam, Ramakrishna and Parikh, Devi and Batra, Dhruv},
+  booktitle = {ICCV},
+  year      = {2017}
 }
 ```