Commit · 2ab9a5b
Parent(s): 3a2e5f0
docs: professionalize repository presentation and architecture documentation
README.md
CHANGED
@@ -1,209 +1,388 @@

Removed lines (previous README.md; only these fragments survive in this view):

- COCO dataset integration
- Transformer-based caption generation
- GPU-enabled execution
- Leveraged **InceptionV3** for visual feature extraction
- Implemented **attention-based sequence generation**
- Achieved improved caption quality using **BLEU evaluation**
- Compared multiple CNN backbones (VGG, ResNet, Inception)
- **InceptionV3 (CNN)**
- Extracts high-level spatial features from images
- Converts image → feature vector
- **Generated Caption:** `a man is standing on a beach with a surfboard`
- **Generated Caption:** `a man riding a motorcycle on a street`
- <img width="832" height="857" alt="image" src="https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd" />
- Generates **semantically meaningful captions**
- Performs well on **common objects and scenes**
- Slight limitations on **complex multi-object scenes**
- 80 object categories
- Multiple captions per image
- Rich annotations for training
- TensorFlow / Keras
- CNN (InceptionV3)
- Transformer Architecture
- NumPy, Pandas
- Matplotlib
- Jupyter Notebook
- May generate generic captions for rare objects
- Requires large datasets and compute for training
- AI Systems Engineer | Deep Learning | ML Infrastructure

Added lines (new README.md):

# Image Captioning System

> CNN + Transformer architecture for visual-to-language generation, restructured from an IEEE-published research notebook into a production-style multimodal AI codebase.

<p align="left">
  <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-3776AB?logo=python&logoColor=white">
  <img alt="TensorFlow 2.15" src="https://img.shields.io/badge/TensorFlow-2.15-FF6F00?logo=tensorflow&logoColor=white">
  <img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white">
  <img alt="FastAPI ready" src="https://img.shields.io/badge/FastAPI-ready-009688?logo=fastapi&logoColor=white">
</p>

<p align="left">
  <img alt="Ruff" src="https://img.shields.io/badge/lint-ruff-261230?logo=ruff&logoColor=white">
  <img alt="mypy" src="https://img.shields.io/badge/typed-mypy-1F5082">
  <img alt="Tests" src="https://img.shields.io/badge/tests-37%20passing-brightgreen">
  <img alt="Pre-commit" src="https://img.shields.io/badge/pre--commit-enabled-FAB040?logo=pre-commit&logoColor=white">
</p>

<p align="left">
  <img alt="IEEE Published" src="https://img.shields.io/badge/IEEE-published-00629B?logo=ieee&logoColor=white">
  <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-lightgrey">
  <img alt="Status" src="https://img.shields.io/badge/status-Phase%201%20complete-blue">
</p>

---

## Overview

This repository implements an **end-to-end image-captioning pipeline** built around an InceptionV3 visual encoder and a custom multi-head Transformer decoder. The architecture is the basis of the IEEE-published paper *“AI Narratives: Bridging Visual Content and Linguistic Expression”*; this codebase lifts the original Kaggle research notebook into a typed, tested, configuration-driven Python package that can be reused from the CLI, scripts, or a future serving layer.

The repository is structured in deliberate phases:

| Phase | Focus | Status |
|---|---|---|
| 0 - Bootstrap | Tooling, packaging, freeze policy | ✅ complete |
| 1 - Modularisation | Notebook → typed Python package, parity audit, unit tests | ✅ complete |
| 2 - Serving | FastAPI inference API + HuggingFace Spaces deploy | ⏳ planned |
| 3 - Multimodal baselines | BLIP / ViT-GPT2 / GIT side-by-side comparison | ⏳ planned |
| 4 - Observability | Sentry, Prometheus metrics, ADRs | ⏳ planned |

Phase notes live under [`docs/`](docs/): [restructure plan](docs/restructure-plan.md) · [Phase 0 notes](docs/PHASE_0_NOTES.md) · [Phase 1 notes](docs/PHASE_1_NOTES.md).

---

## Research backing

The model architecture and the BLEU-4 ~24 baseline below come from the IEEE paper and its accompanying notebook:

- **Paper:** [AI Narratives: Bridging Visual Content and Linguistic Expression](https://ieeexplore.ieee.org/document/10675203) (IEEE)
- **Original notebook:** [Kaggle: image-captioning-using-dl](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
- **Frozen artefact in this repo:** [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](notebooks/01_ieee_inceptionv3_transformer.ipynb), byte-stable; CI enforces its SHA-256.

The notebook is preserved verbatim as the canonical research artefact. Improvements happen in the modular package; the notebook itself never changes.
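
A freeze check of this kind can be as small as hashing the notebook and comparing the result with a stored digest. The sketch below is a stdlib illustration only, not the repository's CI script: `sha256_of_file` and `is_frozen` are hypothetical names, and the lock-file layout (hex digest as the first whitespace-separated token, as `sha256sum` prints it) is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_frozen(artefact: Path, lock_file: Path) -> bool:
    """True when the artefact's digest equals the digest stored in lock_file.

    Assumption: the lock file holds the hex digest as its first token.
    """
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    return sha256_of_file(artefact) == expected
```

A CI job could call `is_frozen(...)` on the notebook and the lock file and fail the build when it returns `False`.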

---

## Architecture

```
┌──────────────┐   ┌─────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌────────────┐
│ Input image  │──▶│  InceptionV3    │──▶│   Transformer    │──▶│   Transformer    │──▶│  Caption   │
│  299x299x3   │   │  encoder        │   │   encoder        │   │   decoder        │   │  string    │
└──────────────┘   │  (ImageNet,     │   │   (1 layer,      │   │   (2 layers,     │   └────────────┘
                   │   frozen)       │   │    1 head)       │   │    8 heads)      │
                   └─────────────────┘   └──────────────────┘   └──────────────────┘
                            ▼                     ▼                      ▼
                      [B, 64, 2048]         [B, 64, 512]           [B, T, vocab]
                     patch features      projected features   softmax over 15k tokens
```

### Components

- **CNN encoder**: [`models/encoder_cnn.py`](src/captioning/models/encoder_cnn.py). Pretrained InceptionV3 with the classification head removed; output reshaped to a sequence of 64 spatial positions × 2048 channels. Weights are frozen during training.
- **Transformer encoder**: [`models/transformer_encoder.py`](src/captioning/models/transformer_encoder.py). Single layer with one attention head. Projects InceptionV3 features into the decoder’s embedding dimension and lets the decoder attend across spatial positions.
- **Embeddings**: [`models/embeddings.py`](src/captioning/models/embeddings.py). Sum of token and *learned* positional embeddings (not sinusoidal; preserved from the published architecture).
- **Transformer decoder**: [`models/transformer_decoder.py`](src/captioning/models/transformer_decoder.py). Causal self-attention over partial captions, cross-attention over image features, and a feed-forward sub-block. 8 heads, `embedding_dim=512`, dropouts (0.1 / 0.3 / 0.5) preserved from the IEEE configuration.
- **Captioning model**: [`models/captioning_model.py`](src/captioning/models/captioning_model.py). Custom `train_step` / `test_step` with masked sparse-categorical cross-entropy and masked accuracy.
- **Tokenizer**: [`preprocessing/tokenizer.py`](src/captioning/preprocessing/tokenizer.py). `CaptionTokenizer` wraps `tf.keras.layers.TextVectorization`; persists the vocabulary as both pickle (notebook-compatible) and JSON sidecar.
- **Inference**: [`inference/predictor.py`](src/captioning/inference/predictor.py). `CaptionPredictor.from_artifacts(weights_path, tokenizer_dir, config)` loads everything once at boot, exposes `predict_path(...)` and `predict_tensor(...)` for stateless calls, and `warmup()` to avoid first-request latency.
- **Configuration**: [`config/schema.py`](src/captioning/config/schema.py). Pydantic v2 schemas (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`); strict (`extra="forbid"`) so typos in YAML or env vars become load-time errors instead of silent drift.
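
The masked cross-entropy and masked accuracy named for the captioning model can be illustrated without TensorFlow. The following is a plain-Python sketch of the idea, not the code in `captioning_model.py`: it assumes token id 0 marks padding (the `TextVectorization` default) and that both metrics are averaged over non-padding positions; `masked_loss_and_accuracy` is a hypothetical name.

```python
import math
from typing import Sequence

PAD_ID = 0  # assumption: id 0 is the padding token


def masked_loss_and_accuracy(
    targets: Sequence[int],
    probabilities: Sequence[Sequence[float]],
) -> tuple[float, float]:
    """Mean cross-entropy and accuracy over non-padding time steps.

    targets[t] is the true token id at step t; probabilities[t] is the
    predicted distribution over the vocabulary at step t.
    """
    total_loss = 0.0
    correct = 0
    counted = 0
    for target, probs in zip(targets, probabilities):
        if target == PAD_ID:
            continue  # padded steps contribute to neither metric
        total_loss += -math.log(max(probs[target], 1e-12))
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == target)
        counted += 1
    if counted == 0:
        return 0.0, 0.0
    return total_loss / counted, correct / counted


# Two real tokens followed by two padded steps; only the first two count.
loss, accuracy = masked_loss_and_accuracy(
    [2, 1, 0, 0],
    [[0.1, 0.2, 0.7], [0.1, 0.6, 0.3], [0.9, 0.05, 0.05], [0.9, 0.05, 0.05]],
)
```

Without the mask, the padded tail of every short caption would dominate both numbers, which is why captioning models typically mask before averaging.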

---

## Sample outputs

| Image | Generated caption |
|---|---|
| ![Surfboard scene](https://github.com/user-attachments/assets/5e46a76b-58d3-4d2e-b5a6-d6f0051a1f33) | *a man is standing on a beach with a surfboard* |
| ![Motorcycle scene](https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd) | *a man riding a motorcycle on a street* |

Outputs above are from the IEEE notebook; the modular pipeline reproduces these via the parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)).

---

## Performance

| Metric | Value | Source |
|---|---|---|
| BLEU-4 | ~24 | Reported in the IEEE paper / Kaggle notebook |
| Vocabulary size | 15,000 tokens | TextVectorization adapt over preprocessed COCO captions |
| Training set | ~120k captions sampled from COCO 2017 | `data.sample_size` in [`configs/base.yaml`](configs/base.yaml) |
| Image resolution | 299 × 299 (InceptionV3) | [`preprocessing/image.py`](src/captioning/preprocessing/image.py) |
| Max caption length | 40 tokens | `model.max_length` in [`configs/base.yaml`](configs/base.yaml) |

> Re-training on the modular pipeline is a Phase 2 deliverable; once a fresh checkpoint exists, this table will be expanded with corpus BLEU-1..4, CIDEr, METEOR, and ROUGE-L (the latter three are planned for [`evaluation/`](src/captioning/evaluation/) in Phase 1b).

---

## Project structure

```
image-captioning-system/
├── notebooks/
│   ├── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN: IEEE research artefact
│   └── README.md                               # Frozen-notebook policy
│
├── src/captioning/                             # Installable package
│   ├── config/         schema.py · loader.py
│   ├── preprocessing/  caption.py · image.py · tokenizer.py · augmentation.py
│   ├── data/           coco.py · splits.py · pipeline.py
│   ├── models/         encoder_cnn.py · transformer_encoder.py · embeddings.py
│   │                   transformer_decoder.py · captioning_model.py · factory.py
│   ├── training/       losses.py · callbacks.py · trainer.py
│   ├── inference/      image_loader.py · greedy.py · predictor.py
│   ├── evaluation/     bleu.py
│   └── utils/          logging.py · seed.py · hashing.py
│
├── configs/
│   ├── base.yaml                               # IEEE hyperparameters (cell 6 mirror)
│   └── train/debug.yaml                        # CI smoke override
│
├── scripts/
│   ├── train.py · evaluate.py · predict.py
│   └── notebook_module_audit.py                # Parity gate vs. notebook
│
├── tests/unit/
│   ├── test_caption_preprocessing.py · test_config.py · test_splits.py
│   ├── test_tokenizer.py · test_image_preprocessing.py
│   ├── test_evaluation.py · test_hashing.py
│   └── conftest.py
│
├── docs/
│   └── restructure-plan.md · PHASE_0_NOTES.md · PHASE_1_NOTES.md
│
├── pyproject.toml · requirements*.txt · Makefile
├── .pre-commit-config.yaml · .python-version · .env.example
├── .paper-notebook.sha256                      # Locked notebook hash for CI freeze check
└── README.md
```

---

## Setup

Requires **Python 3.10–3.12** (TensorFlow 2.15 has no 3.13 wheels).

### PowerShell (Windows)

```powershell
py -3.10 -m venv .venv
.venv\Scripts\activate
pip install -r requirements-dev.txt -r requirements-eval.txt
pip install -e ".[hf,mlflow]"
pre-commit install
```

### bash (Linux / macOS)

```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt -r requirements-eval.txt
pip install -e ".[hf,mlflow]"
pre-commit install
```

`make help` lists every available command (lint, format, type-check, test, train, serve, evaluate, predict, Docker, freeze-paper-notebook, ...).

---

## Training

The training script consumes a YAML config validated by Pydantic:

```bash
python -m scripts.train --config configs/base.yaml
```

Override fields without editing YAML:

```bash
# CLI smoke run on a 64-caption subset (1 epoch, batch 8)
python -m scripts.train --config configs/base.yaml --override configs/train/debug.yaml

# Env-var override (double-underscore = nesting delimiter)
CAPTIONING__TRAIN__BATCH_SIZE=32 python -m scripts.train --config configs/base.yaml
```

Outputs (`weights.h5`, `vocab.pkl` + `vocab.json` sidecar, `history.json`, `training_log.csv`) land under `outputs/runs/latest/` by default.

The `Trainer` ([`training/trainer.py`](src/captioning/training/trainer.py)) wraps `model.compile` + `model.fit` with structured logging and history serialisation; everything else (loss, callbacks, optimizer choice) sits in dedicated modules so each piece can be unit-tested in isolation.
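
The double-underscore convention used in the env-var override above maps a flat variable name onto a nested config path. In this project that mapping is handled by the Pydantic settings layer; the sketch below only illustrates the idea in plain Python. `apply_env_overrides` is a hypothetical helper, and values are left as strings on the assumption that a schema validates and converts them afterwards.

```python
import copy
from typing import Any, Mapping

PREFIX = "CAPTIONING__"  # prefix and "__" delimiter as in the command above


def apply_env_overrides(
    config: Mapping[str, Any], environ: Mapping[str, str]
) -> dict[str, Any]:
    """Return a copy of config with CAPTIONING__A__B=value written to ["a"]["b"]."""
    merged = copy.deepcopy(dict(config))
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        path = name[len(PREFIX):].lower().split("__")
        node = merged
        for key in path[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                child = {}  # create (or replace a scalar with) a nested section
                node[key] = child
            node = child
        node[path[-1]] = value
    return merged


base = {"train": {"batch_size": 64, "epochs": 10}}
overridden = apply_env_overrides(base, {"CAPTIONING__TRAIN__BATCH_SIZE": "32"})
```

Because the function works on a deep copy, the base configuration loaded from YAML stays untouched and the override is visible only in the returned dictionary.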

---

## Evaluation

```bash
python -m scripts.evaluate \
    --config configs/base.yaml \
    --weights models/v1.0.0/model.h5 \
    --tokenizer-dir models/v1.0.0 \
    --report docs/results/v1.0.0.md \
    --max-samples 500
```

Phase 1 ships **corpus BLEU-4 via sacrebleu** (deterministic, reproducible). CIDEr / METEOR / ROUGE-L slot into [`src/captioning/evaluation/`](src/captioning/evaluation/) in Phase 1b under the same runner interface.
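
For readers unfamiliar with the metric, corpus BLEU pools clipped n-gram matches over the whole test set before taking the geometric mean and applying a brevity penalty. The sketch below is a simplified, unsmoothed version with whitespace tokenisation; it is not sacrebleu and will not reproduce sacrebleu's scores, since sacrebleu applies its own tokenisation and smoothing. The function name `corpus_bleu` is illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def _ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(
    hypotheses: Sequence[str],
    references: Sequence[Sequence[str]],
    max_n: int = 4,
) -> float:
    """Unsmoothed corpus BLEU on a 0-100 scale.

    references[i] lists every reference caption for hypotheses[i].
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = [ref.split() for ref in refs]
        hyp_len += len(hyp_tokens)
        # Closest reference length; ties go to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp_tokens)), len(r)) for r in ref_tokens)[1]
        for n in range(1, max_n + 1):
            hyp_counts = _ngrams(hyp_tokens, n)
            max_ref: Counter = Counter()
            for r in ref_tokens:
                max_ref |= _ngrams(r, n)  # element-wise max across references
            matches[n - 1] += sum((hyp_counts & max_ref).values())  # clipped counts
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed: one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Pooling counts at corpus level, rather than averaging per-sentence scores, is what makes the number stable on short captions where a single sentence often has no matching 4-gram.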

---

## Inference

### Python API

```python
from captioning.config import load_config
from captioning.inference import CaptionPredictor

config = load_config("configs/base.yaml")
predictor = CaptionPredictor.from_artifacts(
    weights_path="models/v1.0.0/model.h5",
    tokenizer_dir="models/v1.0.0",
    config=config,
)
predictor.warmup()  # one dummy forward pass; avoids first-request latency
caption = predictor.predict_path("photo.jpg")
print(caption)
```
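
Caption generation in this release is greedy: at each step the most probable next token is appended until an end marker or the length limit is reached. The loop below is a framework-free sketch of that strategy, not the contents of `inference/greedy.py`; the `<start>` / `<end>` markers, the `next_token_scores` callback, and the toy score table are all hypothetical stand-ins for the real decoder.

```python
from typing import Callable, Sequence

START, END = "<start>", "<end>"  # hypothetical marker tokens


def greedy_decode(
    next_token_scores: Callable[[Sequence[str]], dict[str, float]],
    max_length: int = 40,
) -> str:
    """Append the highest-scoring token each step until END or max_length."""
    tokens = [START]
    for _ in range(max_length - 1):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)
        if best == END:
            break
        tokens.append(best)
    return " ".join(tokens[1:])  # drop the start marker


# Toy stand-in for the decoder: scores depend only on the previous token.
TOY_TABLE = {
    "<start>": {"a": 0.9, "man": 0.1},
    "a": {"man": 0.6, "street": 0.4},
    "man": {"riding": 0.7, "<end>": 0.3},
    "riding": {"<end>": 0.8, "a": 0.2},
}


def toy_scores(prefix: Sequence[str]) -> dict[str, float]:
    return TOY_TABLE[prefix[-1]]


caption = greedy_decode(toy_scores)
```

Greedy decoding commits to one token per step and never revisits it, which keeps inference cheap but can miss captions that a beam search would find.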

### CLI

```bash
python -m scripts.predict \
    --config configs/base.yaml \
    --weights models/v1.0.0/model.h5 \
    --tokenizer-dir models/v1.0.0 \
    --image samples/photo.jpg
```

### REST API (planned, Phase 2)

A FastAPI service in `backend/app/` will expose `POST /v1/captions` (multipart upload), `GET /v1/model/info`, and `GET /healthz`, deployed to HuggingFace Spaces with the trained checkpoint pulled from the HuggingFace Hub at boot. The `CaptionPredictor` interface is the seam: the FastAPI lifespan instantiates one predictor and reuses it across every request.

---

## Configuration system

Hyperparameters are not globals. They live in YAML files validated by Pydantic v2 `BaseSettings` (which ships in the separate `pydantic-settings` package in v2):

```yaml
# configs/base.yaml: mirrors the IEEE notebook cell 6 verbatim
model:
  embedding_dim: 512
  units: 512
  max_length: 40
  vocabulary_size: 15000
  decoder_num_heads: 8
  decoder_dropout_inner: 0.3
  decoder_dropout_outer: 0.5
  decoder_attention_dropout: 0.1
train:
  epochs: 10
  batch_size: 64
  early_stopping_patience: 3
  seed: 42
data:
  sample_size: 120000
  train_val_split: 0.8
```

Three load-time guarantees:

1. **Type validation.** A wrongly typed value such as `batch_size: "64"` (string instead of int; rejected when the field is validated in strict mode, since Pydantic's default lax mode coerces numeric strings) raises a `ValidationError` pointing at the field, not a downstream tensor-shape error.
2. **No silent typos.** `extra="forbid"` rejects unknown keys (e.g. `vocabularsy_size`). A mistyped hyperparameter that silently falls back to its default is one of the hardest failures to notice, and `extra="forbid"` rules it out.
3. **Env overrides.** `CAPTIONING__TRAIN__BATCH_SIZE=32` overrides at any nesting depth, which is useful for CI smoke tests, ablations, and serve-time tuning without rebuilding images.

Schema lives in [`src/captioning/config/schema.py`](src/captioning/config/schema.py); loader in [`config/loader.py`](src/captioning/config/loader.py).
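
The first two guarantees can be shown in miniature with the standard library alone. The sketch below deliberately swaps Pydantic for a frozen dataclass plus a hand-written check, so it illustrates the behaviour (unknown keys and wrong types fail at load time) rather than the repository's schema; `TrainSettings` and `load_strict` are hypothetical names, and the field defaults are copied from the `train` section above.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class TrainSettings:
    """Hypothetical stand-in for the train section of the config."""

    epochs: int = 10
    batch_size: int = 64
    early_stopping_patience: int = 3
    seed: int = 42


def load_strict(raw: Mapping[str, Any]) -> TrainSettings:
    """Build TrainSettings, rejecting unknown keys and non-int values."""
    allowed = {field.name for field in fields(TrainSettings)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    for key, value in raw.items():
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{key} must be an int, got {type(value).__name__}")
    return TrainSettings(**raw)


settings = load_strict({"batch_size": 32})
```

With this loader, `{"batch_sise": 32}` raises instead of silently training with the default batch size of 64, which is the failure mode the `extra="forbid"` setting exists to prevent.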

---

## Testing & code quality

```bash
make test                    # pytest 37/37 (unit + integration)
make lint                    # Ruff lint + format check
make typecheck               # mypy strict on src/captioning + scripts
make pre-commit              # All hooks across all files
make freeze-paper-notebook   # Asserts notebook SHA-256 unchanged
```

| Layer | Tool | Status |
|---|---|---|
| Lint + format | [Ruff](https://docs.astral.sh/ruff/) (replaces black + isort + flake8) | ✅ clean |
| Type-check | [mypy](https://mypy.readthedocs.io/) with `pandas-stubs`, `types-PyYAML`, `types-requests` | ✅ 0 errors / 34 files |
| Tests | pytest + pytest-cov + pytest-asyncio | ✅ 37 passing |
| Notebook hygiene | [`nbstripout`](https://github.com/kynan/nbstripout) (pre-commit) | ✅ outputs stripped on commit |
| Secret scanning | [`gitleaks`](https://github.com/gitleaks/gitleaks) (pre-commit) | ✅ enabled |
| Notebook integrity | SHA-256 freeze check via [`make freeze-paper-notebook`](Makefile) | ✅ locked |
| Parity audit | [`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py) (4 stages) | ✅ all passing |

The parity audit re-implements four notebook stages inline (caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, decoder forward pass) and asserts the modular path produces byte-identical (or `tf.allclose`-identical) output. It is the contract that gates any behavioural improvement.
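
The shape of such a parity gate is simple: run the inline reference and the modular implementation on the same inputs and collect disagreements, comparing exactly for discrete outputs and within a tolerance for floats. The sketch below shows that pattern in plain Python; it is not `notebook_module_audit.py`, and `notebook_clean` / `module_clean` are hypothetical stand-ins for one audited stage.

```python
import math
from typing import Any, Callable, Iterable


def outputs_match(expected: Any, actual: Any, rel_tol: float = 1e-6, abs_tol: float = 1e-8) -> bool:
    """Tolerance-based comparison for floats, exact equality for everything else."""
    if isinstance(expected, float) and isinstance(actual, float):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol, abs_tol) for e, a in zip(expected, actual)
        )
    return expected == actual


def parity_failures(
    reference: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list:
    """Return the inputs on which the two implementations disagree."""
    return [x for x in inputs if not outputs_match(reference(x), candidate(x))]


def notebook_clean(caption: str) -> str:
    """Hypothetical inline notebook logic."""
    return "[start] " + " ".join(caption.lower().split()) + " [end]"


def module_clean(caption: str) -> str:
    """Hypothetical modular rewrite of the same stage."""
    return " ".join(["[start]", *caption.lower().split(), "[end]"])


failures = parity_failures(notebook_clean, module_clean, ["A man  riding", ""])
```

Here the two toy implementations agree on ordinary captions but differ on the empty string, where the notebook-style version emits a double space. Edge cases like that are what a parity gate is meant to surface before a refactor lands.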

---

## Key engineering improvements

This is what separates this repository from a notebook conversion:

- **Modular package** with the `src/` layout: every test exercises the *installed* package the same way users will.
- **Strict Pydantic v2 configuration**: typed, validated, env-overridable, refuses unknown keys.
- **`CaptionTokenizer` wrapper**: stable interface for the model and inference; Phase 5 can swap it for HuggingFace `tokenizers` without touching the encoder, decoder, or generation loop.
- **Singleton-friendly inference**: `CaptionPredictor.from_artifacts(...)` + `warmup()` are designed for FastAPI lifespans, not just CLI calls.
- **Shared train/serve preprocessing**: the same `preprocess_image_tensor` runs in `tf.data` pipelines and at inference time, eliminating train/serve skew by construction.
- **Reproducibility**: seeded sampling, seeded splits, seeded RNGs (`utils.seed.set_global_seed`), pinned `tensorflow-cpu==2.15.0` (TF 2.16+ ships Keras 3 by default and silently breaks `TextVectorization` save/load).
- **Notebook freeze**: IEEE artefact protected by a SHA-256 check; published BLEU stays reproducible across the project's lifetime.
- **Optional dependency groups** (`[hf]`, `[eval]`, `[mlflow]`, `[dev]`): slim production image stays lean; HF baselines and metric tooling are opt-in extras.
- **Decoupled experiment artefacts**: model weights live in HuggingFace Hub (planned), MLflow tracking on DagsHub free tier (planned). Git stays small.
- **Structured logging**: `structlog` emits JSON in production, pretty colourised logs in dev, switched by `APP_ENV`.
- **No silent rewrites**: every notebook → module move is documented with a cell mapping in [`docs/PHASE_1_NOTES.md`](docs/PHASE_1_NOTES.md); behavioural quirks (e.g. `compute_loss_and_acc` ignoring its `training` argument) are preserved verbatim with code comments referencing the doc.
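
The seeded splits mentioned under reproducibility come down to shuffling with a private, explicitly seeded random generator. The sketch below illustrates this with the stdlib and is not the code in `data/splits.py`; `seeded_split` is a hypothetical name, while the 0.8 fraction and seed 42 are the values from `configs/base.yaml`.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def seeded_split(
    items: Sequence[T], train_fraction: float = 0.8, seed: int = 42
) -> tuple[list[T], list[T]]:
    """Shuffle a copy with a private RNG and cut it at train_fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # does not touch the global RNG state
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_ids, val_ids = seeded_split(list(range(10)))
```

Using `random.Random(seed)` instead of `random.seed(seed)` keeps the split independent of any other code that draws from the global generator, so the same inputs give the same partition on every run.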

---

## Limitations

- The model produces generic captions on cluttered or rare-object scenes. This is a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
- Greedy decoding only; beam search is a Phase 1b addition.
- Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
- BLEU is the only metric in v1; CIDEr / METEOR / ROUGE-L slot into the same runner interface in Phase 1b.

These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).

---

## Roadmap

- **Phase 1b**: beam search, CIDEr / METEOR / ROUGE-L, masked accuracy parity-fix, label smoothing, warmup + cosine LR schedule.
- **Phase 2**: FastAPI backend, HuggingFace Hub model upload, HuggingFace Spaces deploy, Vercel-hosted frontend (Next.js 14), GitHub Actions CI/CD.
- **Phase 3**: Tier-1 multimodal upgrades: BLIP-base / ViT-GPT2 / GIT-base-coco side-by-side comparison demo with per-model BLEU + latency.
- **Phase 4**: Sentry, Prometheus, DagsHub-hosted MLflow link, Architecture Decision Records (`docs/adr/`).
- **Future work**: ViT + Transformer fine-tune on COCO; VLM API integration (Anthropic Claude vision) behind a feature flag; VQA endpoint.

Detailed plan: [`docs/restructure-plan.md`](docs/restructure-plan.md).

---

## Citation

If you reference this work in academic writing, please cite the IEEE paper:

```bibtex
@inproceedings{ainarratives,
  title     = {AI Narratives: Bridging Visual Content and Linguistic Expression},
  booktitle = {Proceedings of the IEEE Conference},
  publisher = {IEEE},
  year      = {2024},
  url       = {https://ieeexplore.ieee.org/document/10675203},
}
```

---

## Acknowledgements

- The model architecture, hyperparameters, and BLEU baseline are from the IEEE-published paper *AI Narratives: Bridging Visual Content and Linguistic Expression*.
- COCO 2017 captions provided by the [Microsoft COCO project](https://cocodataset.org/).
- TensorFlow / Keras for the model layers; Pydantic for the configuration system; sacrebleu for evaluation; Ruff, mypy, and pytest for tooling.

---

## License

Released under the [MIT License](LICENSE). The IEEE paper itself is published under separate terms.

---

## Author

**Apoorv Raj**, AI / ML systems engineer.
Repository structured by phase; contributions and issues welcome.