apoorvrajdev committed on
Commit
77c9bce
·
1 Parent(s): 785dbd5

docs: restructure README with recruiter-focused layout and detailed phased roadmap

  1. README.md +370 -361
README.md CHANGED
@@ -1,124 +1,183 @@
1
- # Image Captioning System
2
 
3
- > CNN + Transformer architecture for visual-to-language generation, restructured from an IEEE-published research notebook into a production-style multimodal AI codebase.
4
-
5
- <p align="left">
6
- <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-3776AB?logo=python&logoColor=white">
7
- <img alt="TensorFlow 2.15" src="https://img.shields.io/badge/TensorFlow-2.15-FF6F00?logo=tensorflow&logoColor=white">
8
- <img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white">
9
- <img alt="FastAPI ready" src="https://img.shields.io/badge/FastAPI-ready-009688?logo=fastapi&logoColor=white">
10
  </p>
11
 
12
- <p align="left">
13
- <img alt="React 19" src="https://img.shields.io/badge/React-19-61DAFB?logo=react&logoColor=black">
14
- <img alt="Vite 8" src="https://img.shields.io/badge/Vite-8-646CFF?logo=vite&logoColor=white">
15
- <img alt="Frontend integrated" src="https://img.shields.io/badge/frontend-integrated-brightgreen">
16
- <img alt="API connected" src="https://img.shields.io/badge/API-connected-009688">
 
 
17
  </p>
18
 
19
- <p align="left">
20
- <img alt="Ruff" src="https://img.shields.io/badge/lint-ruff-261230?logo=ruff&logoColor=white">
21
- <img alt="mypy" src="https://img.shields.io/badge/typed-mypy-1F5082">
22
- <img alt="Tests" src="https://img.shields.io/badge/tests-37%20passing-brightgreen">
23
- <img alt="Pre-commit" src="https://img.shields.io/badge/pre--commit-enabled-FAB040?logo=pre-commit&logoColor=white">
 
 
24
  </p>
25
 
26
- <p align="left">
27
- <img alt="IEEE Published" src="https://img.shields.io/badge/IEEE-published-00629B?logo=ieee&logoColor=white">
28
- <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-lightgrey">
29
- <img alt="Phase 1" src="https://img.shields.io/badge/Phase%201-complete-brightgreen">
30
- <img alt="Phase 2A" src="https://img.shields.io/badge/Phase%202A-complete-brightgreen">
31
- <img alt="Phase 2B" src="https://img.shields.io/badge/Phase%202B-complete-brightgreen">
32
  </p>
33
 
34
  ---
35
 
36
- ## Overview
 
 
37
 
38
- This repository implements an **end-to-end image-captioning pipeline** built around an InceptionV3 visual encoder and a custom multi-head Transformer decoder. The architecture is the basis of the IEEE-published paper *“AI Narratives: Bridging Visual Content and Linguistic Expression”*; this codebase lifts the original Kaggle research notebook into a typed, tested, configuration-driven Python package that can be reused from CLI, scripts, or a future serving layer.
39
 
40
- With Phase 2B complete, the system now runs as a **full-stack inference workflow**: a React/Vite frontend issues multipart uploads to the FastAPI `POST /v1/captions` endpoint, the backend predictor returns a typed response, and the end-to-end image-to-caption interaction is operational in the browser.
41
 
42
- The repository is structured in deliberate phases:
43
 
44
- | Phase | Focus | Status |
45
- |---|---|---|
46
- | 0: Bootstrap | Tooling, packaging, freeze policy | ✅ complete |
47
- | 1: Modularisation | Notebook → typed Python package, parity audit, unit tests | ✅ complete |
48
- | 2A: Backend Infrastructure | FastAPI inference API, structured logging, schemas, health checks, Swagger/OpenAPI, predictor lifecycle | ✅ complete |
49
- | 2B: Frontend UI | React/Vite frontend + upload UX + API integration | ✅ complete |
50
- | 3: Multimodal baselines | BLIP / ViT-GPT2 / GIT side-by-side comparison | ⏳ planned |
51
- | 4: Observability | Sentry, Prometheus metrics, ADRs | ⏳ planned |
52
 
53
- Phase notes live under [`docs/`](docs/): [restructure plan](docs/restructure-plan.md) · [Phase 0 notes](docs/PHASE_0_NOTES.md) · [Phase 1 notes](docs/PHASE_1_NOTES.md).
54
 
55
  ---
56
 
57
- ## Research backing
58
 
59
- The model architecture and the BLEU-4 ~24 baseline below come from the IEEE paper and its accompanying notebook:
60
 
61
- - **Paper:** [AI Narratives: Bridging Visual Content and Linguistic Expression](https://ieeexplore.ieee.org/document/10675203) (IEEE)
62
- - **Original notebook:** [Kaggle: image-captioning-using-dl](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
63
- - **Frozen artefact in this repo:** [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](notebooks/01_ieee_inceptionv3_transformer.ipynb), byte-stable; CI enforces its SHA-256.
64
 
65
- The notebook is preserved verbatim as the canonical research artefact. Improvements happen in the modular package; the notebook does not.
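The freeze check amounts to comparing the notebook's SHA-256 digest with a locked value. The following is a minimal sketch, not the repository's actual CI step: it assumes the lock file stores the hex digest as its first whitespace-separated token (the layout `sha256sum` produces), and the helper names `sha256_of` and `is_frozen` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_frozen(notebook: Path, lock_file: Path) -> bool:
    """True when the notebook bytes still match the locked digest.

    Assumption: the lock file's first token is the expected hex digest.
    """
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    return sha256_of(notebook) == expected
```

A CI job could call `is_frozen(...)` and fail the build when it returns `False`, which is one way to keep the research artefact byte-stable.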
 
 
 
 
 
66
 
67
  ---
68
 
69
- ## Architecture
70
 
71
  ```
72
- ┌──────────────┐   ┌─────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌────────────┐
- │ Input image  │──▶│ InceptionV3     │──▶│ Transformer      │──▶│ Transformer      │──▶│ Caption    │
- │ 299x299x3    │   │ encoder         │   │ encoder          │   │ decoder          │   │ string     │
- └──────────────┘   │ (ImageNet,      │   │ (1 layer,        │   │ (2 layers,       │   └────────────┘
-                    │ frozen)         │   │ 1 head)          │   │ 8 heads)         │
-                    └─────────────────┘   └──────────────────┘   └──────────────────┘
                              ▼                     ▼                      ▼
-                       [B, 64, 2048]          [B, 64, 512]          [B, T, vocab]
-                      patch features       projected features  softmax over 15k tokens
81
  ```
82
 
83
  ### Components
84
 
85
- - **CNN encoder**: [`models/encoder_cnn.py`](src/captioning/models/encoder_cnn.py). Pretrained InceptionV3 with the classification head removed; output reshaped to a sequence of 64 spatial positions × 2048 channels. Weights are frozen during training.
86
- - **Transformer encoder**: [`models/transformer_encoder.py`](src/captioning/models/transformer_encoder.py). Single layer with one attention head. Projects InceptionV3 features into the decoder's embedding dimension and lets the decoder attend across spatial positions.
87
- - **Embeddings**: [`models/embeddings.py`](src/captioning/models/embeddings.py). Sum of token and *learned* positional embeddings (not sinusoidal; preserved from the published architecture).
88
- - **Transformer decoder**: [`models/transformer_decoder.py`](src/captioning/models/transformer_decoder.py). Causal self-attention over partial captions, cross-attention over image features, and a feed-forward sub-block. 8 heads, ``embedding_dim=512``, dropouts (0.1 / 0.3 / 0.5) preserved from the IEEE configuration.
89
  - **Captioning model**: [`models/captioning_model.py`](src/captioning/models/captioning_model.py). Custom `train_step` / `test_step` with masked sparse-categorical cross-entropy and masked accuracy.
90
- - **Tokenizer**: [`preprocessing/tokenizer.py`](src/captioning/preprocessing/tokenizer.py). `CaptionTokenizer` wraps `tf.keras.layers.TextVectorization`; persists the vocabulary as both pickle (notebook-compatible) and JSON sidecar.
91
- - **Inference**: [`inference/predictor.py`](src/captioning/inference/predictor.py). `CaptionPredictor.from_artifacts(weights, vocab, config)` loads everything once at boot, exposes `predict_path(...)` and `predict_tensor(...)` for stateless calls, and `warmup()` for first-request latency.
92
- - **Configuration**: [`config/schema.py`](src/captioning/config/schema.py). Pydantic v2 schemas (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`); strict (`extra="forbid"`) so typos in YAML or env vars become load-time errors instead of silent drift.
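The masked loss and masked accuracy mentioned for the captioning model can be illustrated without TensorFlow. This is a plain-Python sketch of the idea only, not the code in `captioning_model.py`: it assumes token id `0` is padding (as it is for `TextVectorization`), and the function name and argument layout are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

PAD_ID = 0  # assumption: id 0 marks padded positions


def masked_loss_and_accuracy(
    probs: Sequence[Sequence[float]],
    targets: Sequence[int],
    pad_id: int = PAD_ID,
) -> Tuple[float, float]:
    """Mean cross-entropy and accuracy over non-padding positions only.

    probs[t] is the predicted distribution over the vocabulary at step t;
    targets[t] is the reference token id at step t.
    """
    total_loss = 0.0
    correct = 0
    counted = 0
    for step_probs, target in zip(probs, targets):
        if target == pad_id:
            continue  # padded steps contribute to neither metric
        total_loss += -math.log(max(step_probs[target], 1e-12))
        predicted = max(range(len(step_probs)), key=step_probs.__getitem__)
        correct += int(predicted == target)
        counted += 1
    if counted == 0:
        return 0.0, 0.0
    return total_loss / counted, correct / counted


example_probs: List[List[float]] = [[0.1, 0.9, 0.0], [0.2, 0.3, 0.5], [1.0, 0.0, 0.0]]
example_loss, example_acc = masked_loss_and_accuracy(example_probs, [1, 1, 0])
```

In the example the third position is padding, so both metrics are averaged over the first two steps only.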
 
 
 
 
93
 
94
  ---
95
 
96
- ## Sample outputs
97
 
98
  | Image | Generated caption |
99
  |---|---|
100
  | ![](https://github.com/user-attachments/assets/64e8412b-1d49-404c-a5b2-1da121b224e2) | *a man is standing on a beach with a surfboard* |
101
  | ![](https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd) | *a man riding a motorcycle on a street* |
102
 
103
- Outputs above are from the IEEE notebook; the modular pipeline reproduces these via the parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)).
 
 
 
 
 
 
 
 
 
 
 
 
104
 
105
  ---
106
 
107
- ## Performance
108
 
109
  | Metric | Value | Source |
110
  |---|---|---|
111
- | BLEU-4 | ~24 | Reported in the IEEE paper / Kaggle notebook |
112
- | Vocabulary size | 15,000 tokens | TextVectorization adapt over preprocessed COCO captions |
113
  | Training set | ~120k captions sampled from COCO 2017 | `data.sample_size` in [`configs/base.yaml`](configs/base.yaml) |
114
  | Image resolution | 299 × 299 (InceptionV3) | [`preprocessing/image.py`](src/captioning/preprocessing/image.py) |
115
  | Max caption length | 40 tokens | `model.max_length` in [`configs/base.yaml`](configs/base.yaml) |
 
 
116
 
117
- > Re-training on the modular pipeline is a Phase 2 deliverable; once a fresh checkpoint exists, this table will be expanded with corpus BLEU-1..4, CIDEr, METEOR, and ROUGE-L (already implemented in [`evaluation/`](src/captioning/evaluation/)).
118
 
119
  ---
120
 
121
- ## Current model quality status
122
 
123
  The frontend, backend, and inference pipeline are operational end-to-end against the modular package, but **caption quality from the current modular pipeline is still below expectations**. The IEEE notebook reported BLEU-4 ~24; a freshly trained checkpoint produced by the modular trainer has not yet reproduced that figure on COCO. The serving stack is production-style and ready for a real checkpoint; what is missing is the checkpoint itself.
124
 
@@ -129,29 +188,40 @@ Current engineering effort is focused on:
129
  - **Decoding improvements**: replacing greedy-only generation with beam search, repetition controls, and length normalisation.
130
  - **Reproducible benchmarking**: emitting one consistent artefact set per evaluation run so any two runs (or any two models) can be diffed without bespoke parsing per checkpoint.
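To make the decoding item concrete, here is a toy beam search with length normalisation. It is a sketch under stated assumptions, not the project's decoder: `step_fn` stands in for the model and returns next-token probabilities for a prefix, the `[end]` token name and the `length_alpha` exponent are illustrative choices, and repetition controls are left out.

```python
import math
from typing import Callable, Dict, List, Tuple

StepFn = Callable[[List[str]], Dict[str, float]]


def beam_search(
    step_fn: StepFn,
    beam_width: int = 3,
    max_length: int = 10,
    end_token: str = "[end]",
    length_alpha: float = 0.7,
) -> List[str]:
    """Return the hypothesis with the best length-normalised log-probability."""
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_length):
        candidates: List[Tuple[List[str], float]] = []
        for tokens, logp in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], logp + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            # Hypotheses that emitted the end token leave the active beam.
            (finished if tokens[-1] == end_token else beams).append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length still compete

    def normalised(item: Tuple[List[str], float]) -> float:
        tokens, logp = item
        return logp / (len(tokens) ** length_alpha)

    return max(finished, key=normalised)[0]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Hand-written distribution used only to exercise the search."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"[end]": 0.45, "c": 0.55},
    }
    return table.get(tuple(prefix), {"[end]": 1.0})
```

With `beam_width=1` the function reduces to greedy decoding. Setting `length_alpha=0.0` disables normalisation, which makes the search favour shorter hypotheses; a positive exponent offsets that bias.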
131
 
132
- The weights currently committed under [`models/v1.0.0/`](models/v1.0.0/) are the **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py): the architecture is wired correctly, but every weight is randomly initialised. They exist to exercise the serving stack (lifespan, predictor wiring, multipart upload, frontend integration) before a real COCO-trained checkpoint is dropped in. Captions returned by the live API today will therefore look like noise; that is the *intended* state of the bootstrap path, not a regression, and it will persist until a properly COCO-trained checkpoint replaces those files.
133
 
134
- This gap is being addressed through the **stabilized training workflow** introduced at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml), which gates the convergence-stability primitives behind explicit, ablatable flags rather than rewriting the baseline.
135
 
136
  ### Accuracy investigation (ongoing)
137
 
138
- The shift from "notebook reproduction" to "modular pipeline that *also* trains well" surfaced several concrete findings, each addressed in code rather than in commentary:
139
-
140
- - **Greedy decoding limited caption quality and diversity.** Argmax-per-step decoding routinely picked the locally-most-probable token regardless of how that affected the overall sequence likelihood, biasing outputs toward a small "safe captions" basin. Beam-search infrastructure now lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) and dispatches through `CaptionPredictor` alongside the existing greedy path; decode strategy is selectable per inference call and per evaluation run.
141
- - **BLEU-only evaluation hid behaviour the score did not reflect.** CIDEr, METEOR, and ROUGE-L are implemented under [`src/captioning/evaluation/`](src/captioning/evaluation/) (`cider.py`, `meteor.py`, `rouge.py`) and run through the same corpus-level runner that already produces BLEU-1..4. Every evaluation now emits the full metric set in a single `metrics.json`.
142
  - **Validation-time dropout parity quirks** inherited from the notebook (`compute_loss_and_acc` ignoring its `training` argument, so dropout stayed active during validation) were identified during the parity audit. They are now gated behind an explicit config flag (`train.honour_training_flag_in_test_step`) so notebook parity is preserved by default and the conventional dropout-free validation path is opt-in via [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
143
- - **Training stabilization experiments** were introduced as opt-in flags so they can be ablated cleanly rather than entangled with the baseline:
144
  - label smoothing (`train.label_smoothing`),
145
  - cosine LR schedule (`train.lr_schedule: cosine`),
146
  - warmup steps (`train.warmup_steps`),
147
  - dropout-free validation path (`train.honour_training_flag_in_test_step`).
148
- - A complete experimental training config, not a thin override, lives at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml). It is identical to [`configs/base.yaml`](configs/base.yaml) except for the four flags above, so any quality delta between the two runs is attributable to those flags alone.
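Two of the opt-in flags listed above (`train.lr_schedule: cosine` with `train.warmup_steps`, and `train.label_smoothing`) correspond to standard techniques that fit in a few lines. This is a sketch of the textbook formulas, assuming linear warmup followed by cosine decay and uniform label smoothing; the function names and the example numbers are hypothetical and may not match the trainer's implementation.

```python
import math
from typing import List


def lr_at_step(
    step: int,
    base_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to base_lr, then cosine decay towards min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


def smooth_labels(target: int, vocab_size: int, smoothing: float) -> List[float]:
    """One-hot target mixed with a uniform distribution over the vocabulary."""
    distribution = [smoothing / vocab_size] * vocab_size
    distribution[target] += 1.0 - smoothing
    return distribution
```

For instance, with `base_lr=1e-3`, `warmup_steps=10`, and `total_steps=110`, the rate rises linearly over the first ten steps, peaks at `1e-3`, and reaches half that value midway through the decay.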
149
 
150
- These changes are aimed at convergence stability and caption generalisation **before** Phase 3 model upgrades. Comparing the original CNN + Transformer against modern multimodal baselines is only meaningful once the original is trained to the strongest version of itself the architecture can support.
151
 
152
  ---
153
 
154
- ## Project structure
155
 
156
  ```
157
  image-captioning-system/
@@ -174,70 +244,58 @@ image-captioning-system/
174
  ├── backend/                          # Phase 2A: FastAPI inference service
  │   └── app/
  │       ├── main.py                   # App factory + lifespan-managed predictor singleton
- │       ├── api/                      # Thin HTTP routes: /healthz, /v1/captions
- │       ├── core/                     # BackendSettings, structured logging, request IDs
- │       ├── schemas/                  # Pydantic request/response schemas
- │       ├── services/                 # PredictorService: image bytes → caption + latency
- │       └── utils/                    # Image decoding + content-type guards
  │
  ├── frontend/                         # Phase 2B: React 19 + Vite 8 + Tailwind v4 SPA
- │   ├── index.html                    # Vite entry; mounts <App /> into #root
- │   ├── vite.config.js                # Vite + @vitejs/plugin-react + Tailwind v4 plugin
- │   ├── eslint.config.js              # Flat ESLint config (React + Hooks + React Refresh)
- │   ├── package.json                  # React 19, Vite 8, Tailwind v4
- │   ├── .env.example                  # VITE_API_BASE: env-driven backend origin
- │   ├── public/                       # Static assets served verbatim (favicon, icons)
  │   └── src/
- │       ├── main.jsx                  # React root + StrictMode bootstrap
- │       ├── App.jsx                   # Page composition + upload → generate flow
- │       ├── index.css                 # Tailwind v4 entry (single @import)
- │       ├── services/
- │       │   └── api.js                # checkHealth / captionImage: AbortController + typed ApiError
  │       └── components/
- │           ├── Header.jsx            # Brand bar + StatusBadge slot
- │           ├── StatusBadge.jsx       # /healthz poller (10 s): checking/online/offline state machine
- │           ├── UploadZone.jsx        # Drag/drop + click-to-browse + client-side validation
- │           ├── ImagePreview.jsx      # Selected-file preview + size/format meta + clear
- │           ├── CaptionResult.jsx     # Caption + model_version / decode / latency / request_id
- │           ├── ErrorBanner.jsx       # Dismissible error display (network / timeout / HTTP)
- │           └── Spinner.jsx           # Shared loading indicator (sm / md / lg)
  │
  ├── configs/
- │   ├── base.yaml                     # IEEE hyperparameters (cell 6 mirror)
  │   └── train/
- │       ├── debug.yaml                # CI smoke override
- │       └── stabilized.yaml           # Phase 1b stability experiment (label smoothing, cosine LR, warmup)
  │
  ├── scripts/
  │   ├── train.py · evaluate.py · predict.py
- │   ├── inspect_predictions.py        # Per-sample diagnostics + diagnostics.jsonl writer
  │   ├── bootstrap_dev_artifacts.py    # Smoke-test artefacts so the API can boot pre-training
- │   └── notebook_module_audit.py      # Parity gate vs. notebook
- │
- ├── tests/unit/
- │   ├── test_caption_preprocessing.py · test_config.py · test_splits.py
- │   ├── test_tokenizer.py · test_image_preprocessing.py
- │   ├── test_evaluation.py · test_hashing.py
- │   └── conftest.py
- │
- ├── docs/
- │   ├── restructure-plan.md · PHASE_0_NOTES.md · PHASE_1_NOTES.md
  │
  ├── pyproject.toml · requirements*.txt · Makefile
  ├── .pre-commit-config.yaml · .python-version · .env.example
- ├── .paper-notebook.sha256            # Locked notebook hash for CI freeze check
  └── README.md
230
  ```
231
 
232
  ---
233
 
234
- ## Setup
235
 
236
- Requires **Python 3.10–3.12** (TensorFlow 2.15 has no 3.13 wheels).
237
 
238
- ### PowerShell (Windows)
 
 
 
 
239
 
240
  ```powershell
 
241
  py -3.10 -m venv .venv
242
  .venv\Scripts\activate
243
  pip install -r requirements-dev.txt -r requirements-eval.txt
@@ -245,9 +303,8 @@ pip install -e ".[hf,mlflow]"
245
  pre-commit install
246
  ```
247
 
248
- ### bash (Linux / macOS)
249
-
250
  ```bash
 
251
  python3.10 -m venv .venv
252
  source .venv/bin/activate
253
  pip install -r requirements-dev.txt -r requirements-eval.txt
@@ -255,69 +312,33 @@ pip install -e ".[hf,mlflow]"
255
  pre-commit install
256
  ```
257
 
258
- `make help` lists every available command (lint, format, type-check, test, train, serve, evaluate, predict, Docker, freeze-paper-notebook, ...).
259
-
260
- ---
261
-
262
- ## Training
263
-
264
- The training script consumes a YAML config validated by Pydantic:
265
-
266
- ```bash
267
- python -m scripts.train --config configs/base.yaml
268
- ```
269
-
270
- Override fields without editing YAML:
271
 
272
  ```bash
273
- # CLI smoke run on a 64-caption subset (1 epoch, batch 8)
274
- python -m scripts.train --config configs/base.yaml --override configs/train/debug.yaml
275
-
276
- # Env-var override (double-underscore = nesting delimiter)
277
- CAPTIONING__TRAIN__BATCH_SIZE=32 python -m scripts.train --config configs/base.yaml
278
  ```
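The double-underscore convention in the env-var override can be pictured as turning `CAPTIONING__TRAIN__BATCH_SIZE=32` into a nested overlay on the YAML config. The sketch below illustrates that idea with the standard library only; it assumes the prefix is `CAPTIONING__` and that keys are lower-cased, and the helpers `env_overrides` and `deep_merge` are hypothetical rather than the loader the project uses.

```python
from typing import Any, Dict, Mapping

PREFIX = "CAPTIONING__"  # assumption inferred from the example variable name


def env_overrides(environ: Mapping[str, str], prefix: str = PREFIX) -> Dict[str, Any]:
    """Turn PREFIX + A__B=value entries into {"a": {"b": "value"}}."""
    overrides: Dict[str, Any] = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        keys = [part.lower() for part in name[len(prefix):].split("__") if part]
        if not keys:
            continue
        node = overrides
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = raw  # still a string; schema validation coerces it later
    return overrides


def deep_merge(base: Mapping[str, Any], overlay: Mapping[str, Any]) -> Dict[str, Any]:
    """Return base with overlay applied recursively; inputs are not mutated."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Passing `os.environ` to `env_overrides` and merging the result over the parsed YAML would give the validated schema a single dictionary to check.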
279
 
280
- Outputs (`weights.h5`, `vocab.pkl` + `vocab.json` sidecar, `history.json`, `training_log.csv`) land under `outputs/runs/latest/` by default.
281
-
282
- The `Trainer` ([`training/trainer.py`](src/captioning/training/trainer.py)) wraps `model.compile + model.fit` with structured logging and history serialisation; everything else (loss, callbacks, optimizer choice) sits in dedicated modules so each piece can be unit-tested in isolation.
283
-
284
- ---
285
 
286
- ## Evaluation
287
 
288
  ```bash
289
- python -m scripts.evaluate \
290
- --config configs/base.yaml \
291
- --weights models/v1.0.0/model.h5 \
292
- --tokenizer-dir models/v1.0.0 \
293
- --report docs/results/v1.0.0.md \
294
- --max-samples 500
295
  ```
296
 
297
- Phase 1 ships **corpus BLEU-4 via sacrebleu** (deterministic, reproducible). CIDEr / METEOR / ROUGE-L slot into [`src/captioning/evaluation/`](src/captioning/evaluation/) in Phase 1b under the same runner interface.
298
-
299
- ---
300
-
301
- ## Inference
302
 
303
- ### Python API
304
 
305
- ```python
306
- from captioning.config import load_config
307
- from captioning.inference import CaptionPredictor
308
-
309
- config = load_config("configs/base.yaml")
310
- predictor = CaptionPredictor.from_artifacts(
311
- weights_path="models/v1.0.0/model.h5",
312
- tokenizer_dir="models/v1.0.0",
313
- config=config,
314
- )
315
- predictor.warmup()  # one dummy forward pass; removes the first-request latency spike
316
- caption = predictor.predict_path("photo.jpg")
317
- print(caption)
318
  ```
319
 
320
- ### CLI
321
 
322
  ```bash
323
  python -m scripts.predict \
@@ -327,53 +348,39 @@ python -m scripts.predict \
327
  --image samples/photo.jpg
328
  ```
329
 
330
- ### REST API (Phase 2A, operational)
331
-
332
- A FastAPI service under [`backend/app/`](backend/app/) is now live. The lifespan instantiates a single `CaptionPredictor`, runs `warmup()` once, and reuses it across every request: no per-request TF graph builds, no first-request latency cliff. The service currently boots against development bootstrap artefacts (see below); real Phase 1 weights drop in by replacing the files under `models/v1.0.0/`, no code changes required.
333
 
334
  ```bash
335
- # Boot the API
336
- uvicorn --app-dir backend app.main:app --host 0.0.0.0 --port 8000
337
-
338
- # Liveness + readiness (returns model_loaded + model_version + api_version)
339
- curl http://localhost:8000/healthz
340
-
341
- # Generate a caption from a multipart upload
342
- curl -X POST http://localhost:8000/v1/captions \
343
- -F "image=@samples/photo.jpg"
344
  ```
345
 
346
- Interactive Swagger UI is auto-generated at [`/docs`](http://localhost:8000/docs); the raw schema lives at [`/openapi.json`](http://localhost:8000/openapi.json).
347
-
348
- ### Frontend (Phase 2B, operational)
349
-
350
- A React 19 + Vite 8 + Tailwind v4 single-page app under [`frontend/`](frontend/) drives the same endpoints from the browser. The SPA posts multipart `FormData` to `POST /v1/captions`, polls `GET /healthz` every 10 seconds for a live status badge, consumes the typed `CaptionResponse` schema, and renders caption + `model_version` + `decode_strategy` + `latency_ms` + `request_id` exactly as the backend returns them. Loading, error, and success states are surfaced through dedicated components; network failures, request timeouts (3 s health / 60 s caption), CORS rejections, and non-2xx responses are all classified into a single typed `ApiError` shape so the UI shows actionable copy instead of a raw `Failed to fetch`.
351
 
352
  ```bash
353
- # Boot the frontend dev server
354
- cd frontend
355
- npm install
356
- npm run dev
357
- # Defaults to http://localhost:5173 (Vite picks the next free port if 5173 is busy)
358
  ```
359
 
360
- `VITE_API_BASE` (see [`frontend/.env.example`](frontend/.env.example)) points the SPA at any backend origin; absent the env var, the client falls back to `http://127.0.0.1:8000`. The dev origins `localhost:5173/5174` and `127.0.0.1:5173/5174` are pre-allowed in [`configs/base.yaml`](configs/base.yaml) under `serve.cors_allowed_origins` so the browser accepts cross-origin responses end-to-end.
 
 
361
 
362
  ---
363
 
364
- ## FastAPI backend
365
 
366
- Phase 2A delivers a production-style inference service rather than a thin demo wrapper. The split mirrors how a real serving stack is laid out:
367
 
368
  - **App factory + lifespan**: [`backend/app/main.py`](backend/app/main.py). `create_app()` builds the FastAPI instance; the lifespan loads the YAML `AppConfig`, instantiates a `CaptionPredictor`, calls `warmup()`, and stashes a `PredictorService` singleton on `app.state` so every request reuses one warm model.
369
  - **Routes**: [`backend/app/api/routes.py`](backend/app/api/routes.py). Intentionally thin: validate inputs, delegate, shape the response. No TF imports leak into the HTTP layer.
370
- - **Service layer**: [`backend/app/services/predictor_service.py`](backend/app/services/predictor_service.py). Wraps the predictor, decodes uploaded bytes, measures per-request latency, and returns `(caption, latency_ms)`.
371
- - **Schemas**: [`backend/app/schemas/caption.py`](backend/app/schemas/caption.py). Pydantic v2 request/response models (`CaptionResponse`, `HealthResponse`, `ErrorResponse`); every payload that crosses the wire is typed and OpenAPI-documented.
372
- - **Backend settings**: [`backend/app/core/config.py`](backend/app/core/config.py). Separate `BackendSettings` (env-overridable: weights path, tokenizer dir, model version, warmup toggle) layered on top of the research-side `AppConfig`. The two are deliberately distinct: research hyperparameters and serving knobs change on different cadences.
373
  - **Structured logging + request IDs**: [`backend/app/core/logging.py`](backend/app/core/logging.py). `RequestContextMiddleware` stamps each request with a UUID; `structlog` carries it through every log line so a single failed caption can be traced end-to-end.
374
- - **Image safety**: [`backend/app/utils/image.py`](backend/app/utils/image.py). Content-type allow-list (JPEG / PNG / WebP / BMP), explicit `ImageDecodeError` so malformed bytes produce a clean `422` rather than a 500.
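The service-layer contract described in the list above (wrap one warm predictor, time each call, return `(caption, latency_ms)`) can be sketched independently of FastAPI and TensorFlow. The class below is an illustration only: it borrows the `PredictorService` name from the text, but its constructor, the `caption` method, and the `predict_fn` callable are assumptions rather than the repository's API.

```python
import time
from typing import Callable, Tuple


class PredictorService:
    """Wraps a single caption function and reports per-request latency."""

    def __init__(self, predict_fn: Callable[[bytes], str]) -> None:
        # predict_fn stands in for the warm model loaded once at startup.
        self._predict_fn = predict_fn

    def caption(self, image_bytes: bytes) -> Tuple[str, float]:
        started = time.perf_counter()
        text = self._predict_fn(image_bytes)
        latency_ms = (time.perf_counter() - started) * 1000.0
        return text, latency_ms


# One instance is created at startup and shared by every request handler.
service = PredictorService(lambda image_bytes: "a placeholder caption")
caption_text, elapsed_ms = service.caption(b"\xff\xd8\xff")
```

Keeping the timing inside the service, not the route, means the reported latency covers inference only and excludes HTTP parsing.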
375
-
376
- ### Endpoints
377
 
378
  | Method | Path | Purpose |
379
  |---|---|---|
@@ -382,34 +389,20 @@ Phase 2A delivers a production-style inference service rather than a thin demo w
382
  | `GET` | `/docs` | Interactive Swagger UI, auto-generated from the Pydantic schemas. |
383
  | `GET` | `/openapi.json` | Raw OpenAPI 3.1 spec for client codegen. |
384
 
385
- `POST /v1/captions` enforces input validation at the boundary: 415 on disallowed content types, 413 on oversized uploads (`serve.max_upload_bytes`), 422 on undecodable image bytes, 400 on empty uploads, 503 while the predictor is still loading during a rolling restart.
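The boundary checks can be read as an ordered decision function from request properties to a status code. This sketch assumes one plausible ordering of the checks and uses a magic-byte test as a stand-in for real image decoding; the function names, the MIME strings, and the ordering are assumptions, and only the status codes and the JPEG / PNG / WebP / BMP allow-list come from the README.

```python
ALLOWED_CONTENT_TYPES = {"image/jpeg", "image/png", "image/webp", "image/bmp"}


def looks_like_image(body: bytes) -> bool:
    """Cheap magic-byte check; a stand-in for actually decoding the image."""
    if body.startswith(b"\xff\xd8\xff"):  # JPEG
        return True
    if body.startswith(b"\x89PNG\r\n\x1a\n"):  # PNG
        return True
    if body.startswith(b"BM"):  # BMP
        return True
    return body[:4] == b"RIFF" and body[8:12] == b"WEBP"  # WebP


def upload_status(
    content_type: str,
    body: bytes,
    max_upload_bytes: int,
    model_loaded: bool = True,
) -> int:
    """Return the HTTP status a caption upload would receive."""
    if not model_loaded:
        return 503  # predictor still loading
    if content_type not in ALLOWED_CONTENT_TYPES:
        return 415  # disallowed content type
    if len(body) == 0:
        return 400  # empty upload
    if len(body) > max_upload_bytes:
        return 413  # oversized upload
    if not looks_like_image(body):
        return 422  # bytes cannot be decoded as an image
    return 200
```

Checking size before decoding keeps oversized payloads away from the image decoder, which is the usual reason to order the checks this way.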
386
-
387
- ### Bootstrap dev artifacts
388
-
389
- [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py) generates a *valid but untrained* set of weights + tokenizer under `models/v1.0.0/` so the entire serving stack (lifespan, routes, multipart upload, predictor wiring) can be exercised end-to-end before Phase 1 training has been run on COCO. **The captions it produces are gibberish by design**: every weight is randomly initialised. The point is architectural smoke-testing, not prediction quality. Drop real Phase 1 outputs into the same directory and the backend serves them with zero code changes.
390
-
391
- ```bash
392
- python -m scripts.bootstrap_dev_artifacts \
393
- --config configs/base.yaml \
394
- --output-dir models/v1.0.0
395
- ```
396
 
397
  ---
398
 
399
- ## Frontend UI (Phase 2B)
400
 
401
- Phase 2B ships a single-page inference UI under [`frontend/`](frontend/), not a styled demo. The split mirrors the backend's separation between transport, service, and presentation:
402
 
403
- - **Application shell**: [`frontend/src/App.jsx`](frontend/src/App.jsx). Owns the request lifecycle (selected file → preview → generate → result). The preview `URL.createObjectURL` is `useMemo`-derived and revoked through an effect cleanup so previews never leak memory across uploads. Four `useState` slots (`file`, `result`, `error`, `loading`) cover every UI state: no Redux, no React Query, no context.
404
- - **API service layer**: [`frontend/src/services/api.js`](frontend/src/services/api.js). Single boundary for every backend call. Reads `import.meta.env.VITE_API_BASE` once at module load (falls back to `http://127.0.0.1:8000`), wraps `fetch` with `AbortController`-driven timeouts (3 s for `/healthz`, 60 s for `/v1/captions`), and classifies failures into `timeout` / `network` / `http` / `unknown` kinds on a typed `ApiError` so components never see a raw `TypeError`.
405
- - **Upload zone**: [`frontend/src/components/UploadZone.jsx`](frontend/src/components/UploadZone.jsx). Drag/drop + click-to-browse + keyboard activation (`Enter` / `Space`). Validates content-type (JPEG / PNG / WebP) and size (10 MB) before the file ever touches the network; invalid uploads are rejected client-side with the same wording the backend would have returned, so the user experience is consistent whether validation fires locally or remotely.
406
- - **Image preview**: [`frontend/src/components/ImagePreview.jsx`](frontend/src/components/ImagePreview.jsx). Renders the selected file via its object URL with size/format metadata and a clear button. Disabled while a request is in flight so re-drops cannot race the POST.
407
- - **Caption result**: [`frontend/src/components/CaptionResult.jsx`](frontend/src/components/CaptionResult.jsx). Consumes the backend's typed `CaptionResponse` directly: caption text plus model version, decode strategy, latency in milliseconds, and the request ID echoed from the `x-request-id` header. Copy-to-clipboard is built in for log correlation during debugging.
408
- - **Status badge**: [`frontend/src/components/StatusBadge.jsx`](frontend/src/components/StatusBadge.jsx). Polls `/healthz` every 10 seconds and on window focus, runs a three-state machine (`checking` / `online` / `offline`), and recovers automatically when the backend comes back; no page reload required.
409
  - **Error banner**: [`frontend/src/components/ErrorBanner.jsx`](frontend/src/components/ErrorBanner.jsx). Single surface for every failure class. Reads `ApiError.message` so the user sees "Cannot reach backend" or "Request timed out" instead of a raw browser error.
410
- - **Spinner / Header**: [`frontend/src/components/Spinner.jsx`](frontend/src/components/Spinner.jsx) and [`frontend/src/components/Header.jsx`](frontend/src/components/Header.jsx). Shared loading indicator and the sticky brand bar that hosts the status badge.
411
-
412
- ### Upload flow
413
 
414
  ```
415
  ┌──────────────┐ drag/drop ┌─────────────┐ validate ┌──────────────┐
@@ -423,56 +416,18 @@ Phase 2B ships a single-page inference UI under [`frontend/`](frontend/), not a
423
  │ typed CaptionResponse / ApiError
  ▼
  ┌──────────────────────┐
- │ CaptionResult /      │
  │ ErrorBanner          │
  └──────────────────────┘
429
  ```
430
 
431
- ### State, transport, and frontend/backend separation
432
-
433
- State management is intentionally local: four `useState` slots in `App.jsx` (`file`, `result`, `error`, `loading`) plus a `useMemo`-derived preview URL. The data flow is shallow enough that an extra abstraction would obscure rather than help. All cross-cutting concerns (timeouts, error classification, env-driven base URL) live in the API service layer so components stay declarative and lift no transport details into JSX.
434
-
435
- Frontend and backend are deployed independently. The SPA only knows the backend's origin via `VITE_API_BASE`; the backend only trusts SPAs whose origin appears in `serve.cors_allowed_origins`. Dev origins (`localhost:5173/5174`, `127.0.0.1:5173/5174`) are pre-allowed in [`configs/base.yaml`](configs/base.yaml); production origins join the same list at deploy time. No shared build, no shared runtime, no shared state; only the typed Pydantic schemas defined in [`backend/app/schemas/caption.py`](backend/app/schemas/caption.py) cross the wire.
436
-
437
- ### UX, error handling, and loading states
438
-
439
- - **Loading**: the Generate button shows the shared [`Spinner`](frontend/src/components/Spinner.jsx) and disables itself for the entire request; the upload zone is locked in parallel so a re-drop cannot race the in-flight POST.
440
- - **Errors**: every failure surfaces through `ErrorBanner` with copy specific to its `ApiError.kind`. Network/CORS failures, request timeouts, and `4xx` / `5xx` payloads each map to a distinct, actionable message.
441
- - **Status awareness**: when the backend is down, `StatusBadge` flips to red within one poll cycle; when it comes back, the badge recovers automatically without a page reload, and a fresh `/healthz` is also fired on window focus.
442
- - **Responsive layout**: Tailwind v4's grid (`lg:grid-cols-5`) drops to a single column under the `lg` breakpoint, preserving the upload → preview → result flow on tablet and phone widths. The sticky header keeps the live status badge visible while scrolling.
443
-
444
- ### Environment configuration
445
-
446
- ```bash
447
- # frontend/.env (gitignored) β€” overrides the default backend origin
448
- VITE_API_BASE=http://127.0.0.1:8000
449
- ```
450
-
451
- The variable is read once at module load and stripped of any trailing slash. Absent the variable, the client falls back to `http://127.0.0.1:8000`; production builds set the variable at build time so the SPA can ship as static assets to Vercel, Cloudflare Pages, HuggingFace Spaces, or any CDN.
452
-
453
- ### Production deployment readiness
454
-
455
- - **Static-asset build** β€” `npm run build` emits a hash-named bundle under `frontend/dist/` that any static host can serve; no runtime Node process is required.
456
- - **Origin pinning** β€” the CORS allow-list in `configs/base.yaml` plus `VITE_API_BASE` at build time tie a given SPA build to a specific backend origin without a runtime config endpoint.
457
- - **No secrets in the client** β€” the SPA carries no API keys; the only network surface it depends on is `/healthz` and `/v1/captions` on the configured backend.
458
- - **Lint-clean** β€” `npm run lint` (flat ESLint config with `eslint-plugin-react-hooks` and `eslint-plugin-react-refresh`) runs alongside the Python tooling.
459
-
460
- ```bash
461
- # Development server (Vite + HMR on :5173)
462
- cd frontend
463
- npm install
464
- npm run dev
465
-
466
- # Production build + local preview of the built bundle
467
- npm run build
468
- npm run preview
469
- ```
470
 
  ---

- ## Configuration system

- Hyperparameters are not globals. They live in YAML files validated by Pydantic v2 `BaseSettings`:

  ```yaml
  # configs/base.yaml: mirrors the IEEE notebook cell 6 verbatim
@@ -498,17 +453,17 @@ data:
  Three load-time guarantees:

  1. **Type validation.** `batch_size: "64"` (string instead of int) raises a `ValidationError` pointing at the field, not a downstream tensor-shape error.
- 2. **No silent typos.** `extra="forbid"` rejects unknown keys (e.g. `vocabularsy_size`). A typo in an ML hyperparameter that silently falls back to a default is the worst possible failure mode, and `extra="forbid"` eliminates it.
  3. **Env overrides.** `CAPTIONING__TRAIN__BATCH_SIZE=32` overrides at any nesting depth, which is useful for CI smoke tests, ablations, and serve-time tuning without rebuilding images.

- Schema lives in [`src/captioning/config/schema.py`](src/captioning/config/schema.py); loader in [`config/loader.py`](src/captioning/config/loader.py).
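
The three guarantees can be illustrated without the real schema. The sketch below is a standard-library stand-in, not the project's loader: the function name `apply_env_overrides`, the `epochs` default, and the casting rule are assumptions made for illustration, and only the `CAPTIONING__` prefix and `__` delimiter come from the example above.

```python
import copy
from typing import Any, Mapping

PREFIX = "CAPTIONING__"  # taken from the README example above
DELIMITER = "__"         # assumed nesting delimiter


def apply_env_overrides(config: dict[str, Any], environ: Mapping[str, str]) -> dict[str, Any]:
    """Return a copy of `config` with CAPTIONING__A__B=value overrides applied.

    Unknown keys raise KeyError (the spirit of extra="forbid"); values are cast
    to the type of the default they replace (a crude form of type validation
    that only suits int, float, and str defaults).
    """
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        *parents, leaf = name[len(PREFIX):].lower().split(DELIMITER)
        node = merged
        for part in parents:
            if not isinstance(node.get(part), dict):
                raise KeyError(f"unknown config section in {name}")
            node = node[part]
        if leaf not in node or isinstance(node[leaf], dict):
            raise KeyError(f"unknown config key in {name}")
        try:
            node[leaf] = type(node[leaf])(raw)
        except ValueError as exc:
            raise ValueError(f"{name}={raw!r} has the wrong type") from exc
    return merged


# Illustrative defaults; the real values live in configs/base.yaml.
base = {"train": {"batch_size": 64, "epochs": 10}}
print(apply_env_overrides(base, {"CAPTIONING__TRAIN__BATCH_SIZE": "32"}))
```

A misspelled variable such as `CAPTIONING__TRAIN__BATCH_SIZ` raises immediately instead of leaving the default in place, which is the failure mode the second guarantee is meant to rule out.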

  ---

- ## Testing & code quality

  ```bash
- make test # pytest 37/37 (unit + integration)
  make lint # Ruff lint + format check
  make typecheck # mypy strict on src/captioning + scripts
  make pre-commit # All hooks across all files
@@ -518,136 +473,190 @@ make freeze-paper-notebook # Asserts notebook SHA-256 unchanged
  | Layer | Tool | Status |
  |---|---|---|
  | Lint + format | [Ruff](https://docs.astral.sh/ruff/) (replaces black + isort + flake8) | ✅ clean |
- | Type-check | [mypy](https://mypy.readthedocs.io/) with `pandas-stubs`, `types-PyYAML`, `types-requests` | ✅ 0 errors / 34 files |
- | Tests | pytest + pytest-cov + pytest-asyncio | ✅ 37 passing |
  | Notebook hygiene | [`nbstripout`](https://github.com/kynan/nbstripout) (pre-commit) | ✅ outputs stripped on commit |
  | Secret scanning | [`gitleaks`](https://github.com/gitleaks/gitleaks) (pre-commit) | ✅ enabled |
- | Notebook integrity | SHA-256 freeze check via [`make freeze-paper-notebook`](Makefile) | ✅ locked |
  | Parity audit | [`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py), 4 stages | ✅ all passing |

  The parity audit re-implements four notebook stages inline (caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, decoder forward pass) and asserts the modular path produces byte-identical (or `tf.allclose`-identical) output. It is the contract that gates any behavioural improvement.
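
A notebook freeze of the kind listed in the table can be sketched with the standard library alone. This is a minimal illustration, not the project's `make freeze-paper-notebook` target: the function names and the way the expected digest is supplied are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large notebooks are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_frozen(path: Path, expected_hex: str) -> None:
    """Raise if the file's SHA-256 differs from the recorded digest."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise RuntimeError(f"{path} changed: expected {expected_hex}, got {actual}")
```

Recording the digest once and calling `assert_frozen` in CI turns any edit to the frozen file, including a stray re-save that rewrites cell metadata, into a hard failure.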

  ---

- ## Key engineering improvements

- This is what separates this repository from a notebook conversion:

- - **Modular package** with the `src/` layout: every test exercises the *installed* package the same way users will.
- - **Strict Pydantic v2 configuration**: typed, validated, env-overridable, refuses unknown keys.
- - **`CaptionTokenizer` wrapper**: stable interface for the model and inference; Phase 5 can swap it for HuggingFace `tokenizers` without touching the encoder, decoder, or generation loop.
- - **Singleton-friendly inference**: `CaptionPredictor.from_artifacts(...)` + `warmup()` are designed for FastAPI lifespans, not just CLI calls.
- - **Shared train/serve preprocessing**: the same `preprocess_image_tensor` runs in `tf.data` pipelines and at inference time, eliminating train/serve skew by construction.
- - **Reproducibility**: seeded sampling, seeded splits, seeded RNGs (`utils.seed.set_global_seed`), pinned `tensorflow-cpu==2.15.0` (TF 2.16+ ships Keras 3 by default and silently breaks `TextVectorization` save/load).
- - **Notebook freeze**: IEEE artefact protected by a SHA-256 check; published BLEU stays reproducible across the project's lifetime.
- - **Optional dependency groups** (`[hf]`, `[eval]`, `[mlflow]`, `[dev]`): slim production image stays lean; HF baselines and metric tooling are opt-in extras.
- - **Decoupled experiment artefacts**: model weights live in HuggingFace Hub (planned), MLflow tracking on DagsHub free tier (planned). Git stays small.
- - **Structured logging**: `structlog` emits JSON in production, pretty colourised logs in dev, switched by `APP_ENV`.
- - **No silent rewrites**: every notebook → module move is documented with a cell mapping in [`docs/PHASE_1_NOTES.md`](docs/PHASE_1_NOTES.md); behavioural quirks (e.g. `compute_loss_and_acc` ignoring its `training` argument) are preserved verbatim with code comments referencing the doc.

- ---

- ## Limitations

- - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
- - The modular pipeline has not yet reproduced the IEEE notebook's BLEU-4 ~24 on a freshly trained checkpoint; see [Current model quality status](#current-model-quality-status). The bootstrap weights shipped under [`models/v1.0.0/`](models/v1.0.0/) are intentionally random and exist only for architectural smoke testing.
- - Beam search is implemented ([`inference/beam.py`](src/captioning/inference/beam.py)) and selectable per call/run, but a head-to-head benchmark against greedy on a real checkpoint is part of the in-progress Phase 1b validation, not a published result yet.
- - CIDEr / METEOR / ROUGE-L are implemented ([`evaluation/`](src/captioning/evaluation/)) and emitted into `metrics.json` per run; finalised numbers from the modular pipeline are pending a stabilized COCO-trained checkpoint.
- - Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).

- These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).

- ---

- ## Experimental evaluation pipeline

- The repository is evolving from a "research notebook reproduction" into a reproducible experimentation platform. Evaluation is no longer a single BLEU number printed at the end of training; it is a structured set of artefacts that any future run, including the Phase 3 multimodal baselines, can be diffed against.

- The pieces:

- - **[`scripts/evaluate.py`](scripts/evaluate.py)**: single entrypoint for full corpus evaluation. Loads a checkpoint + tokenizer, runs decoding (greedy or beam) over the COCO validation slice, computes BLEU-1..4 / CIDEr / METEOR / ROUGE-L, and writes a versioned artefact set under `results/<run_id>/`.
- - **[`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)**: per-sample diagnostic view. Prints N random predictions vs. references with sentence-level BLEU-4 / ROUGE-L, prediction length, longest repeated-token run, and a set of failure flags (`empty` / `very_short` / `repetitive` / `under_length`). Used when the aggregate metric moves but the qualitative behaviour does not.
- - **Benchmark runner utilities**: [`src/captioning/evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py) defines `RunMeta` and `write_run_artifacts(...)`, the contract every evaluation run honours. Phase 3 cross-model comparison code joins multiple `results/<run_id>/` directories without bespoke parsers per model.
- - **Greedy vs. beam evaluation support**: the same evaluator accepts `--decode-strategy greedy|beam` plus beam-search controls (`--beam-width`, `--length-penalty`, `--no-repeat-ngram-size`), so a single command-line difference produces directly comparable artefact sets for the same checkpoint. Beam-search implementation lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py).
- - **`metrics.json` outputs**: every evaluation writes a typed metric report (BLEU-1..4, ROUGE-L, METEOR, CIDEr) plus run metadata in machine-readable form. The Phase 3 comparison plots will read these files directly; no per-run hand-typing of numbers into spreadsheets.
- - **`diagnostics.jsonl` inspection flow**: the same per-sample diagnostic rows that `scripts/inspect_predictions.py` prints to stdout are emitted as JSONL alongside the metrics. The downstream loader is whatever pandas / DuckDB query happens to be useful that day, instead of a bespoke parser per investigation.

- ### Current limitations

- - **No fresh fully-trained stabilized checkpoint is committed yet.** The stabilized training workflow exists in code; the resulting weights file does not yet sit under [`models/v1.0.0/`](models/v1.0.0/).
- - **Current repo weights are bootstrap/dev artefacts**: see [Current model quality status](#current-model-quality-status). They exist for serving-stack smoke tests, not for producing usable captions.
- - **Benchmark numbers from the modular pipeline are not yet finalized.** The metric harness is in place; the matching checkpoint to publish numbers from is not.
- - **Phase 3 multimodal baselines (BLIP / ViT-GPT2 / GIT) are planned** specifically because the original CNN + Transformer architecture has a quality ceiling that no amount of decoding tuning or schedule tweaking will lift past modern foundation-model baselines. Stabilization here is the floor; Phase 3 is the path past it.

  ---

- ## Roadmap
-
- - **Phase 1b** (in progress): beam search ✅, CIDEr / METEOR / ROUGE-L ✅ ([`evaluation/cider.py`](src/captioning/evaluation/cider.py), [`meteor.py`](src/captioning/evaluation/meteor.py), [`rouge.py`](src/captioning/evaluation/rouge.py)), stabilized training workflow ✅ ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)), evaluation benchmark runner ✅ ([`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)), prediction inspection tooling ✅ ([`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)). Full retraining + benchmark validation on COCO is still in progress: the metric harness is in place, the matching checkpoint is not yet committed.
- - **Phase 2A** ✅: FastAPI backend, lifespan-managed predictor singleton, multipart inference endpoint, structured logging + request IDs, Pydantic schemas, Swagger/OpenAPI docs, health/readiness probe.
- - **Phase 2B** ✅: React 19 + Vite 8 + Tailwind v4 SPA, drag/drop upload UX, live API integration against `POST /v1/captions`, env-driven `VITE_API_BASE`, `AbortController` timeouts, typed `ApiError` classification, polled health badge with auto-recovery, CORS allow-list wired through the backend YAML config.
- - **Phase 2C**: deployment integration (HuggingFace Spaces backend, Vercel-hosted frontend, production CORS allow-list, GitHub Actions CI/CD across both packages).
- - **Phase 3**: Tier-1 multimodal upgrades (BLIP-base / ViT-GPT2 / GIT-base-coco side-by-side comparison demo with per-model BLEU + latency).
- - **Phase 4**: Sentry, Prometheus, DagsHub-hosted MLflow link, Architecture Decision Records (`docs/adr/`).
- - **Future work**: ViT + Transformer fine-tune on COCO; VLM API integration (Anthropic Claude vision) behind a feature flag; VQA endpoint.
-
- Detailed plan: [`docs/restructure-plan.md`](docs/restructure-plan.md).
-
- ### Current capabilities
-
- - Notebook parity preserved: IEEE artefact frozen by SHA-256, four-stage parity audit gates every behavioural change.
- - Typed modular ML package: Pydantic v2 configs, mypy-strict, 37 unit tests passing.
- - Production-style inference API: FastAPI app factory, lifespan-managed `CaptionPredictor` singleton, warmup on boot.
- - Swagger/OpenAPI testing: interactive `/docs` UI for hand-testing every endpoint, raw `/openapi.json` for client codegen.
- - Structured logging: JSON in production, pretty in dev; per-request UUIDs threaded through every log line.
- - End-to-end image upload → caption flow: multipart upload → content-type guard → image decode → predictor → typed response with latency + request ID.
- - End-to-end browser inference workflow: React 19 + Vite 8 SPA under [`frontend/`](frontend/) wired to `POST /v1/captions`; drag/drop or click-to-browse upload, live caption + latency + request ID display.
- - Drag/drop upload UI: JPEG / PNG / WebP, 10 MB cap, keyboard-activatable (`Enter` / `Space`), client-side validation mirrored from the backend so error wording stays consistent.
- - Live frontend-backend integration: typed `ApiError` boundary, `AbortController` timeouts (3 s health / 60 s caption), CORS allow-list aligned with `serve.cors_allowed_origins`.
- - Polled health surface: `StatusBadge` reads `/healthz` every 10 s plus on window focus; recovers automatically without page reload when the backend comes back.
- - Responsive Tailwind v4 inference interface: single-column layout under the `lg` breakpoint, sticky header with live status, modular component split under [`frontend/src/components/`](frontend/src/components/).
- - Typed API communication: SPA consumes the same Pydantic `CaptionResponse` shape the backend emits; caption, `model_version`, `decode_strategy`, `latency_ms`, and `request_id` render directly from the wire payload.
- - Production-style frontend architecture: dedicated [`services/api.js`](frontend/src/services/api.js) boundary, env-driven `VITE_API_BASE` with safe fallback, lint-clean flat ESLint config, static-asset build via `npm run build`.
- - Beam-search decoding: [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) dispatched through `CaptionPredictor` alongside greedy, with length penalty, repetition penalty, and no-repeat n-gram blocking.
- - Multi-metric evaluation: corpus BLEU-1..4 plus CIDEr / METEOR / ROUGE-L under a single runner ([`src/captioning/evaluation/`](src/captioning/evaluation/)), emitted as `metrics.json` per run.
- - Benchmark runner: versioned `results/<run_id>/` artefact contract via [`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py), designed so Phase 3 cross-model comparison can join runs without bespoke parsers.
- - Prediction inspection tooling: [`scripts/inspect_predictions.py`](scripts/inspect_predictions.py) for per-sample sentence-level BLEU / ROUGE-L, length and repetition diagnostics, and failure-flag breakdown.
- - Stabilized training configs: opt-in label smoothing, cosine LR schedule, warmup steps, and dropout-free validation behind explicit flags in [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
- - Reproducible evaluation pipeline: `metrics.json` + `predictions.jsonl` + `diagnostics.jsonl` + `run_meta.json` + `report.md` per run, so any two runs can be diffed mechanically rather than re-typed into a spreadsheet.

  ---

- ## Citation

- If you reference this work in academic writing, please cite the IEEE paper:

- ```bibtex
- @inproceedings{ainarratives,
-   title = {AI Narratives: Bridging Visual Content and Linguistic Expression},
-   booktitle = {Proceedings of the IEEE Conference},
-   publisher = {IEEE},
-   year = {2024},
-   url = {https://ieeexplore.ieee.org/document/10675203},
- }
- ```

  ---

- ## Acknowledgements

- - The model architecture, hyperparameters, and BLEU baseline are from the IEEE-published paper *AI Narratives: Bridging Visual Content and Linguistic Expression*.
- - COCO 2017 captions provided by the [Microsoft COCO project](https://cocodataset.org/).
- - TensorFlow / Keras for the model layers; Pydantic for the configuration system; sacrebleu for evaluation; Ruff, mypy, and pytest for tooling.

- ---

- ## License

- Released under the [MIT License](LICENSE). The IEEE paper itself is published under separate terms.

  ---

- ## Author

- **Apoorv Raj**, AI / ML systems engineer.
- Repository structured by phase; contributions and issues welcome.

+ <h1 align="center">Image Captioning System</h1>

+ <p align="center">
+   <strong>CNN + Transformer image-to-language pipeline, lifted from an IEEE-published research notebook into a typed, tested, full-stack production codebase.</strong>
  </p>

+ <p align="center">
+   <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-3776AB?style=flat-square&logo=python&logoColor=white">
+   <img alt="TensorFlow 2.15" src="https://img.shields.io/badge/TensorFlow-2.15-FF6F00?style=flat-square&logo=tensorflow&logoColor=white">
+   <img alt="FastAPI" src="https://img.shields.io/badge/FastAPI-0.111-009688?style=flat-square&logo=fastapi&logoColor=white">
+   <img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-E92063?style=flat-square&logo=pydantic&logoColor=white">
+   <img alt="React 19" src="https://img.shields.io/badge/React-19-61DAFB?style=flat-square&logo=react&logoColor=black">
+   <img alt="Vite 8" src="https://img.shields.io/badge/Vite-8-646CFF?style=flat-square&logo=vite&logoColor=white">
  </p>

+ <p align="center">
+   <img alt="Ruff" src="https://img.shields.io/badge/lint-ruff-261230?style=flat-square&logo=ruff&logoColor=white">
+   <img alt="mypy strict" src="https://img.shields.io/badge/typed-mypy%20strict-1F5082?style=flat-square">
+   <img alt="Tests" src="https://img.shields.io/badge/tests-90%20passing-brightgreen?style=flat-square">
+   <img alt="Pre-commit" src="https://img.shields.io/badge/pre--commit-enabled-FAB040?style=flat-square&logo=pre-commit&logoColor=white">
+   <img alt="IEEE Published" src="https://img.shields.io/badge/IEEE-published-00629B?style=flat-square&logo=ieee&logoColor=white">
+   <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-blue?style=flat-square">
  </p>

+ <p align="center">
+   A deliberately scoped multimodal-AI showcase that takes a published research notebook and turns it into the kind of codebase a serving team would actually maintain: typed configuration, a structured FastAPI inference service, a polished React SPA, a parity-audit gate against the original notebook, and an honest roadmap that names what is shipped and what is not.
  </p>

  ---

+ ## Status
+
+ > 🚧 **Active build.** The research → modular conversion (Phase 1) is complete and the full inference stack (Phase 2A backend + 2B frontend) is operational end-to-end: a React 19 / Vite 8 SPA posts multipart uploads to `POST /v1/captions`, the FastAPI service returns a typed `CaptionResponse`, and the lifespan-managed `CaptionPredictor` is reused across every request with a warm graph and no per-call TF rebuilds. The IEEE notebook is preserved verbatim and protected by a SHA-256 freeze check. A four-stage parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)) re-implements caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, and the decoder forward pass inline and asserts the modular path is byte-identical (or `tf.allclose`-identical) to the notebook. Phase 1b (training stabilization) shipped beam search, the full corpus metric suite (BLEU-1..4 / CIDEr / METEOR / ROUGE-L), a benchmark runner that emits one machine-readable artefact set per evaluation, and a stabilized training config that gates label smoothing / cosine LR / warmup / dropout-free validation behind ablatable flags. Phase 2C (public deployment) is now in flight. Workstream **D (backend test suite)** is complete: 12 new FastAPI route tests use a duck-typed fake predictor service to cover the full 200 / 400 / 413 / 415 / 422 / 503 contract end-to-end without loading TensorFlow, dropping the backend slice from a cold-start liability to a 0.3-second suite. The remaining workstreams (Dockerfile, HuggingFace Hub weights hosting, HF Spaces deploy, Vercel deploy, production CORS, GitHub Actions CI/CD, runbook) are sequenced in the [Roadmap](#-roadmap) below.

+ > ⚠️ **Caption quality disclaimer.** The weights committed under [`models/v1.0.0/`](models/v1.0.0/) are **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py): the architecture is wired correctly but every weight is randomly initialised. They exist to exercise the serving stack (lifespan, predictor wiring, multipart upload, frontend integration) before a real COCO-trained checkpoint is dropped in. Live captions therefore look like noise today; that is the *intended* state of the bootstrap path, not a regression. See [Current model quality status](#-current-model-quality-status) for what is being done about it.

+ ---

+ ## 📌 What Is This Project?

+ Image Captioning System is a research-to-production conversion of the IEEE paper *"AI Narratives: Bridging Visual Content and Linguistic Expression"*. The original work, a Kaggle notebook training an InceptionV3-encoder + multi-head Transformer-decoder on MS COCO, is preserved verbatim as the canonical research artefact. Around it sits a typed Python package, a FastAPI inference service, and a React SPA that together turn the published model into something a serving team could actually run, version, and reason about.

+ It is **not** a hosted product (yet; Phase 2C is shipping that), and it is **not** a thin Streamlit wrapper around `model.predict`. What this project *is* is a deliberate engineering showcase aimed at hiring teams evaluating ML, multimodal-AI, and backend skills, and at anyone who has ever wondered what it actually takes to lift a research notebook into a codebase the rest of an engineering org can build on. Every architectural decision in this repository is one I can defend in an interview.

  ---

+ ## 🎯 Why It Matters

+ Research notebooks and production ML systems are different artefacts with different audiences. A notebook proves an idea works. A production system has to **survive being maintained**: by people who did not write it, on schedules nobody planned, against inputs the original author never anticipated. The hardest part of an ML career is not getting a model to converge once; it is making the resulting pipeline *legible, typed, testable, deployable, and replaceable* without losing the behaviour the paper claimed.

+ This project demonstrates that conversion end-to-end at a scale one engineer can build and reason about:

+ - **Parity-gated refactor**: the notebook stays byte-stable and a four-stage audit script asserts the modular package reproduces the notebook's behaviour at every behavioural seam.
+ - **Strict typed configuration**: Pydantic v2 with `extra="forbid"` so a typo in a hyperparameter is a load-time error, not a silent training run that produces wrong numbers.
+ - **Lifespan-managed inference**: one warm `CaptionPredictor` shared across every HTTP request, not a graph rebuilt per call.
+ - **Train/serve shared preprocessing**: the same `preprocess_image_tensor` runs in `tf.data` pipelines and at inference, so the bytes that enter the model in training are byte-identical to the bytes that enter it at serve time.
+ - **Stabilized training experiments behind ablatable flags**: every quality intervention is opt-in, so any delta between two runs is attributable to one named change rather than a tangled rewrite.
+ - **Reproducible benchmarking**: every evaluation writes a machine-readable `metrics.json` + `diagnostics.jsonl` set, so two checkpoints (or one checkpoint with two decoders) can be diffed without bespoke parsers.

  ---

+ ## 💡 What This Project Demonstrates
+
+ - Lifting a research notebook into an **installable, typed Python package** (`src/` layout) without breaking the published architecture.
+ - A production-style **FastAPI** inference service with lifespan-managed model loading, structured logging, request-ID propagation, and a typed Pydantic schema for every payload.
+ - A polished **React 19 + Vite 8 + Tailwind v4** SPA with drag-and-drop upload, client-side validation, `AbortController` timeouts, typed `ApiError` classification, and a polled health badge.
+ - **Pydantic v2 strict configuration** with YAML + env-var overrides and `extra="forbid"` to eliminate the silent-defaults failure mode.
+ - **Custom multi-head Transformer decoder** with masked sparse-categorical cross-entropy, masked accuracy, learned (not sinusoidal) positional embeddings, and the IEEE paper's exact dropout / head configuration.
+ - **Beam search decoder** with length normalisation and n-gram repetition suppression alongside greedy, selectable per inference call and per evaluation run.
+ - **Corpus-level metric suite** (BLEU-1..4 via sacrebleu, CIDEr, METEOR, ROUGE-L), emitted as one typed artefact per run.
+ - **Notebook freeze + parity audit**: SHA-256 lock on the IEEE notebook plus a four-stage inline re-implementation that fails CI if the modular path drifts.
+ - **Pre-commit governance**: Ruff, mypy (strict), `nbstripout`, `gitleaks`, line-ending and TOML/YAML hygiene, all enforced before commits land.
+ - **Clean Git workflow** with Conventional Commits and small, reviewable changesets ([`CLAUDE.md`](CLAUDE.md) codifies the contribution rules).
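
As a rough illustration of the beam search bullet above, here is a framework-free sketch of beam search with length normalisation and no-repeat n-gram blocking. It is not the code in `inference/beam.py`: the function names, the default values, and the toy transition table are all assumptions, and the real decoder scores tokens with the Transformer rather than a lookup table.

```python
import math
from typing import Callable, Mapping, Sequence


def has_repeated_ngram(tokens: Sequence[int], n: int) -> bool:
    """True if the final n-gram of `tokens` already occurred earlier."""
    if n <= 0:
        return False
    last = tuple(tokens[-n:])
    return any(tuple(tokens[i:i + n]) == last for i in range(len(tokens) - n))


def beam_search(
    step_log_probs: Callable[[list[int]], Mapping[int, float]],
    start_id: int,
    end_id: int,
    beam_width: int = 3,
    max_len: int = 20,
    length_penalty: float = 0.7,
    no_repeat_ngram_size: int = 2,
) -> list[int]:
    beams = [([start_id], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_p in step_log_probs(tokens).items():
                extended = tokens + [token]
                if has_repeated_ngram(extended, no_repeat_ngram_size):
                    continue  # suppress repeated n-grams
                candidates.append((extended, score + log_p))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses still compete

    def normalised(item):
        tokens, score = item
        return score / (len(tokens) ** length_penalty)  # length normalisation

    return max(finished, key=normalised)[0]


# Toy next-token table keyed by the last token: 0 = start, 3 = end.
TABLE = {
    0: {1: math.log(0.6), 2: math.log(0.4)},
    1: {1: math.log(0.5), 3: math.log(0.5)},
    2: {3: math.log(1.0)},
}


def toy_step(tokens: list[int]) -> Mapping[int, float]:
    return TABLE[tokens[-1]]


print(beam_search(toy_step, start_id=0, end_id=3))
```

In the toy table, the greedy first choice (token 1) leads to lower-probability completions than the path through token 2, which is the kind of case where keeping several hypotheses alive pays off.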
+
+ ---
+
+ ## 🏗️ Architecture
+
+ ```
+ ┌──────────────────────────────────────────────────────┐
+ │                React 19 + Vite 8 SPA                 │
+ │       Tailwind v4 · AbortController · ApiError       │
+ └───────────────────────────┬──────────────────────────┘
+                             │ multipart/form-data
+ ┌───────────────────────────▼──────────────────────────┐
+ │             FastAPI 0.111 (Pydantic v2)              │
+ │  RequestContextMiddleware · /healthz · /v1/captions  │
+ └───────────────────────────┬──────────────────────────┘
+                             │
+ ┌───────────────────────────▼──────────────────────────┐
+ │           PredictorService (anyio thread)            │
+ │          bytes → tensor → predict → caption          │
+ └───────────────────────────┬──────────────────────────┘
+                             │ singleton, warmed in lifespan
+ ┌───────────────────────────▼──────────────────────────┐
+ │            CaptionPredictor (TensorFlow)             │
+ │  InceptionV3 → TF encoder → TF decoder → tokenizer   │
+ └───────────────────────────┬──────────────────────────┘
+                             │
+ ┌───────────────────────────▼──────────────────────────┐
+ │               models/vX.Y.Z/ artefacts               │
+ │          model.h5 · vocab.json (versioned)           │
+ └──────────────────────────────────────────────────────┘
+
+ ┌──────────────────────────────────────────────────────┐
+ │     configs/*.yaml (Pydantic v2, extra="forbid")     │
+ │       drives training, evaluation, AND serving       │
+ └──────────────────────────────────────────────────────┘
+ ```
+
+ ### Model topology

  ```
+ ┌──────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌────────────┐
+ │ Input image  │──▶│ InceptionV3      │──▶│ Transformer      │──▶│ Transformer      │──▶│ Caption    │
+ │ 299×299×3    │   │ encoder          │   │ encoder          │   │ decoder          │   │ string     │
+ └──────────────┘   │ (ImageNet,       │   │ (1 layer,        │   │ (2 layers,       │   └────────────┘
+                    │  frozen)         │   │  1 head)         │   │  8 heads)        │
+                    └──────────────────┘   └──────────────────┘   └──────────────────┘
                              ▼                      ▼                      ▼
+                       [B, 64, 2048]           [B, 64, 512]       [B, T, vocab=15000]
  ```

  ### Components

+ - **CNN encoder**: [`models/encoder_cnn.py`](src/captioning/models/encoder_cnn.py). Pretrained InceptionV3 with the classification head removed; output reshaped to 64 spatial positions × 2048 channels. Weights frozen during training.
+ - **Transformer encoder**: [`models/transformer_encoder.py`](src/captioning/models/transformer_encoder.py). Single layer, one attention head. Projects InceptionV3 features into the decoder's embedding dimension.
+ - **Embeddings**: [`models/embeddings.py`](src/captioning/models/embeddings.py). Sum of token + *learned* positional embeddings, preserved verbatim from the published architecture.
+ - **Transformer decoder**: [`models/transformer_decoder.py`](src/captioning/models/transformer_decoder.py). Causal self-attention over partial captions, cross-attention over image features, feed-forward sub-block. 8 heads, `embedding_dim=512`, dropouts (0.1 / 0.3 / 0.5) preserved from the IEEE configuration.
  - **Captioning model**: [`models/captioning_model.py`](src/captioning/models/captioning_model.py). Custom `train_step` / `test_step` with masked sparse-categorical cross-entropy and masked accuracy.
+ - **Tokenizer**: [`preprocessing/tokenizer.py`](src/captioning/preprocessing/tokenizer.py). `CaptionTokenizer` wraps `tf.keras.layers.TextVectorization`; persists vocabulary as both pickle (notebook-compatible) and JSON sidecar.
+ - **Inference**: [`inference/predictor.py`](src/captioning/inference/predictor.py). `CaptionPredictor.from_artifacts(weights, vocab, config)` loads everything once at boot, exposes `predict_path(...)` and `predict_tensor(...)` for stateless calls, and `warmup()` to amortise first-request latency.
+ - **Configuration**: [`config/schema.py`](src/captioning/config/schema.py). Pydantic v2 (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`); strict so typos in YAML or env vars become load-time errors.
134
+
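
The masked loss and masked accuracy named in the captioning-model bullet can be illustrated without TensorFlow. This is a minimal plain-Python sketch, not the repository's `train_step`: it assumes padding uses token id 0 and that the model already emits per-step probabilities, and `masked_loss_and_accuracy` is a hypothetical name.

```python
import math

PAD_ID = 0  # assumption: token id 0 marks padded positions


def masked_loss_and_accuracy(targets, probs, pad_id=PAD_ID):
    """Average cross-entropy and accuracy over non-padded positions only.

    targets: list of T token ids.
    probs:   list of T rows, each a probability distribution over the vocabulary.
    """
    total_loss = 0.0
    correct = 0
    counted = 0
    for target, row in zip(targets, probs):
        if target == pad_id:
            continue  # padded steps contribute to neither loss nor accuracy
        total_loss += -math.log(max(row[target], 1e-12))
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == target)
        counted += 1
    if counted == 0:
        return 0.0, 0.0
    return total_loss / counted, correct / counted


# Two real tokens followed by one padded position.
loss, acc = masked_loss_and_accuracy(
    [2, 1, 0],
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.9, 0.05, 0.05]],
)
```

Dividing by the number of real tokens rather than the padded length keeps short captions from looking artificially easy.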
135
+ **Why a monolith on a single process?** Splitting training, evaluation, and serving across services would burn the project's budget on Kubernetes manifests instead of the things a reviewer can actually click. A layered package + one FastAPI app captures the same separation-of-concerns thinking with a tenth of the operational surface area, and the seams are placed so pulling serving into its own container (Phase 2C) is a deployment change, not a refactor.
136
+
137
+ **Why TensorFlow 2.15 specifically?** TF 2.16 ships Keras 3 by default and silently breaks `TextVectorization` save/load, so the project's `tensorflow-cpu==2.15.0` pin is deliberate. Documented in [`requirements.txt`](requirements.txt) and in the engineering-decisions section below.
138
 
139
  ---
140
 
141
+ ## 🖼️ Sample outputs
142
 
143
  | Image | Generated caption |
144
  |---|---|
145
  | ![](https://github.com/user-attachments/assets/64e8412b-1d49-404c-a5b2-1da121b224e2) | *a man is standing on a beach with a surfboard* |
146
  | ![](https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd) | *a man riding a motorcycle on a street* |
147
 
148
+ Outputs above are from the IEEE notebook; the modular pipeline reproduces these via the parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)). Live captions from the current bootstrap weights will *not* match; see [Current model quality status](#-current-model-quality-status).
149
+
150
+ ---
151
+
152
+ ## 📚 Research backing
153
+
154
+ The model architecture and the BLEU-4 ~24 baseline below come from the IEEE paper and its accompanying notebook:
155
+
156
+ - **Paper:** [AI Narratives: Bridging Visual Content and Linguistic Expression](https://ieeexplore.ieee.org/document/10675203) (IEEE)
157
+ - **Original notebook:** [Kaggle: image-captioning-using-dl](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
+ - **Frozen artefact in this repo:** [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](notebooks/01_ieee_inceptionv3_transformer.ipynb) (byte-stable; pre-commit + CI enforce its SHA-256).
159
+
160
+ The notebook is preserved verbatim as the canonical research artefact. Improvements happen in the modular package; the notebook does not.
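
The freeze check described above amounts to comparing a file digest against a locked value. A minimal sketch of that idea follows; it is not the repository's actual hook. The helper names are hypothetical, and it assumes the lock file's first whitespace-separated token is the hex digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large notebooks are not read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(notebook: Path, lock_file: Path) -> bool:
    """Return True when the notebook's digest matches the locked digest."""
    # Assumption: the lock file starts with the hex digest (sha256sum-style layout).
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    return sha256_of(notebook) == expected
```

A CI job or pre-commit hook could call `notebook_is_frozen(...)` and exit non-zero on `False`, which turns any edit to the notebook into a visible failure.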
161
 
162
  ---
163
 
164
+ ## 📊 Performance
165
 
166
  | Metric | Value | Source |
167
  |---|---|---|
168
+ | BLEU-4 (IEEE baseline) | ~24 | Reported in the IEEE paper / Kaggle notebook |
169
+ | Vocabulary size | 15,000 tokens | `TextVectorization` adapt over preprocessed COCO captions |
170
  | Training set | ~120k captions sampled from COCO 2017 | `data.sample_size` in [`configs/base.yaml`](configs/base.yaml) |
171
  | Image resolution | 299 × 299 (InceptionV3) | [`preprocessing/image.py`](src/captioning/preprocessing/image.py) |
172
  | Max caption length | 40 tokens | `model.max_length` in [`configs/base.yaml`](configs/base.yaml) |
173
+ | Backend test suite | 12 tests · 0.3 s · no TF loaded | [`backend/app/tests/`](backend/app/tests/) |
174
+ | Full suite | **90 tests passing** | `pytest` (unit + backend + parity) |
175
 
176
+ > Re-training on the modular pipeline is a Phase 1b deliverable; once a fresh checkpoint exists, this table will publish corpus BLEU-1..4, CIDEr, METEOR, and ROUGE-L (the harnesses already exist under [`evaluation/`](src/captioning/evaluation/)).
177
 
178
  ---
179
 
180
+ ## ⚠️ Current model quality status
181
 
182
  The frontend, backend, and inference pipeline are operational end-to-end against the modular package, but **caption quality from the current modular pipeline is still below expectations**. The IEEE notebook reported BLEU-4 ~24; a freshly trained checkpoint produced by the modular trainer has not yet reproduced that figure on COCO. The serving stack is production-style and ready for a real checkpoint; what is missing is the checkpoint itself.
183
 
 
188
  - **Decoding improvements**: replacing greedy-only generation with beam search, repetition controls, and length normalisation.
  - **Reproducible benchmarking**: emitting one consistent artefact set per evaluation run so any two runs (or any two models) can be diffed without bespoke parsing per checkpoint.
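
To make the decoding item concrete, here is a small beam search with length normalisation over a toy next-token table. It is a sketch under stated assumptions, not the code in `inference/beam.py`: `beam_search`, `toy_step`, the `alpha` exponent, and the `<end>` token name are all hypothetical, and repetition controls are left out.

```python
import math


def beam_search(step_fn, beam_width=3, max_len=10, alpha=0.7, end_token="<end>"):
    """step_fn(prefix) returns {token: probability} for the next position."""
    beams = [([], 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], logp + math.log(prob)))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:
            # Hypotheses that emit the end token leave the active beam.
            (finished if tokens[-1] == end_token else beams).append((tokens, logp))
        if not beams:
            break
    finished.extend(beams)  # keep hypotheses that were cut off at max_len

    def normalised(item):
        tokens, logp = item
        return logp / (max(len(tokens), 1) ** alpha)  # length normalisation

    return max(finished, key=normalised)[0]


def toy_step(prefix):
    """Toy distribution where the locally best first token leads to a worse caption."""
    if not prefix:
        return {"a": 0.6, "the": 0.4}
    if prefix[-1] == "a":
        return {"dog": 0.35, "cat": 0.35, "<end>": 0.3}
    if prefix[-1] == "the":
        return {"dog": 0.9, "<end>": 0.1}
    return {"<end>": 1.0}


greedy_caption = beam_search(toy_step, beam_width=1)  # behaves like argmax-per-step
beam_caption = beam_search(toy_step, beam_width=2)
```

With a beam width of 1 the search commits to "a" and ends at a sequence with probability 0.21, while a width of 2 keeps "the" alive and finds the 0.36 sequence. That is the sequence-level effect greedy decoding cannot capture.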
190
 
191
+ The weights currently committed under [`models/v1.0.0/`](models/v1.0.0/) are the **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py). Captions returned by the live API today will look like noise; that is the *intended* state of the bootstrap path, not a regression. Poor caption quality at this stage is expected until a properly COCO-trained checkpoint replaces those files.
192
 
193
+ This gap is being addressed through the **stabilized training workflow** at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml), which gates convergence-stability primitives behind explicit, ablatable flags rather than rewriting the baseline.
194
 
195
  ### Accuracy investigation (ongoing)
196
 
197
+ - **Greedy decoding limited caption quality and diversity.** Argmax-per-step routinely picked the locally-most-probable token regardless of how that affected the overall sequence likelihood, biasing outputs toward a small "safe captions" basin. Beam-search infrastructure now lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) and dispatches through `CaptionPredictor` alongside the existing greedy path; decode strategy is selectable per inference call and per evaluation run.
198
+ - **BLEU-only evaluation hid behaviour the score did not reflect.** CIDEr, METEOR, and ROUGE-L are implemented under [`src/captioning/evaluation/`](src/captioning/evaluation/) and run through the same corpus-level runner that already produces BLEU-1..4. Every evaluation now emits the full metric set in a single `metrics.json`.
 
 
199
  - **Validation-time dropout parity quirks** inherited from the notebook (`compute_loss_and_acc` ignoring its `training` argument, so dropout stayed active during validation) were identified during the parity audit. They are now gated behind an explicit config flag (`train.honour_training_flag_in_test_step`) so notebook parity is preserved by default and the conventional dropout-free validation path is opt-in via [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
200
+ - **Training stabilization experiments** are introduced as opt-in flags so they can be ablated cleanly rather than entangled with the baseline:
201
  - label smoothing (`train.label_smoothing`),
202
  - cosine LR schedule (`train.lr_schedule: cosine`),
203
  - warmup steps (`train.warmup_steps`),
204
  - dropout-free validation path (`train.honour_training_flag_in_test_step`).
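
As an illustration of what the warmup and cosine flags control, the sketch below computes a learning rate per step with linear warmup followed by cosine decay. The function name, the linear warmup shape, and the decay-to-zero endpoint are assumptions for illustration; the repository's schedule may differ in those details.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay towards zero at total_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly; step is 0-indexed, so the last warmup step hits base_lr.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Illustrative numbers only: 10 warmup steps inside a 110-step run.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=10, total_steps=110) for s in range(111)]
```

Because each behaviour is a pure function of the step, turning warmup off (`warmup_steps=0`) or swapping the decay shape can be tested in isolation. That matches the ablation-friendly intent of the flags.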
 
205
 
206
+ A complete experimental training config (not a thin override) lives at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml). It is identical to [`configs/base.yaml`](configs/base.yaml) except for those four flags, so any quality delta between the two runs is attributable to those flags alone.
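
The "only these four flags differ" property can be checked mechanically rather than by eye. The sketch below diffs two nested config dictionaries; `flatten` and `config_delta` are hypothetical helpers, and the values in `BASE` and `STABILIZED` are invented for illustration (only the flag names come from this README).

```python
def flatten(config, prefix=""):
    """Turn nested dicts into {"train.warmup_steps": value} style keys."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def config_delta(base, experiment):
    """Return {dotted_key: (base_value, experiment_value)} for every differing key."""
    a, b = flatten(base), flatten(experiment)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Invented values: stand-ins for the parsed YAML files.
BASE = {"train": {"batch_size": 64, "label_smoothing": 0.0, "lr_schedule": "constant",
                  "warmup_steps": 0, "honour_training_flag_in_test_step": False}}
STABILIZED = {"train": {"batch_size": 64, "label_smoothing": 0.1, "lr_schedule": "cosine",
                        "warmup_steps": 1000, "honour_training_flag_in_test_step": True}}

delta = config_delta(BASE, STABILIZED)
```

Asserting that `delta` contains exactly the four expected keys would catch an accidental fifth difference before it contaminates an ablation.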
207
+
208
+ ---
209
+
210
+ ## 🛠️ Tech Stack
211
+
212
+ | Layer | Technologies |
213
+ |---|---|
214
+ | **Core ML** | Python 3.10–3.12, TensorFlow-CPU 2.15.0 (pinned), NumPy, Pillow |
215
+ | **Model** | InceptionV3 encoder (frozen) + custom multi-head Transformer decoder |
216
+ | **Backend** | FastAPI 0.111, Pydantic v2, `pydantic-settings` 2.x, structlog 24, anyio 4 |
217
+ | **Frontend** | React 19, Vite 8, Tailwind v4, ESLint flat config |
218
+ | **Evaluation** | sacrebleu, custom CIDEr / METEOR / ROUGE-L implementations |
219
+ | **Tooling** | Ruff (lint + format), mypy (strict), pytest 8, pre-commit, nbstripout, gitleaks |
220
+ | **Infra (planned, Phase 2C)** | HuggingFace Hub (weights), HuggingFace Spaces (backend), Vercel (frontend), GitHub Actions (CI/CD) |
221
 
222
  ---
223
 
224
+ ## 📁 Repository Structure
225
 
226
  ```
227
  image-captioning-system/

  ├── backend/                              # Phase 2A: FastAPI inference service
  │   └── app/
  │       ├── main.py                       # App factory + lifespan-managed predictor singleton
+ │       ├── api/routes.py                 # Thin HTTP: /healthz, /v1/captions
+ │       ├── core/                         # BackendSettings, structlog setup, RequestContextMiddleware
+ │       ├── schemas/                      # Pydantic request/response models
+ │       ├── services/predictor_service.py # bytes → caption + latency (anyio thread offload)
+ │       ├── utils/image.py                # Content-type allow-list + ImageDecodeError
+ │       └── tests/                        # Phase 2C WS-D: 12 route tests, no TF loaded
  │
  ├── frontend/                             # Phase 2B: React 19 + Vite 8 + Tailwind v4 SPA
+ │   ├── vite.config.js · eslint.config.js · package.json · .env.example
  │   └── src/
+ │       ├── main.jsx · App.jsx · index.css
+ │       ├── services/api.js               # checkHealth / captionImage: AbortController + typed ApiError
  │       └── components/
+ │           ├── Header.jsx · StatusBadge.jsx   # Sticky brand bar + 10s health poller
+ │           ├── UploadZone.jsx · ImagePreview.jsx
+ │           └── CaptionResult.jsx · ErrorBanner.jsx · Spinner.jsx
  │
  ├── configs/
+ │   ├── base.yaml                         # IEEE hyperparameters (notebook cell 6 mirror)
  │   └── train/
+ │       ├── debug.yaml                    # CI smoke override (1 epoch, 64 captions)
+ │       └── stabilized.yaml               # Phase 1b stability experiment (4 ablatable flags)
  │
  ├── scripts/
  │   ├── train.py · evaluate.py · predict.py
+ │   ├── inspect_predictions.py            # Per-sample diagnostics + diagnostics.jsonl
  │   ├── bootstrap_dev_artifacts.py        # Smoke-test artefacts so the API can boot pre-training
+ │   └── notebook_module_audit.py          # 4-stage parity gate vs. notebook
  │
+ ├── tests/unit/                           # 78 unit tests (parity, tokenizer, eval, splits, …)
+ ├── docs/                                 # restructure-plan · PHASE_0_NOTES · PHASE_1_NOTES · STABILIZED_TRAINING_RUNBOOK
  ├── pyproject.toml · requirements*.txt · Makefile
  ├── .pre-commit-config.yaml · .python-version · .env.example
+ ├── .paper-notebook.sha256                # Locked notebook hash for the freeze check
+ ├── CLAUDE.md                             # Contribution + commit governance
  └── README.md
283
  ```
284
 
285
  ---
286
 
287
+ ## 🚀 Quick Start
288
 
289
+ ### Prerequisites
290
 
291
+ - Python **3.10 – 3.12** (TensorFlow 2.15 has no 3.13 wheels)
292
+ - Node **20+**
293
+ - Git
294
+
295
+ ### Backend
296
 
297
  ```powershell
298
+ # PowerShell (Windows)
299
  py -3.10 -m venv .venv
300
  .venv\Scripts\activate
301
  pip install -r requirements-dev.txt -r requirements-eval.txt
 
303
  pre-commit install
304
  ```
305
 
 
 
306
  ```bash
307
+ # bash (Linux / macOS)
308
  python3.10 -m venv .venv
309
  source .venv/bin/activate
310
  pip install -r requirements-dev.txt -r requirements-eval.txt
 
312
  pre-commit install
313
  ```
314
 
315
+ Boot the API:
 
 
 
 
 
 
 
 
 
 
 
 
316
 
317
  ```bash
318
+ uvicorn --app-dir backend app.main:app --host 0.0.0.0 --port 8000
 
 
 
 
319
  ```
320
 
321
+ Interactive Swagger UI is live at **http://localhost:8000/docs**; raw OpenAPI 3.1 at **http://localhost:8000/openapi.json**.
 
 
 
 
322
 
323
+ ### Frontend
324
 
325
  ```bash
326
+ cd frontend
327
+ npm install
328
+ npm run dev
 
 
 
329
  ```
330
 
331
+ The SPA is live at **http://localhost:5173** (Vite picks the next free port if 5173 is busy). `VITE_API_BASE` (see [`frontend/.env.example`](frontend/.env.example)) points it at any backend origin; absent the env var, it falls back to `http://127.0.0.1:8000`.
 
 
 
 
332
 
333
+ ### Tests
334
 
335
+ ```bash
336
+ pytest -q # All 90 tests (unit + backend + parity)
337
+ pytest backend/app/tests/ -v # Backend route tests only (0.3 s, no TF loaded)
338
+ make freeze-paper-notebook # Asserts the IEEE notebook SHA-256 has not changed
 
 
 
 
 
 
 
 
 
339
  ```
340
 
341
+ ### One-shot caption (CLI)
342
 
343
  ```bash
344
  python -m scripts.predict \
 
348
  --image samples/photo.jpg
349
  ```
350
 
351
+ ### One-shot caption (HTTP)
 
 
352
 
353
  ```bash
354
+ curl -X POST http://localhost:8000/v1/captions -F "image=@samples/photo.jpg"
 
 
 
 
 
 
 
 
355
  ```
356
 
357
+ ### Reproduce training
 
 
 
 
358
 
359
  ```bash
360
+ python -m scripts.train --config configs/base.yaml
361
+ # Or with the stabilization experiment flags enabled:
362
+ python -m scripts.train --config configs/base.yaml --override configs/train/stabilized.yaml
363
+ # Or a 64-caption CI smoke run:
364
+ python -m scripts.train --config configs/base.yaml --override configs/train/debug.yaml
365
  ```
366
 
367
+ Outputs (`weights.h5`, `vocab.pkl` + `vocab.json` sidecar, `history.json`, `training_log.csv`) land under `outputs/runs/latest/` by default.
368
+
369
+ `make help` lists every available command (lint, format, type-check, test, train, serve, evaluate, predict, Docker, freeze-paper-notebook, …).
370
 
371
  ---
372
 
373
+ ## 🌐 FastAPI backend (Phase 2A)
374
 
375
+ Phase 2A delivers a production-style inference service rather than a thin demo wrapper:
376
 
377
  - **App factory + lifespan**: [`backend/app/main.py`](backend/app/main.py). `create_app()` builds the FastAPI instance; the lifespan loads the YAML `AppConfig`, instantiates a `CaptionPredictor`, calls `warmup()`, and stashes a `PredictorService` singleton on `app.state` so every request reuses one warm model.
  - **Routes**: [`backend/app/api/routes.py`](backend/app/api/routes.py). Intentionally thin: validate inputs, delegate, shape the response. No TF imports leak into the HTTP layer.
+ - **Service layer**: [`backend/app/services/predictor_service.py`](backend/app/services/predictor_service.py). Wraps the predictor, decodes uploaded bytes off the event loop via `anyio.to_thread.run_sync`, measures per-request latency, returns `(caption, latency_ms)`.
+ - **Schemas**: [`backend/app/schemas/caption.py`](backend/app/schemas/caption.py). Pydantic v2 (`CaptionResponse`, `HealthResponse`, `ErrorResponse`); every payload that crosses the wire is typed and OpenAPI-documented.
+ - **Backend settings**: [`backend/app/core/config.py`](backend/app/core/config.py). Separate `BackendSettings` (env-overridable: weights path, tokenizer dir, model version, warmup toggle) layered on top of the research-side `AppConfig`. Research hyperparameters and serving knobs change on different cadences and live in different settings objects.
  - **Structured logging + request IDs**: [`backend/app/core/logging.py`](backend/app/core/logging.py). `RequestContextMiddleware` stamps each request with a UUID; `structlog` carries it through every log line so a single failed caption can be traced end-to-end.
+ - **Image safety**: [`backend/app/utils/image.py`](backend/app/utils/image.py). Content-type allow-list (JPEG / PNG / WebP / BMP), explicit `ImageDecodeError` so malformed bytes produce a clean 422 rather than a 500.
 
 
384
 
385
  | Method | Path | Purpose |
386
  |---|---|---|
 
389
  | `GET` | `/docs` | Interactive Swagger UI, auto-generated from the Pydantic schemas. |
390
  | `GET` | `/openapi.json` | Raw OpenAPI 3.1 spec for client codegen. |
391
 
392
+ `POST /v1/captions` enforces input validation at the boundary: **415** on disallowed content types, **413** on oversized uploads (`serve.max_upload_bytes`), **422** on undecodable image bytes, **400** on empty uploads, **503** while the predictor is still loading during a rolling restart. All six status codes in the contract (**200** plus these five error codes) are covered by the [`backend/app/tests/`](backend/app/tests/) suite added in Phase 2C WS-D.
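
The status-code contract above can be read as a small decision function. The sketch below is framework-free and hypothetical: `status_for_upload`, the MIME strings, and in particular the order of the checks are assumptions made for illustration, not a description of `routes.py`.

```python
# Assumption: these MIME strings correspond to the JPEG / PNG / WebP / BMP allow-list.
ALLOWED_TYPES = {"image/jpeg", "image/png", "image/webp", "image/bmp"}


def status_for_upload(content_type, payload, max_upload_bytes, predictor_ready, decodes):
    """Map one upload to the HTTP status the contract describes.

    decodes: callable(bytes) -> bool, standing in for real image decoding.
    """
    if not predictor_ready:
        return 503  # predictor still loading
    if content_type not in ALLOWED_TYPES:
        return 415  # disallowed content type
    if len(payload) == 0:
        return 400  # empty upload
    if len(payload) > max_upload_bytes:
        return 413  # oversized upload
    if not decodes(payload):
        return 422  # bytes are not a decodable image
    return 200


def looks_like_jpeg(data: bytes) -> bool:
    """Toy decoder: only checks the JPEG start-of-image marker."""
    return data.startswith(b"\xff\xd8")


ok = status_for_upload("image/jpeg", b"\xff\xd8rest", 1024, True, looks_like_jpeg)
```

Keeping the decision in one pure function is also what makes a fast test suite possible: every branch can be exercised without a model or a running server.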
 
 
 
 
 
 
 
 
 
 
393
 
394
  ---
395
 
396
+ ## 🎨 Frontend UI (Phase 2B)
397
 
398
+ Phase 2B ships a single-page inference UI under [`frontend/`](frontend/), not a styled demo. The split mirrors the backend's separation between transport, service, and presentation:
399
 
400
+ - **Application shell**: [`frontend/src/App.jsx`](frontend/src/App.jsx). Owns the request lifecycle (selected file → preview → generate → result). The preview `URL.createObjectURL` is `useMemo`-derived and revoked through an effect cleanup so previews never leak across uploads. Four `useState` slots (`file`, `result`, `error`, `loading`) cover every UI state: no Redux, no React Query, no context.
+ - **API service layer**: [`frontend/src/services/api.js`](frontend/src/services/api.js). Single boundary for every backend call. Reads `import.meta.env.VITE_API_BASE` once at module load (falls back to `http://127.0.0.1:8000`), wraps `fetch` with `AbortController`-driven timeouts (3 s for `/healthz`, 60 s for `/v1/captions`), and classifies failures into `timeout` / `network` / `http` / `unknown` kinds on a typed `ApiError`.
+ - **Upload zone**: [`frontend/src/components/UploadZone.jsx`](frontend/src/components/UploadZone.jsx). Drag/drop + click-to-browse + keyboard activation. Validates content-type (JPEG / PNG / WebP) and size (10 MB) before the file ever touches the network; invalid uploads are rejected client-side with the same wording the backend would have returned.
+ - **Status badge**: [`frontend/src/components/StatusBadge.jsx`](frontend/src/components/StatusBadge.jsx). Polls `/healthz` every 10 seconds and on window focus, runs a three-state machine (`checking` / `online` / `offline`), recovers automatically when the backend comes back.
  - **Error banner**: [`frontend/src/components/ErrorBanner.jsx`](frontend/src/components/ErrorBanner.jsx). Single surface for every failure class. Reads `ApiError.message` so the user sees "Cannot reach backend" or "Request timed out" instead of a raw browser error.
+ - **Caption result**: [`frontend/src/components/CaptionResult.jsx`](frontend/src/components/CaptionResult.jsx). Consumes the backend's typed `CaptionResponse` directly: caption text plus model version, decode strategy, latency, and the request ID echoed from the `x-request-id` header.
 
 
406
 
407
  ```
408
  ┌──────────────┐  drag/drop  ┌─────────────┐  validate  ┌──────────────┐

               │ typed CaptionResponse / ApiError
               ▼
    ┌──────────────────────┐
+   │ CaptionResult /      │
    │ ErrorBanner          │
    └──────────────────────┘
422
  ```
423
 
424
+ Frontend and backend are deployed independently. The SPA only knows the backend's origin via `VITE_API_BASE`; the backend only trusts SPAs whose origin appears in `serve.cors_allowed_origins`. Dev origins are pre-allowed in [`configs/base.yaml`](configs/base.yaml); production origins join the same list at deploy time (Phase 2C WS-F). No shared build, no shared runtime: only the typed Pydantic schemas in [`backend/app/schemas/caption.py`](backend/app/schemas/caption.py) cross the wire.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
425
 
426
  ---
427
 
428
+ ## ⚙️ Configuration system
429
 
430
+ Hyperparameters are not globals. They live in YAML validated by Pydantic v2:
431
 
432
  ```yaml
433
  # configs/base.yaml: mirrors the IEEE notebook cell 6 verbatim
 
453
  Three load-time guarantees:
454
 
455
  1. **Type validation.** `batch_size: "64"` (string instead of int) raises a `ValidationError` pointing at the field, not a downstream tensor-shape error.
+ 2. **No silent typos.** `extra="forbid"` rejects unknown keys. A typo in an ML hyperparameter that silently falls back to a default is the worst failure mode, and `extra="forbid"` eliminates it.
  3. **Env overrides.** `CAPTIONING__TRAIN__BATCH_SIZE=32` overrides at any nesting depth, which is useful for CI smoke tests, ablations, and serve-time tuning without rebuilding images.
458
 
459
+ Schema in [`src/captioning/config/schema.py`](src/captioning/config/schema.py); loader in [`src/captioning/config/loader.py`](src/captioning/config/loader.py).
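
How a double-underscore variable such as `CAPTIONING__TRAIN__BATCH_SIZE` can address a nested key is easiest to see in a few lines. The sketch below is a simplified stand-in for the real loader, not its implementation: `apply_env_overrides` is a hypothetical name, values are left as raw strings, and no validation is performed.

```python
import copy


def apply_env_overrides(config, environ, prefix="CAPTIONING__"):
    """Return a copy of config with PREFIX__A__B=value applied at config["a"]["b"]."""
    merged = copy.deepcopy(config)  # never mutate the caller's config
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        path = [part.lower() for part in name[len(prefix):].split("__")]
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})  # descend, creating levels as needed
        node[path[-1]] = raw  # raw string; typing is left to a schema layer
    return merged


base = {"train": {"batch_size": 64}}
overridden = apply_env_overrides(
    base, {"CAPTIONING__TRAIN__BATCH_SIZE": "32", "PATH": "/usr/bin"}
)
```

In practice `environ` would be `os.environ`. Passing it as an argument keeps the function easy to test and makes the override source explicit.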
460
 
461
  ---
462
 
463
+ ## 🧪 Testing & code quality
464
 
465
  ```bash
466
+ make test # pytest: 90/90 (unit + backend route tests + parity)
467
  make lint # Ruff lint + format check
468
  make typecheck # mypy strict on src/captioning + scripts
469
  make pre-commit # All hooks across all files
 
473
  | Layer | Tool | Status |
  |---|---|---|
  | Lint + format | [Ruff](https://docs.astral.sh/ruff/) (replaces black + isort + flake8) | ✅ clean |
+ | Type-check | [mypy](https://mypy.readthedocs.io/) with `pandas-stubs`, `types-PyYAML`, `types-requests` | ✅ 0 errors |
+ | Tests | pytest + pytest-cov + pytest-asyncio | ✅ 90 passing |
  | Notebook hygiene | [`nbstripout`](https://github.com/kynan/nbstripout) (pre-commit) | ✅ outputs stripped on commit |
  | Secret scanning | [`gitleaks`](https://github.com/gitleaks/gitleaks) (pre-commit) | ✅ enabled |
+ | Notebook integrity | SHA-256 freeze via [`make freeze-paper-notebook`](Makefile) | ✅ locked |
  | Parity audit | [`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py) (4 stages) | ✅ all passing |
482
 
483
  The parity audit re-implements four notebook stages inline (caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, decoder forward pass) and asserts the modular path produces byte-identical (or `tf.allclose`-identical) output. It is the contract that gates any behavioural improvement.
484
 
485
+ The backend test suite ([`backend/app/tests/`](backend/app/tests/)) introduced in Phase 2C WS-D uses a duck-typed `FakePredictorService` to exercise every status code in the `/v1/captions` contract (200 / 400 / 413 / 415 / 422 / 503), plus the `/healthz` readiness flip and `x-request-id` propagation, all without loading TensorFlow. The full backend slice runs in **0.3 seconds**.
486
+
487
  ---
488
 
489
+ ## 🗺️ Roadmap
490
+
491
+ ### Phase 0: Bootstrap ✅
+
+ - [x] **0A**: Repo scaffolding, `pyproject.toml`, Makefile, Conventional Commits
+ - [x] **0B**: Pre-commit hooks (Ruff, mypy, nbstripout, gitleaks, line-ending + TOML/YAML hygiene)
+ - [x] **0C**: Notebook freeze policy + `.paper-notebook.sha256` SHA-256 lock
+ - [x] **0D**: Pinned dependency surface (`requirements*.txt` + `pyproject.toml` extras: `hf`, `eval`, `mlflow`, `dev`)
497
+
498
+ ### Phase 1: Modularisation ✅
+
+ - [x] **1A**: Notebook → installable `captioning` package (`src/` layout)
+ - [x] **1B**: Pydantic v2 strict config (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`) with YAML loader + env-var overrides
+ - [x] **1C**: Preprocessing modules (`caption.py`, `image.py`, `tokenizer.py`, `augmentation.py`), shared between training and serving
+ - [x] **1D**: Data pipeline (`coco.py`, `splits.py`, `pipeline.py`) with seeded sampling
+ - [x] **1E**: Model factory (`encoder_cnn.py`, `transformer_encoder.py`, `embeddings.py`, `transformer_decoder.py`, `captioning_model.py`, `factory.py`)
+ - [x] **1F**: Training loop (`losses.py`, `callbacks.py`, `trainer.py`) with structured logging + history serialisation
+ - [x] **1G**: Greedy inference (`predictor.py`, `image_loader.py`, `greedy.py`) with lifespan-friendly `from_artifacts(...)` + `warmup()`
+ - [x] **1H**: Notebook parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)), 4 stages, byte/tensor-identical
+ - [x] **1I**: Unit test suite (parity, tokenizer, evaluation, splits, hashing, image preprocessing, caption preprocessing)
509
+
510
+ ### Phase 1b: Training stabilization ✅ (training validation in progress)
+
+ - [x] **1b-A**: Beam-search decoder ([`inference/beam.py`](src/captioning/inference/beam.py)) with length normalisation + n-gram repetition suppression, selectable per call/run
+ - [x] **1b-B**: CIDEr implementation ([`evaluation/cider.py`](src/captioning/evaluation/cider.py))
+ - [x] **1b-C**: METEOR implementation ([`evaluation/meteor.py`](src/captioning/evaluation/meteor.py))
+ - [x] **1b-D**: ROUGE-L implementation ([`evaluation/rouge.py`](src/captioning/evaluation/rouge.py))
+ - [x] **1b-E**: Benchmark runner ([`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)) emitting one `metrics.json` + `diagnostics.jsonl` per run
+ - [x] **1b-F**: Per-sample inspection tool ([`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)) reporting sentence-level BLEU/ROUGE, length, longest repeated-token run, failure flags
+ - [x] **1b-G**: Stabilization config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) with label smoothing, cosine LR, warmup, dropout-free validation, all ablatable
+ - [x] **1b-H**: Stabilized training runbook ([`docs/STABILIZED_TRAINING_RUNBOOK.md`](docs/STABILIZED_TRAINING_RUNBOOK.md))
+ - [ ] **1b-I**: Fresh stabilized COCO-trained checkpoint committed to [`models/`](models/) (under a bumped `vX.Y.Z/`)
+ - [ ] **1b-J**: Headline numbers (BLEU-1..4, CIDEr, METEOR, ROUGE-L) published in [Performance](#-performance)
522
+
523
+ ### Phase 2A: FastAPI inference service ✅
+
+ - [x] **2A-1**: App factory + lifespan-managed `CaptionPredictor` singleton with `warmup()` on boot
+ - [x] **2A-2**: Thin `/healthz` and `POST /v1/captions` routes with full status-code contract (200 / 400 / 413 / 415 / 422 / 503)
+ - [x] **2A-3**: Pydantic v2 schemas (`CaptionResponse`, `HealthResponse`, `ErrorResponse`) with auto-generated Swagger + OpenAPI 3.1
+ - [x] **2A-4**: `PredictorService` with `anyio.to_thread.run_sync` offload so TF inference never blocks the event loop
+ - [x] **2A-5**: Structured logging (`structlog`) + `RequestContextMiddleware` propagating `x-request-id` across log lines
+ - [x] **2A-6**: `BackendSettings` separated from research `AppConfig` (different change cadences, different env prefixes)
+ - [x] **2A-7**: Bootstrap dev artefacts script so the API boots before training has produced real weights
532
+
533
+ ### Phase 2B: Frontend SPA ✅
+
+ - [x] **2B-1**: React 19 + Vite 8 + Tailwind v4 scaffolding, flat ESLint config with `eslint-plugin-react-hooks` + `eslint-plugin-react-refresh`
+ - [x] **2B-2**: Drag/drop + click-to-browse upload zone with keyboard activation and client-side content-type + size validation
+ - [x] **2B-3**: `services/api.js` boundary: `VITE_API_BASE` env, `AbortController` timeouts (3 s health / 60 s caption), typed `ApiError` classification
+ - [x] **2B-4**: Polled `/healthz` status badge with three-state machine, window-focus refetch, and automatic recovery
+ - [x] **2B-5**: Typed `CaptionResponse` rendering (caption, model version, decode strategy, latency, request ID) with copy-to-clipboard
+ - [x] **2B-6**: Single `ErrorBanner` surface mapping every `ApiError.kind` to actionable copy
+ - [x] **2B-7**: CORS allow-list wired through backend YAML (`serve.cors_allowed_origins`), dev origins pre-allowed
542
+
543
+ ### Phase 2C: Public deployment 🚧 (in progress)
+
+ - [ ] **WS-A**: Backend containerisation: multi-stage `Dockerfile` (python:3.11-slim, non-root, EXPOSE 7860, HEALTHCHECK) + `.dockerignore` + `.env.example`
+ - [ ] **WS-A4**: Lifespan integration with HuggingFace Hub: extend `BackendSettings` with `weights_hub_repo` / `weights_hub_revision`, call `huggingface_hub.snapshot_download` on startup when set
+ - [ ] **WS-B**: Upload trained weights + tokenizer to a HuggingFace Hub model repo
+ - [ ] **WS-C**: First manual deploy to a HuggingFace Space (Docker SDK, cpu-basic, port 7860, single worker)
+ - [x] **WS-D**: **Backend test suite** ([`backend/app/tests/`](backend/app/tests/)): 12 route tests covering the full `/healthz` + `/v1/captions` contract (200 / 400 / 413 / 415 / 422 / 503) with a duck-typed `FakePredictorService` (no TF loaded, full slice runs in 0.3 s)
+ - [ ] **WS-E**: Frontend deploy to Vercel (static SPA, `VITE_API_BASE` baked at build time, SPA rewrites)
+ - [ ] **WS-F**: Production CORS: add the deployed Vercel origin to `serve.cors_allowed_origins`
+ - [ ] **WS-G**: GitHub Actions CI/CD:
+   - `ci.yml`: Python quality matrix (ruff, mypy, pytest on 3.10/3.11/3.12), notebook SHA-256 freeze check, frontend lint + build, concurrency cancel-in-progress, pip + npm caching
+   - `deploy-backend.yml`: gated on `needs: ci`, pushes to the HF Space
+   - `deploy-frontend.yml` *(optional)*: Vercel-native GitHub integration is the recommended path
+ - [ ] **WS-H**: README "Live Demo" section (badges swapped to live HF Space + Vercel URLs) + `docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md` + `docs/CI.md`
557
+
558
+ ### Phase 3: Multimodal baselines ⏳ (planned)
+
+ - [ ] **3A**: Side-by-side comparison harness: original CNN + Transformer vs. BLIP-base vs. ViT-GPT2 vs. GIT-base-coco
+ - [ ] **3B**: Per-model BLEU / CIDEr / METEOR / ROUGE-L on a shared COCO slice with deterministic tokenisation
+ - [ ] **3C**: Per-model latency benchmarking (single-image, batch, CPU vs. GPU)
+ - [ ] **3D**: Comparison-result dashboard exposed through the existing SPA
564
+
565
+ ### Phase 4: Observability ⏳ (planned)
+
+ - [ ] **4A**: Sentry error tracking on backend + frontend
+ - [ ] **4B**: Prometheus metrics (per-route latency histograms, predictor cache hits, lifespan boot duration)
+ - [ ] **4C**: DagsHub-hosted MLflow tracking link surfaced in the README
+ - [ ] **4D**: Architecture Decision Records (`docs/adr/`): every non-trivial choice (TF version pin, anyio offload, env-var prefix separation, etc.) gets a one-page ADR
571
+
572
+ Detailed phase notes live under [`docs/`](docs/): [restructure plan](docs/restructure-plan.md) · [Phase 0 notes](docs/PHASE_0_NOTES.md) · [Phase 1 notes](docs/PHASE_1_NOTES.md) · [Stabilized training runbook](docs/STABILIZED_TRAINING_RUNBOOK.md).
573
 
574
+ ---
575
 
576
+ ## 🎯 Engineering Decisions
 
 
 
 
 
 
 
 
 
 
577
 
578
+ > **Why preserve the notebook verbatim instead of refactoring it in place?**
579
+ > The notebook is the published research artefact and the only thing that can credibly produce the BLEU-4 ~24 baseline the IEEE paper claims. Editing it would silently destroy that reproducibility. The freeze + parity-audit pattern keeps the published result anchored while the modular package evolves; if the audit ever fails, the modular path has drifted from the paper and the diff is exactly where to start debugging.
580
 
581
+ > **Why pin `tensorflow-cpu==2.15.0`?**
582
+ > TF 2.16 ships Keras 3 as the default backend, and Keras 3 silently breaks `TextVectorization` save/load, the tokenizer round-trip the entire serving stack depends on. The pin is documented in [`requirements.txt`](requirements.txt) and protected by the env setup commands above. Phase 3's foundation-model baselines will live in optional dependency groups so they can install on a newer TF without unpinning the research pipeline.
583
 
584
+ > **Why two separate settings objects (`AppConfig` + `BackendSettings`)?**
585
+ > Research hyperparameters (`model.*`, `train.*`, `data.*`) and serving knobs (weights path, model version, warmup toggle, request-id header) change on different cadences and have different audiences. Folding them into one object would mean every backend env var lived in a research YAML, and every research-side schema change risked breaking a deploy. Two objects with two prefixes (`CAPTIONING__*` vs `BACKEND_*`) give each surface its own change schedule.
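
The prefix separation can be illustrated with a small standard-library sketch. It assumes that `__` acts as a nesting delimiter for the research prefix and that the serving prefix is flat; the variable names and the `collect_prefixed` helper are hypothetical and are not taken from the project's settings classes.

```python
from typing import Mapping, Optional


def collect_prefixed(
    env: Mapping[str, str], prefix: str, delimiter: Optional[str] = None
) -> dict:
    """Return variables under `prefix`, lower-cased, optionally nested on `delimiter`."""
    result: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # belongs to another surface (or to neither)
        name = key[len(prefix):].lower()
        parts = name.split(delimiter) if delimiter else [name]
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating sections as needed
        node[parts[-1]] = value
    return result


# Hypothetical environment: two research keys, one serving key, one unrelated key.
example_env = {
    "CAPTIONING__MODEL__EMBED_DIM": "512",
    "CAPTIONING__TRAIN__EPOCHS": "10",
    "BACKEND_WEIGHTS_PATH": "models/v1.0.0",
    "PATH": "/usr/bin",
}
research = collect_prefixed(example_env, "CAPTIONING__", "__")
serving = collect_prefixed(example_env, "BACKEND_")
```

Neither mapping sees the other's keys, which is the property that lets the two surfaces change independently.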
 
 
 
586
 
587
+ > **Why `anyio.to_thread.run_sync` for inference instead of `async def predict`?**
588
+ > TensorFlow's `predict` call is synchronous and CPU-bound. Calling it directly from an async route handler would block the event loop and starve every other request. Offloading via `anyio.to_thread.run_sync` lets the event loop keep serving health checks and concurrent uploads while the model runs.
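
A minimal sketch of the offload pattern follows. It uses the standard library's `asyncio.to_thread` as a stand-in for `anyio.to_thread.run_sync`, so it runs without extra dependencies; `blocking_predict`, `caption_handler`, and `heartbeat` are hypothetical names, and the sleep only simulates inference time.

```python
import asyncio
import time


def blocking_predict(image_bytes: bytes) -> str:
    """Hypothetical stand-in for a synchronous, CPU-bound model call."""
    time.sleep(0.2)  # simulates inference time
    return f"caption for {len(image_bytes)} bytes"


async def caption_handler(image_bytes: bytes) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_predict, image_bytes)


async def heartbeat(ticks: list) -> None:
    # Stands in for health checks that must keep being served during inference.
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.perf_counter())


async def main():
    ticks: list = []
    caption, _ = await asyncio.gather(
        caption_handler(b"\x00" * 1024), heartbeat(ticks)
    )
    return caption, ticks
```

Calling `blocking_predict` directly inside `caption_handler` would instead hold the event loop for the full duration of the call, and the heartbeat would only start ticking once inference had finished.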
589
 
590
+ > **Why is the bootstrap-weights script committed?**
591
+ > The serving stack (lifespan, predictor wiring, multipart upload, frontend integration) has to be verifiable before a real COCO-trained checkpoint exists. The bootstrap script makes the entire path runnable from a fresh clone, which is what lets reviewers actually evaluate the architectural work independently of the model-quality work. The captions are gibberish by design, and the README states that prominently to keep expectations honest.
592
 
593
+ > **Why `extra="forbid"` on every config schema?**
594
+ > ML projects fail catastrophically when a typo in a hyperparameter silently uses a default. `vocabularsy_size: 30000` should be a load-time error, not a quiet retraining run on the wrong vocabulary size. Strict configs are the cheapest possible insurance against the most expensive class of bug in this domain.
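
The behaviour that strict schemas buy can be sketched without dependencies. The block below is a standard-library analogue, not the project's code: `ModelConfig` and its fields are hypothetical, and in Pydantic v2 the equivalent switch is `model_config = ConfigDict(extra="forbid")` on the model class.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical fields and defaults, for illustration only.
    vocabulary_size: int = 10000
    embed_dim: int = 512


def load_strict(raw: dict) -> ModelConfig:
    """Build a ModelConfig, rejecting any key the schema does not declare."""
    known = {f.name for f in fields(ModelConfig)}
    unknown = sorted(set(raw) - known)
    if unknown:
        # Fail at load time instead of silently falling back to a default.
        raise ValueError(f"unknown config keys: {unknown}")
    return ModelConfig(**raw)
```

With this loader, `load_strict({"vocabularsy_size": 30000})` raises immediately, whereas a permissive loader would drop the misspelled key and train with the default vocabulary size.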
595
 
596
+ > **Why ship the metric suite and beam search *before* publishing new numbers?**
597
+ > Without deterministic tokenisation + a corpus-level runner + a non-greedy decoder, any "improved" number is unfalsifiable: it could be a real gain, a decoding artefact, or a tokenisation difference. The harness is the prerequisite to making the next training run mean something. Publishing the bar before the harness exists is how research projects accumulate numbers nobody can reproduce.
598
 
599
+ ---
600
 
601
+ ## 🔬 Experimental evaluation pipeline
 
 
 
 
 
602
 
603
+ The repository is evolving from a "research notebook reproduction" into a reproducible experimentation platform. Evaluation is no longer a single BLEU number printed at the end of training; it is a structured set of artefacts any future run, including the Phase 3 multimodal baselines, can be diffed against.
604
 
605
+ - **[`scripts/evaluate.py`](scripts/evaluate.py)**: single entrypoint for full corpus evaluation. Loads a checkpoint + tokenizer, runs decoding (greedy or beam) over the COCO validation slice, computes BLEU-1..4 / CIDEr / METEOR / ROUGE-L, and writes a versioned artefact set under `results/<run_id>/`.
606
+ - **[`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)**: per-sample diagnostic view. Prints N random predictions vs. references with sentence-level BLEU-4 / ROUGE-L, prediction length, longest repeated-token run, and failure flags (`empty` / `very_short` / `repetitive` / `under_length`). Used when the aggregate metric moves but the qualitative behaviour does not.
607
+ - **[`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)**: `RunMeta` and `write_run_artifacts(...)`, the contract every evaluation run honours. Phase 3 cross-model comparison code can then join multiple `results/<run_id>/` directories without bespoke parsers per model.
608
+ - **Greedy vs. beam evaluation support**: the same evaluator accepts `--decode-strategy greedy|beam` plus beam-search controls (`--beam-width`, `--length-penalty`, `--no-repeat-ngram-size`), so a single command-line difference produces directly comparable artefact sets for the same checkpoint.
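
The per-sample failure flags listed above can be sketched as follows. This is an illustration under stated assumptions, not the logic in `scripts/inspect_predictions.py`: the thresholds (`short_len`, `max_run`, `length_ratio`) and both function names are hypothetical, and tokenisation is plain whitespace splitting.

```python
from itertools import groupby


def longest_repeated_run(tokens) -> int:
    """Length of the longest run of identical consecutive tokens."""
    return max((sum(1 for _ in group) for _, group in groupby(tokens)), default=0)


def failure_flags(prediction, references, *, short_len=3, max_run=3, length_ratio=0.5):
    """Return the failure flags that apply to one predicted caption."""
    tokens = prediction.split()
    if not tokens:
        return ["empty"]
    flags = []
    if len(tokens) < short_len:
        flags.append("very_short")
    if longest_repeated_run(tokens) >= max_run:
        flags.append("repetitive")
    # Compare against the shortest reference so the check stays conservative.
    shortest_ref = min((len(ref.split()) for ref in references), default=0)
    if len(tokens) < length_ratio * shortest_ref:
        flags.append("under_length")
    return flags
```

Flags like these help separate "the metric dropped because captions collapsed into repetition" from "the metric dropped because captions got shorter", which an aggregate BLEU score cannot distinguish.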
609
 
610
  ---
611
 
612
+ ## ⚖️ Limitations
613
+
614
+ - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines for side-by-side comparison.
615
+ - The modular pipeline has not yet reproduced the IEEE notebook's BLEU-4 ~24 on a freshly trained checkpoint; see [Current model quality status](#-current-model-quality-status). The bootstrap weights shipped under [`models/v1.0.0/`](models/v1.0.0/) are intentionally random and exist only for architectural smoke testing.
616
+ - Beam search is implemented and selectable, but a head-to-head benchmark against greedy on a real checkpoint is part of in-progress Phase 1b validation, not a published result yet.
617
+ - CIDEr / METEOR / ROUGE-L are implemented and emitted into `metrics.json` per run; finalised numbers from the modular pipeline are pending a stabilized COCO-trained checkpoint.
618
+ - The validation pipeline still includes a leftover `shuffle()` from the notebook (functionally harmless, scheduled for removal in Phase 1b).
619
+
620
+ These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
621
 
622
  ---
623
 
624
+ ## 🧭 What I'd Build Next
625
 
626
+ Clear extension paths beyond the current scope, ordered by how much I'd learn building them:
627
 
628
+ - **Foundation-model fine-tuning**: fine-tune BLIP-2 or LLaVA on COCO and benchmark per-token cost vs. caption quality against the InceptionV3 + Transformer baseline.
629
+ - **Streaming generation**: server-sent events from `/v1/captions` so the SPA renders tokens as the decoder produces them, instead of waiting for the full sequence.
630
+ - **Batch inference endpoint**: a second route that accepts an array of images, runs them through one TF graph call, and amortises the per-request Python overhead, which is useful for any downstream pipeline that needs to caption a folder.
631
+ - **Visual Question Answering**: extend the same encoder + decoder pattern to `POST /v1/vqa` taking image + question, sharing the warmed CNN encoder.
632
+ - **VLM-backed comparison endpoint**: an opt-in route that runs the same image through Anthropic Claude vision or OpenAI Vision behind a feature flag, returns both captions, and surfaces a side-by-side card in the SPA. The framing is *"here's what a 2024 VLM does for the same input"*, not a replacement for the local model.
633
+ - **Online evaluation**: a background job that periodically scores the latest checkpoint against a held-out COCO slice and pushes BLEU / CIDEr / latency to a Grafana dashboard, so model regressions surface without a human running `scripts/evaluate.py`.
634
+ - **Active-learning loop**: surface low-confidence captions in the SPA, capture user corrections, and route them into a labelled corpus for the next training run.
 
 
635
 
636
  ---
637
 
638
+ ## 📚 Lessons Being Learned
639
 
640
+ > The hardest engineering skill on a research → production conversion is not the code; it is the discipline of *not improving the model* while you fix the codebase around it. Every quality intervention you fold in mid-refactor makes the parity audit ambiguous: when the numbers change, you cannot tell whether the new metric harness, the new tokenisation, the new decoder, or the new training schedule was responsible. The four ablatable flags in [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml) exist specifically so each change can be diffed in isolation.
 
 
641
 
642
+ > Pydantic with `extra="forbid"` has caught more real bugs in this codebase than every other tool combined. A typo in a YAML key that silently uses a default is the single most expensive class of bug in ML, and the fix is one config option.
643
 
644
+ > The split between research config (`AppConfig`) and serving config (`BackendSettings`) felt over-engineered the day it was introduced and has paid for itself every week since. The two surfaces change on different cadences, ship on different schedules, and need different env-var prefixes for the deploy story to make sense. Conflating them would have meant every backend-only env var lived in a research YAML.
645
 
646
+ > Notebook freezing is the smallest possible piece of engineering that earns the largest amount of trust. A SHA-256 file plus a pre-commit hook plus one CI step is enough to guarantee the published research is exactly what reviewers think it is, three years from now.
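
The check that such a hook or CI step performs can be sketched in a few lines. This is a minimal illustration, assuming the digest file follows the `sha256sum` layout (hex digest first, optional filename after); the function names are hypothetical and the repository's actual hook may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large notebooks are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_frozen(notebook: Path, digest_file: Path) -> bool:
    """Return True when the notebook still matches its recorded digest."""
    # Assumed layout: "<hex digest>  <filename>"; only the first field is used.
    expected = digest_file.read_text(encoding="utf-8").split()[0].lower()
    return sha256_of(notebook) == expected
```

A hook would call `verify_frozen` and exit non-zero on `False`, which blocks any commit that alters the frozen notebook, even by a single byte.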
647
 
648
  ---
649
 
650
+ ## 📝 License & Contact
651
+
652
+ This project is released under the [MIT License](LICENSE).
653
+
654
+ **Built by [apoorvrajdev](https://github.com/apoorvrajdev)**. Reach me at [apoorvrajmgr@gmail.com](mailto:apoorvrajmgr@gmail.com).
655
+
656
+ Contribution + commit governance for this repo is codified in [`CLAUDE.md`](CLAUDE.md).
657
 
658
+ ---
659
+
660
+ <p align="center">
661
+ <em>Built as a flagship portfolio project for ML and multimodal-AI engineering roles.</em>
662
+ </p>