<h1 align="center">Image Captioning System</h1>
<p align="center">
<strong>CNN + Transformer image-to-language pipeline, lifted from an IEEE-published research notebook into a typed, tested, full-stack production codebase.</strong>
</p>
<p align="center">
<img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-3776AB?style=flat-square&logo=python&logoColor=white">
<img alt="TensorFlow 2.15" src="https://img.shields.io/badge/TensorFlow-2.15-FF6F00?style=flat-square&logo=tensorflow&logoColor=white">
<img alt="FastAPI" src="https://img.shields.io/badge/FastAPI-0.111-009688?style=flat-square&logo=fastapi&logoColor=white">
<img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-E92063?style=flat-square&logo=pydantic&logoColor=white">
<img alt="React 19" src="https://img.shields.io/badge/React-19-61DAFB?style=flat-square&logo=react&logoColor=black">
<img alt="Vite 8" src="https://img.shields.io/badge/Vite-8-646CFF?style=flat-square&logo=vite&logoColor=white">
</p>
<p align="center">
<img alt="Ruff" src="https://img.shields.io/badge/lint-ruff-261230?style=flat-square&logo=ruff&logoColor=white">
<img alt="mypy strict" src="https://img.shields.io/badge/typed-mypy%20strict-1F5082?style=flat-square">
<img alt="Tests" src="https://img.shields.io/badge/tests-94%20passing-brightgreen?style=flat-square">
<img alt="Pre-commit" src="https://img.shields.io/badge/pre--commit-enabled-FAB040?style=flat-square&logo=pre-commit&logoColor=white">
<img alt="IEEE Published" src="https://img.shields.io/badge/IEEE-published-00629B?style=flat-square&logo=ieee&logoColor=white">
<img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-blue?style=flat-square">
</p>
<p align="center">
A deliberately scoped multimodal-AI showcase that takes a published research notebook and turns it into the kind of codebase a serving team would actually maintain: typed configuration, a structured FastAPI inference service, a polished React SPA, a parity-audit gate against the original notebook, and an honest roadmap that names what is shipped and what is not.
</p>
---
## Status
> **Deployed.** Phase 2C (public deployment) is complete. The research → modular conversion (Phase 1) and the full inference stack (Phase 2A backend + 2B frontend) ship as a live, publicly reachable system: a React 19 / Vite 8 SPA at [`image-captioning-system.vercel.app`](https://image-captioning-system.vercel.app) posts multipart uploads to `POST /v1/captions` against a Dockerised FastAPI service running on a HuggingFace Space at [`apoorvrajdev-image-captioning-api.hf.space`](https://apoorvrajdev-image-captioning-api.hf.space), which pulls its versioned weights from [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) on the Hub at lifespan startup via `snapshot_download`. The lifespan-managed `CaptionPredictor` is reused across every request with a warm graph and no per-call TF rebuilds.
>
> The IEEE notebook is preserved verbatim and protected by a SHA-256 freeze check, and a four-stage parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)) re-implements caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, and the decoder forward pass inline and asserts the modular path is byte-identical (or `tf.allclose`-identical) to the notebook.
>
> Phase 1b (training stabilization) shipped beam search, the full corpus metric suite (BLEU-1..4 / CIDEr / METEOR / ROUGE-L), a benchmark runner that emits one machine-readable artefact set per evaluation, and a stabilized training config that gates label smoothing / cosine LR / warmup / dropout-free validation behind ablatable flags.
>
> Phase 2C shipped a hardened backend test suite (12 route tests covering the full 200 / 400 / 413 / 415 / 422 / 503 contract via a duck-typed fake predictor; the full slice runs in 0.3 s), a multi-stage Dockerfile, Hub-versioned weight loading with an injectable downloader for offline testing, explicit production CORS wired through Space variables, a four-job GitHub Actions CI pipeline (ruff + mypy, pytest matrix on 3.10/3.11/3.12, notebook SHA-256 freeze, frontend lint + build) plus a chained `deploy-backend.yml` that pushes `main` to the Space remote only after CI is green, and a full deployment runbook at [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). Next up: Phase 3 (multimodal baselines); see [Roadmap](#-roadmap).
>
> **Trained checkpoint shipped.** The stabilized training config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) was trained on COCO 2017 (95,918 train captions, 24,082 val captions, 10 epochs, Kaggle T4 ×2, cosine LR with 500-step warmup, label smoothing 0.1). Results on a 500-sample val2017 slice:
>
> | Decode strategy | BLEU-1 | BLEU-4 | ROUGE-L | METEOR | CIDEr |
> |---|---|---|---|---|---|
> | Greedy | 42.20 | 10.57 | 37.57 | 15.45 | 0.789 |
> | Beam (w=4, lp=0.7, rp=1.2) | 41.93 | 10.39 | 36.84 | 15.56 | **0.826** |
>
> Full artefacts: [`results/stabilized-greedy/`](results/stabilized-greedy/) and [`results/stabilized-beam-w4-lp07-rp12/`](results/stabilized-beam-w4-lp07-rp12/). The trained weights are hosted on the Hub at [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) and loaded by the backend at startup, so the live demo now produces real captions.
---
## Live Demo
| Component | URL | What you can do |
|---|---|---|
| **Frontend SPA** | https://image-captioning-system.vercel.app | Drag-and-drop an image, hit **Generate caption**, see the typed `CaptionResponse` rendered with model version, decode strategy, and latency |
| **Backend API** | https://apoorvrajdev-image-captioning-api.hf.space | Interactive Swagger at [`/docs`](https://apoorvrajdev-image-captioning-api.hf.space/docs); liveness + readiness at [`/healthz`](https://apoorvrajdev-image-captioning-api.hf.space/healthz); inference at `POST /v1/captions` |
| **Weights (HF Hub)** | https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer | Pinned to tag `v1.0.0`; the backend pulls these at lifespan startup via `snapshot_download` so the Space's git tree never contains the `.h5` |
Deployment topology: GitHub `main` → CI on every push → on green, `deploy-backend.yml` pushes to a HuggingFace Space (Docker SDK, cpu-basic, port 7860, single uvicorn worker); Vercel's Git integration builds and promotes the SPA in parallel. Production CORS is wired through the Space's `CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS` variable, not a hardcoded config. Full topology + rollback procedure: [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). CI/CD workflows: [`docs/CI.md`](docs/CI.md).
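
The shape of `CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS` suggests a prefix followed by a double-underscore nesting delimiter. The following is a stdlib-only sketch of how such a name could map onto nested config keys. The prefix, the delimiter, the JSON encoding of list values, and the function name `nested_env_overrides` are all assumptions made for illustration; the project's actual loader is not shown here.

```python
import json
from typing import Any, Mapping


def nested_env_overrides(
    environ: Mapping[str, str],
    prefix: str = "CAPTIONING__",  # assumed prefix
    delimiter: str = "__",         # assumed nesting delimiter
) -> dict[str, Any]:
    """Turn PREFIX__A__B=value pairs into {"a": {"b": value}}."""
    overrides: dict[str, Any] = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        path = name[len(prefix):].lower().split(delimiter)
        try:
            value: Any = json.loads(raw)  # lists, numbers, booleans
        except json.JSONDecodeError:
            value = raw                   # plain strings stay as-is
        node = overrides
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return overrides


env = {
    "CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS": '["https://image-captioning-system.vercel.app"]',
    "PATH": "/usr/bin",
}
print(nested_env_overrides(env))
```

Keeping the allowed origins in a deployment variable means the same image can serve a different frontend origin without a rebuild.
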
> The live demo produces real captions from a COCO-trained checkpoint (CIDEr 0.83). Example: *"a bathroom with a toilet and a sink"*, *"a man riding skis down a snow covered slope"*. See [`results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl`](results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl) for 30 sample predictions vs. ground-truth references.
---
## What Is This Project?
Image Captioning System is a research-to-production conversion of the IEEE paper *"AI Narratives: Bridging Visual Content and Linguistic Expression"*. The original work, a Kaggle notebook training an InceptionV3-encoder + multi-head Transformer-decoder on MS COCO, is preserved verbatim as the canonical research artefact. Around it sits a typed Python package, a FastAPI inference service, and a React SPA that together turn the published model into something a serving team could actually run, version, and reason about.
It is **not** a hosted product (yet; Phase 2C is shipping that), and it is **not** a thin Streamlit wrapper around `model.predict`. What this project *is* is a deliberate engineering showcase aimed at hiring teams evaluating ML, multimodal-AI, and backend skills, and at anyone who has ever wondered what it actually takes to lift a research notebook into a codebase the rest of an engineering org can build on. Every architectural decision in this repository is one I can defend in an interview.
---
## Why It Matters
Research notebooks and production ML systems are different artefacts with different audiences. A notebook proves an idea works. A production system has to **survive being maintained** by people who did not write it, on schedules nobody planned, against inputs the original author never anticipated. The hardest part of an ML career is not getting a model to converge once; it is making the resulting pipeline *legible, typed, testable, deployable, and replaceable* without losing the behaviour the paper claimed.
This project demonstrates that conversion end-to-end at a scale one engineer can build and reason about:
- **Parity-gated refactor**: the notebook stays byte-stable and a four-stage audit script asserts the modular package reproduces the notebook's behaviour at every behavioural seam.
- **Strict typed configuration**: Pydantic v2 with `extra="forbid"` so a typo in a hyperparameter is a load-time error, not a silent training run that produces wrong numbers.
- **Lifespan-managed inference**: one warm `CaptionPredictor` shared across every HTTP request, not a graph rebuilt per call.
- **Train/serve shared preprocessing**: the same `preprocess_image_tensor` runs in `tf.data` pipelines and at inference, so the bytes that enter the model in training are byte-identical to the bytes that enter it at serve time.
- **Stabilized training experiments behind ablatable flags**: every quality intervention is opt-in, so any delta between two runs is attributable to one named change rather than a tangled rewrite.
- **Reproducible benchmarking**: every evaluation writes a machine-readable `metrics.json` + `diagnostics.jsonl` set, so two checkpoints (or one checkpoint with two decoders) can be diffed without bespoke parsers.
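
To make the "typo becomes a load-time error" point concrete, here is a stdlib-only sketch of the same idea. It is a stand-in, not the project's code: the repository uses Pydantic v2 with `extra="forbid"`, while this example uses a dataclass and a hypothetical `load_strict` helper, and the field names are illustrative.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainSettings:
    """Hypothetical subset of training hyperparameters."""

    epochs: int = 10
    batch_size: int = 64
    label_smoothing: float = 0.0


def load_strict(raw: dict) -> TrainSettings:
    """Build TrainSettings, rejecting any key the schema does not declare."""
    allowed = {f.name for f in fields(TrainSettings)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(f"unknown config keys: {unknown}")
    return TrainSettings(**raw)


print(load_strict({"epochs": 10, "label_smoothing": 0.1}))

try:
    load_strict({"epochs": 10, "label_smothing": 0.1})  # typo in the key
except ValueError as err:
    print(err)
```

Without the unknown-key check, the misspelled key would be dropped and the run would silently train with the default `label_smoothing`.
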
---
## What This Project Demonstrates
- Lifting a research notebook into an **installable, typed Python package** (`src/` layout) without breaking the published architecture.
- A production-style **FastAPI** inference service with lifespan-managed model loading, structured logging, request-ID propagation, and a typed Pydantic schema for every payload.
- A polished **React 19 + Vite 8 + Tailwind v4** SPA with drag-and-drop upload, client-side validation, `AbortController` timeouts, typed `ApiError` classification, and a polled health badge.
- **Pydantic v2 strict configuration** with YAML + env-var overrides and `extra="forbid"` to eliminate the silent-defaults failure mode.
- **Custom multi-head Transformer decoder** with masked sparse-categorical cross-entropy, masked accuracy, learned (not sinusoidal) positional embeddings, and the IEEE paper's exact dropout / head configuration.
- **Beam search decoder** with length normalisation and n-gram repetition suppression alongside greedy, selectable per inference call and per evaluation run.
- **Corpus-level metric suite**: BLEU-1..4 (sacrebleu), CIDEr, METEOR, ROUGE-L, emitted as one typed artefact per run.
- **Notebook freeze + parity audit**: SHA-256 lock on the IEEE notebook plus a four-stage inline re-implementation that fails CI if the modular path drifts.
- **Pre-commit governance**: Ruff, mypy (strict), `nbstripout`, `gitleaks`, line-ending and TOML/YAML hygiene, all enforced before commits land.
- **Clean Git workflow** with Conventional Commits and small, reviewable changesets ([`CLAUDE.md`](CLAUDE.md) codifies the contribution rules).
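
The beam search bullet above can be illustrated with a toy decoder. This is a minimal sketch under stated assumptions, not the code in `inference/beam.py`. The scoring rule (summed log-probability divided by `length ** length_penalty`), the no-repeated-bigram rule used for repetition suppression, and the `toy_model` lookup table are all illustrative choices. How the project turns its `rp` setting into a penalty is not shown in this README, so it is not modelled here.

```python
import math
from typing import Callable

END = "<end>"  # hypothetical end-of-caption token


def beam_search(
    next_log_probs: Callable[[tuple[str, ...]], dict[str, float]],
    beam_width: int = 4,
    length_penalty: float = 0.7,
    max_length: int = 10,
) -> list[str]:
    """Return the best sequence under length-normalised log-probability."""

    def normalised(log_prob: float, length: int) -> float:
        return log_prob / (length ** length_penalty)

    def repeats_bigram(tokens: tuple[str, ...]) -> bool:
        bigrams = list(zip(tokens, tokens[1:]))
        return len(bigrams) != len(set(bigrams))

    beams: list[tuple[tuple[str, ...], float]] = [((), 0.0)]
    finished: list[tuple[tuple[str, ...], float]] = []

    for _ in range(max_length):
        candidates = []
        for tokens, log_prob in beams:
            for token, token_lp in next_log_probs(tokens).items():
                extended = tokens + (token,)
                if repeats_bigram(extended):
                    continue  # suppress hypotheses that repeat a bigram
                candidates.append((extended, log_prob + token_lp))
        candidates.sort(key=lambda c: normalised(c[1], len(c[0])), reverse=True)
        beams = []
        for tokens, log_prob in candidates[:beam_width]:
            (finished if tokens[-1] == END else beams).append((tokens, log_prob))
        if not beams:
            break  # every surviving hypothesis has ended

    pool = finished or beams
    if not pool:
        return []
    best_tokens, _ = max(pool, key=lambda c: normalised(c[1], len(c[0])))
    return [t for t in best_tokens if t != END]


def toy_model(prefix: tuple[str, ...]) -> dict[str, float]:
    """Hand-written next-token log-probabilities standing in for the decoder."""
    table = {
        (): {"a": math.log(0.9), "the": math.log(0.1)},
        ("a",): {"dog": math.log(0.6), "a": math.log(0.4)},
        ("a", "dog"): {END: math.log(0.7), "runs": math.log(0.3)},
    }
    return table.get(prefix, {END: 0.0})


print(beam_search(toy_model))
```

Greedy decoding is the special case `beam_width=1`: only the single best continuation is kept at each step.
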
---
## Architecture
```
┌──────────────────────────────────────────────────────────┐
│  React 19 + Vite 8 SPA                                   │
│  Tailwind v4 · AbortController · ApiError                │
└────────────────────────────┬─────────────────────────────┘
                             │  multipart/form-data
┌────────────────────────────▼─────────────────────────────┐
│  FastAPI 0.111 (Pydantic v2)                             │
│  RequestContextMiddleware · /healthz · /v1/captions      │
└────────────────────────────┬─────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────┐
│  PredictorService (anyio thread)                         │
│  bytes → tensor → predict → caption                      │
└────────────────────────────┬─────────────────────────────┘
                             │  singleton, warmed in lifespan
┌────────────────────────────▼─────────────────────────────┐
│  CaptionPredictor (TensorFlow)                           │
│  InceptionV3 → TF encoder → TF decoder → tokenizer       │
└────────────────────────────┬─────────────────────────────┘
                             │
┌────────────────────────────▼─────────────────────────────┐
│  models/vX.Y.Z/ artefacts                                │
│  model.h5 · vocab.json (versioned)                       │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│  configs/*.yaml (Pydantic v2, extra="forbid")            │
│  drives training, evaluation, AND serving                │
└──────────────────────────────────────────────────────────┘
```
### Model topology
```
┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ Input image  │──▶│ InceptionV3  │──▶│ Transformer  │──▶│ Transformer  │──▶│ Caption      │
│ 299×299×3    │   │ encoder      │   │ encoder      │   │ decoder      │   │ string       │
└──────────────┘   │ (ImageNet,   │   │ (1 layer,    │   │ (2 layers,   │   └──────────────┘
                   │ frozen)      │   │ 1 head)      │   │ 8 heads)     │
                   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘
                          ▼                  ▼                  ▼
                    [B, 64, 2048]      [B, 64, 512]    [B, T, vocab=15000]
```
### Components
- **CNN encoder**: [`models/encoder_cnn.py`](src/captioning/models/encoder_cnn.py). Pretrained InceptionV3 with the classification head removed; output reshaped to 64 spatial positions × 2048 channels. Weights frozen during training.
- **Transformer encoder**: [`models/transformer_encoder.py`](src/captioning/models/transformer_encoder.py). Single layer, one attention head. Projects InceptionV3 features into the decoder's embedding dimension.
- **Embeddings**: [`models/embeddings.py`](src/captioning/models/embeddings.py). Sum of token + *learned* positional embeddings, preserved verbatim from the published architecture.
- **Transformer decoder**: [`models/transformer_decoder.py`](src/captioning/models/transformer_decoder.py). Causal self-attention over partial captions, cross-attention over image features, feed-forward sub-block. 8 heads, `embedding_dim=512`, dropouts (0.1 / 0.3 / 0.5) preserved from the IEEE configuration.
- **Captioning model**: [`models/captioning_model.py`](src/captioning/models/captioning_model.py). Custom `train_step` / `test_step` with masked sparse-categorical cross-entropy and masked accuracy.
- **Tokenizer**: [`preprocessing/tokenizer.py`](src/captioning/preprocessing/tokenizer.py). `CaptionTokenizer` wraps `tf.keras.layers.TextVectorization`; persists vocabulary as both pickle (notebook-compatible) and JSON sidecar.
- **Inference**: [`inference/predictor.py`](src/captioning/inference/predictor.py). `CaptionPredictor.from_artifacts(weights, vocab, config)` loads everything once at boot, exposes `predict_path(...)` and `predict_tensor(...)` for stateless calls, and `warmup()` to amortise first-request latency.
- **Configuration**: [`config/schema.py`](src/captioning/config/schema.py). Pydantic v2 (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`); strict so typos in YAML or env vars become load-time errors.
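
The masked loss and masked accuracy named in the captioning-model bullet can be sketched without TensorFlow. This pure-Python version only illustrates the masking idea: padded positions contribute neither to the sum nor to the count. The padding id of `0` and both function names are assumptions; the project's `train_step` is not reproduced here.

```python
import math


def masked_cross_entropy(
    probs: list[list[float]],  # per-timestep softmax outputs, shape [T, vocab]
    targets: list[int],        # target token ids, shape [T]
    pad_id: int = 0,           # assumed padding id
) -> float:
    """Mean negative log-likelihood over non-padding positions only."""
    total, count = 0.0, 0
    for step_probs, target in zip(probs, targets):
        if target == pad_id:
            continue  # padding adds neither loss nor count
        total += -math.log(step_probs[target])
        count += 1
    return total / count if count else 0.0


def masked_accuracy(
    probs: list[list[float]], targets: list[int], pad_id: int = 0
) -> float:
    """Fraction of non-padding positions where argmax equals the target."""
    hits, count = 0, 0
    for step_probs, target in zip(probs, targets):
        if target == pad_id:
            continue
        predicted = max(range(len(step_probs)), key=step_probs.__getitem__)
        hits += int(predicted == target)
        count += 1
    return hits / count if count else 0.0


# Three timesteps over a vocabulary of three ids; the last step is padding.
probs = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.9, 0.05, 0.05]]
targets = [1, 1, 0]
print(masked_cross_entropy(probs, targets), masked_accuracy(probs, targets))
```

Averaging over real tokens only keeps short captions in a padded batch from diluting the loss with trivially predictable padding.
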
**Why a monolith on a single process?** Splitting training, evaluation, and serving across services would burn the project's budget on Kubernetes manifests instead of the things a reviewer can actually click. A layered package + one FastAPI app captures the same separation-of-concerns thinking with a tenth of the operational surface area, and the seams are placed so pulling serving into its own container (Phase 2C) is a deployment change, not a refactor.
**Why TensorFlow 2.15 specifically?** TF 2.16 ships Keras 3 by default and silently breaks `TextVectorization` save/load, so the project's `tensorflow-cpu==2.15.0` pin is deliberate. Documented in [`requirements.txt`](requirements.txt) and in the engineering-decisions section below.
---
## Sample outputs
| Image | Generated caption |
|---|---|
|  | *a man is standing on a beach with a surfboard* |
|  | *a man riding a motorcycle on a street* |
Outputs above are from the IEEE notebook; the modular pipeline reproduces these via the parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)). Live captions from the current bootstrap weights will *not* match; see [Current model quality status](#-current-model-quality-status).
---
## Research backing
The model architecture and the BLEU-4 ~24 baseline below come from the IEEE paper and its accompanying notebook:
- **Paper:** [AI Narratives: Bridging Visual Content and Linguistic Expression](https://ieeexplore.ieee.org/document/10675203) (IEEE)
- **Original notebook:** [Kaggle: image-captioning-using-dl](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
- **Frozen artefact in this repo:** [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](notebooks/01_ieee_inceptionv3_transformer.ipynb), byte-stable; pre-commit + CI enforce its SHA-256.
The notebook is preserved verbatim as the canonical research artefact. Improvements happen in the modular package; the notebook does not.
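
A freeze check of this kind can be sketched with `hashlib` alone. The helper names below are hypothetical, and the lock-file layout (hex digest as the first whitespace-separated token, as `sha256sum` would write it) is an assumption; the repository's real check lives behind `make freeze-paper-notebook` and is not reproduced here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large notebooks are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(notebook: Path, lock_file: Path) -> bool:
    """True when the notebook's hash matches the first token of the lock file."""
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    return sha256_of(notebook) == expected


# Example call (paths as listed in this README):
# notebook_is_frozen(
#     Path("notebooks/01_ieee_inceptionv3_transformer.ipynb"),
#     Path(".paper-notebook.sha256"),
# )
```

Because the comparison is on raw bytes, even a re-saved notebook with reordered metadata would fail the check, which is the point of a byte-stable artefact.
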
---
## Performance
| Metric | Value | Source |
|---|---|---|
| BLEU-4 (IEEE baseline) | ~24 | Reported in the IEEE paper / Kaggle notebook |
| Vocabulary size | 15,000 tokens | `TextVectorization` adapt over preprocessed COCO captions |
| Training set | ~120k captions sampled from COCO 2017 | `data.sample_size` in [`configs/base.yaml`](configs/base.yaml) |
| Image resolution | 299 × 299 (InceptionV3) | [`preprocessing/image.py`](src/captioning/preprocessing/image.py) |
| Max caption length | 40 tokens | `model.max_length` in [`configs/base.yaml`](configs/base.yaml) |
| Backend test suite | 12 tests · 0.3 s · no TF loaded | [`backend/app/tests/`](backend/app/tests/) |
| Full suite | **90 tests passing** | `pytest` (unit + backend + parity) |
> Re-training on the modular pipeline is a Phase 1b deliverable; once a fresh checkpoint exists, this table will publish corpus BLEU-1..4, CIDEr, METEOR, and ROUGE-L (the harnesses already exist under [`evaluation/`](src/captioning/evaluation/)).
---
## Model quality: stabilized training results
The stabilized training config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) converged on COCO 2017 in 10 epochs on Kaggle T4 ×2. Training loss dropped monotonically from 4.69 (epoch 1) to 3.33 (epoch 10); validation accuracy climbed from 0.43 to 0.48. No overfitting was observed: val_acc was still rising at epoch 10.
### Corpus-level metrics (500-sample val2017 slice)
| Metric | Greedy | Beam (w=4, lp=0.7, rp=1.2) |
|---|---|---|
| BLEU-1 | 42.20 | 41.93 |
| BLEU-2 | 26.09 | 25.41 |
| BLEU-3 | 16.52 | 16.01 |
| BLEU-4 | 10.57 | 10.39 |
| ROUGE-L | 37.57 | 36.84 |
| METEOR | 15.45 | 15.56 |
| CIDEr | 0.789 | **0.826** |
Beam search trades a marginal n-gram overlap regression for a roughly +5% relative CIDEr lift (0.789 → 0.826). CIDEr down-weights generic phrases and rewards image-specific vocabulary, making it the better quality signal for captioning. Full artefact sets (metrics, predictions, diagnostics, qualitative samples) are committed under [`results/`](results/).
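
The relative change per metric can be recomputed from the table above. The numbers are copied from that table; the dictionary layout and the `relative_change` helper are illustrative and do not reflect the schema of the committed `metrics.json` files.

```python
# Values copied from the corpus-level metrics table (500-sample val2017 slice).
greedy = {"BLEU-1": 42.20, "BLEU-4": 10.57, "ROUGE-L": 37.57, "METEOR": 15.45, "CIDEr": 0.789}
beam = {"BLEU-1": 41.93, "BLEU-4": 10.39, "ROUGE-L": 36.84, "METEOR": 15.56, "CIDEr": 0.826}


def relative_change(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Percentage change of each metric in candidate relative to baseline."""
    return {
        name: 100.0 * (candidate[name] - baseline[name]) / baseline[name]
        for name in baseline
    }


for name, pct in relative_change(greedy, beam).items():
    print(f"{name:8s} {pct:+.2f}%")
```

The CIDEr entry comes out at about +4.7%, which is the "roughly +5%" lift; the BLEU and ROUGE-L entries are small negative changes.
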
### Qualitative highlights
The model produces fluent, semantically grounded captions with correct object identification across diverse scenes. Sample predictions vs. COCO references (beam decode):
| Image | Predicted | Reference | BLEU-4 |
|---|---|---|---|
| 000000129379 | a woman sitting on a bench talking on a cell phone | a woman sitting on a cement wall talking on a cell phone | 64.1 |
| 000000360371 | a white toilet sitting in a bathroom next to a sink | a toilet sitting in a bathroom next to a scale | 69.9 |
| 000000402020 | a sandwich on a plate on a table | a sandwich on a plate and full wine glass are under blurry lights | 74.2 |
| 000000082881 | a man riding skis down a snow covered slope | two people ski over a snow covered slope | 29.8 |
| 000000252596 | a person riding a skateboard down a street | a person skateboards down a street that has greenery on either side | 15.7 |
Known failure modes: colour attribute errors (red vs. yellow), count mismatches (one vs. two), generic fallback on unusual compositions. These are expected limitations of a frozen-InceptionV3 encoder and addressable in Phase 3 with modern vision backbones.
### Training configuration
| Parameter | Value |
|---|---|
| Encoder | InceptionV3 (frozen, ImageNet weights) |
| Decoder | Multi-head Transformer (4 heads, 512-dim) |
| Data | COCO 2017, 95,918 train / 24,082 val captions |
| Epochs | 10 (no early stopping triggered) |
| Batch size | 64 |
| LR schedule | Cosine decay, peak 0.001, 500-step warmup |
| Label smoothing | 0.1 |
| Platform | Kaggle T4 ×2, TF 2.19, tf-keras 2.19 (legacy Keras 2 shim) |
| Wall-clock | ~3.3 hours |
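
The learning-rate schedule in the table can be sketched as a plain function of the step index. The peak of 0.001 and the 500 warmup steps come from the table; the linear warmup shape, decay to exactly zero, and the step count (one optimizer step per batch of 64 over 95,918 captions, with the last partial batch kept) are assumptions made for illustration.

```python
import math


def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-3,
    warmup_steps: int = 500,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


steps_per_epoch = math.ceil(95_918 / 64)  # assumes the final partial batch is kept
total_steps = 10 * steps_per_epoch

for step in (0, 499, 500, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.6f}")
```
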
---
## Tech Stack
| Layer | Technologies |
|---|---|
| **Core ML** | Python 3.10–3.12, TensorFlow-CPU 2.15.0 (pinned), NumPy, Pillow |
| **Model** | InceptionV3 encoder (frozen) + custom multi-head Transformer decoder |
| **Backend** | FastAPI 0.111, Pydantic v2, `pydantic-settings` 2.x, structlog 24, anyio 4 |
| **Frontend** | React 19, Vite 8, Tailwind v4, ESLint flat config |
| **Evaluation** | sacrebleu, custom CIDEr / METEOR / ROUGE-L implementations |
| **Tooling** | Ruff (lint + format), mypy (strict), pytest 8, pre-commit, nbstripout, gitleaks |
| **Infra (Phase 2C)** | HuggingFace Hub (weights), HuggingFace Spaces (backend), Vercel (frontend), GitHub Actions (CI/CD) |
---
## Repository Structure
```
image-captioning-system/
├── notebooks/
│   ├── 01_ieee_inceptionv3_transformer.ipynb  # FROZEN: IEEE research artefact
│   └── README.md                              # Frozen-notebook policy
│
├── src/captioning/                            # Installable package
│   ├── config/         schema.py · loader.py
│   ├── preprocessing/  caption.py · image.py · tokenizer.py · augmentation.py
│   ├── data/           coco.py · splits.py · pipeline.py
│   ├── models/         encoder_cnn.py · transformer_encoder.py · embeddings.py
│   │                   transformer_decoder.py · captioning_model.py · factory.py
│   ├── training/       losses.py · callbacks.py · trainer.py
│   ├── inference/      image_loader.py · greedy.py · beam.py · predictor.py
│   ├── evaluation/     bleu.py · cider.py · meteor.py · rouge.py
│   │                   runner.py · benchmark.py · inspection.py · tokenization.py
│   └── utils/          logging.py · seed.py · hashing.py
│
├── backend/                                   # Phase 2A: FastAPI inference service
│   └── app/
│       ├── main.py                            # App factory + lifespan-managed predictor singleton
│       ├── api/routes.py                      # Thin HTTP: /healthz, /v1/captions
│       ├── core/                              # BackendSettings, structlog setup, RequestContextMiddleware
│       ├── schemas/                           # Pydantic request/response models
│       ├── services/predictor_service.py      # bytes → caption + latency (anyio thread offload)
│       ├── utils/image.py                     # Content-type allow-list + ImageDecodeError
│       └── tests/                             # Phase 2C WS-D: 12 route tests, no TF loaded
│
├── frontend/                                  # Phase 2B: React 19 + Vite 8 + Tailwind v4 SPA
│   ├── vite.config.js · eslint.config.js · package.json · .env.example
│   └── src/
│       ├── main.jsx · App.jsx · index.css
│       ├── services/api.js                    # checkHealth / captionImage: AbortController + typed ApiError
│       └── components/
│           ├── Header.jsx · StatusBadge.jsx   # Sticky brand bar + 10s health poller
│           ├── UploadZone.jsx · ImagePreview.jsx
│           └── CaptionResult.jsx · ErrorBanner.jsx · Spinner.jsx
│
├── configs/
│   ├── base.yaml                              # IEEE hyperparameters (notebook cell 6 mirror)
│   └── train/
│       ├── debug.yaml                         # CI smoke override (1 epoch, 64 captions)
│       └── stabilized.yaml                    # Phase 1b stability experiment (4 ablatable flags)
│
├── scripts/
│   ├── train.py · evaluate.py · predict.py
│   ├── inspect_predictions.py                 # Per-sample diagnostics + diagnostics.jsonl
│   ├── bootstrap_dev_artifacts.py             # Smoke-test artefacts so the API can boot pre-training
│   └── notebook_module_audit.py               # 4-stage parity gate vs. notebook
│
├── tests/unit/                                # 78 unit tests (parity, tokenizer, eval, splits, …)
├── docs/                                      # restructure-plan · PHASE_0_NOTES · PHASE_1_NOTES · STABILIZED_TRAINING_RUNBOOK
├── pyproject.toml · requirements*.txt · Makefile
├── .pre-commit-config.yaml · .python-version · .env.example
├── .paper-notebook.sha256                     # Locked notebook hash for the freeze check
├── CLAUDE.md                                  # Contribution + commit governance
└── README.md
```
---
## Quick Start
### Prerequisites
- Python **3.10 – 3.12** (TensorFlow 2.15 has no 3.13 wheels)
- Node **20+**
- Git
### Backend
```powershell
# PowerShell (Windows)
py -3.10 -m venv .venv
.venv\Scripts\activate
pip install -r requirements-dev.txt -r requirements-eval.txt
pip install -e ".[hf,mlflow]"
pre-commit install
```
```bash
# bash (Linux / macOS)
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt -r requirements-eval.txt
pip install -e ".[hf,mlflow]"
pre-commit install
```
Boot the API:
```bash
uvicorn --app-dir backend app.main:app --host 0.0.0.0 --port 8000
```
Interactive Swagger UI is live at **http://localhost:8000/docs**; raw OpenAPI 3.1 at **http://localhost:8000/openapi.json**.
### Frontend
```bash
cd frontend
npm install
npm run dev
```
The SPA is live at **http://localhost:5173** (Vite picks the next free port if 5173 is busy). `VITE_API_BASE` (see [`frontend/.env.example`](frontend/.env.example)) points it at any backend origin; absent the env var, it falls back to `http://127.0.0.1:8000`.
### Tests
```bash
pytest -q # All 90 tests (unit + backend + parity)
pytest backend/app/tests/ -v # Backend route tests only (0.3 s, no TF loaded)
make freeze-paper-notebook # Asserts the IEEE notebook SHA-256 has not changed
```
### One-shot caption (CLI)
```bash
python -m scripts.predict \
--config configs/base.yaml \
--weights models/v1.0.0/model.h5 \
--tokenizer-dir models/v1.0.0 \
--image samples/photo.jpg
```
### One-shot caption (HTTP)
```bash
curl -X POST http://localhost:8000/v1/captions -F "image=@samples/photo.jpg"
```
### Reproduce training
```bash
python -m scripts.train --config configs/base.yaml
# Or with the stabilization experiment flags enabled:
python -m scripts.train --config configs/base.yaml --override configs/train/stabilized.yaml
# Or a 64-caption CI smoke run:
python -m scripts.train --config configs/base.yaml --override configs/train/debug.yaml
```
Outputs (`weights.h5`, `vocab.pkl` + `vocab.json` sidecar, `history.json`, `training_log.csv`) land under `outputs/runs/latest/` by default.
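
The `--override` flag in the training commands above implies that an override file is layered on top of `configs/base.yaml`. One common way to do that is a recursive dictionary merge, sketched below. Whether the project's loader merges exactly like this is an assumption, and the `train.*` key names are hypothetical; only `model.max_length` appears in this README.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Parsed YAML stand-ins; the train.* keys are hypothetical.
base = {"model": {"max_length": 40}, "train": {"epochs": 10, "label_smoothing": 0.0}}
override = {"train": {"label_smoothing": 0.1}}

print(deep_merge(base, override))
```

Merging key by key lets an override file state only the values it changes, which keeps each experiment's delta from the baseline readable.
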
`make help` lists every available command (lint, format, type-check, test, train, serve, evaluate, predict, Docker, freeze-paper-notebook, …).
---
## FastAPI backend (Phase 2A)
Phase 2A delivers a production-style inference service rather than a thin demo wrapper:
- **App factory + lifespan**: [`backend/app/main.py`](backend/app/main.py). `create_app()` builds the FastAPI instance; the lifespan loads the YAML `AppConfig`, instantiates a `CaptionPredictor`, calls `warmup()`, and stashes a `PredictorService` singleton on `app.state` so every request reuses one warm model.
- **Routes**: [`backend/app/api/routes.py`](backend/app/api/routes.py). Intentionally thin: validate inputs, delegate, shape the response. No TF imports leak into the HTTP layer.
- **Service layer**: [`backend/app/services/predictor_service.py`](backend/app/services/predictor_service.py). Wraps the predictor, decodes uploaded bytes off the event loop via `anyio.to_thread.run_sync`, measures per-request latency, returns `(caption, latency_ms)`.
- **Schemas**: [`backend/app/schemas/caption.py`](backend/app/schemas/caption.py). Pydantic v2 (`CaptionResponse`, `HealthResponse`, `ErrorResponse`); every payload that crosses the wire is typed and OpenAPI-documented.
- **Backend settings**: [`backend/app/core/config.py`](backend/app/core/config.py). Separate `BackendSettings` (env-overridable: weights path, tokenizer dir, model version, warmup toggle) layered on top of the research-side `AppConfig`. Research hyperparameters and serving knobs change on different cadences and live in different settings objects.
- **Structured logging + request IDs**: [`backend/app/core/logging.py`](backend/app/core/logging.py). `RequestContextMiddleware` stamps each request with a UUID; `structlog` carries it through every log line so a single failed caption can be traced end-to-end.
- **Image safety**: [`backend/app/utils/image.py`](backend/app/utils/image.py). Content-type allow-list (JPEG / PNG / WebP / BMP), explicit `ImageDecodeError` so malformed bytes produce a clean 422 rather than a 500.
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/healthz` | Liveness + readiness: reports `model_loaded`, `model_version`, `api_version`. Always 200; readiness is conveyed in the body. |
| `POST` | `/v1/captions` | Multipart image upload → generated caption + decode strategy + latency + request ID. |
| `GET` | `/docs` | Interactive Swagger UI, auto-generated from the Pydantic schemas. |
| `GET` | `/openapi.json` | Raw OpenAPI 3.1 spec for client codegen. |
`POST /v1/captions` enforces input validation at the boundary: **415** on disallowed content types, **413** on oversized uploads (`serve.max_upload_bytes`), **422** on undecodable image bytes, **400** on empty uploads, **503** while the predictor is still loading during a rolling restart. Together with the **200** success path, all six status codes are covered by the [`backend/app/tests/`](backend/app/tests/) suite added in Phase 2C WS-D.
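
The status-code contract above can be read as an ordered series of boundary checks. The following is a minimal sketch, not the code in `routes.py`: the function name `upload_status`, the order of the checks, and the MIME strings are assumptions made for illustration. Only the codes, the allow-list (JPEG / PNG / WebP / BMP), and the `ImageDecodeError` name come from this README.

```python
from typing import Callable, Optional

# Assumed MIME strings for the JPEG / PNG / WebP / BMP allow-list.
ALLOWED_CONTENT_TYPES = {"image/jpeg", "image/png", "image/webp", "image/bmp"}


class ImageDecodeError(ValueError):
    """Raised when uploaded bytes cannot be decoded as an image."""


def upload_status(
    content_type: Optional[str],
    data: bytes,
    *,
    max_upload_bytes: int,
    model_loaded: bool,
    decode: Callable[[bytes], object],
) -> int:
    """Return the HTTP status an upload would receive (check order is an assumption)."""
    if not model_loaded:
        return 503  # predictor still loading
    if content_type not in ALLOWED_CONTENT_TYPES:
        return 415  # disallowed content type
    if len(data) == 0:
        return 400  # empty upload
    if len(data) > max_upload_bytes:
        return 413  # larger than serve.max_upload_bytes
    try:
        decode(data)
    except ImageDecodeError:
        return 422  # bytes are not a decodable image
    return 200
```

Passing the decoder in as a callable keeps the sketch free of any imaging or TensorFlow import, in the same spirit as the thin route layer described above.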
---
## 🎨 Frontend UI (Phase 2B)
Phase 2B ships a single-page inference UI under [`frontend/`](frontend/), not a styled demo. The split mirrors the backend's separation between transport, service, and presentation:
- **Application shell**: [`frontend/src/App.jsx`](frontend/src/App.jsx). Owns the request lifecycle (selected file → preview → generate → result). The preview `URL.createObjectURL` is `useMemo`-derived and revoked through an effect cleanup so previews never leak across uploads. Four `useState` slots (`file`, `result`, `error`, `loading`) cover every UI state: no Redux, no React Query, no context.
- **API service layer**: [`frontend/src/services/api.js`](frontend/src/services/api.js). Single boundary for every backend call. Reads `import.meta.env.VITE_API_BASE` once at module load (falls back to `http://127.0.0.1:8000`), wraps `fetch` with `AbortController`-driven timeouts (3 s for `/healthz`, 60 s for `/v1/captions`), and classifies failures into `timeout` / `network` / `http` / `unknown` kinds on a typed `ApiError`.
- **Upload zone**: [`frontend/src/components/UploadZone.jsx`](frontend/src/components/UploadZone.jsx). Drag/drop + click-to-browse + keyboard activation. Validates content-type (JPEG / PNG / WebP) and size (10 MB) before the file ever touches the network; invalid uploads are rejected client-side with the same wording the backend would have returned.
- **Status badge**: [`frontend/src/components/StatusBadge.jsx`](frontend/src/components/StatusBadge.jsx). Polls `/healthz` every 10 seconds and on window focus, runs a three-state machine (`checking` / `online` / `offline`), recovers automatically when the backend comes back.
- **Error banner**: [`frontend/src/components/ErrorBanner.jsx`](frontend/src/components/ErrorBanner.jsx). Single surface for every failure class. Reads `ApiError.message` so the user sees "Cannot reach backend" or "Request timed out" instead of a raw browser error.
- **Caption result**: [`frontend/src/components/CaptionResult.jsx`](frontend/src/components/CaptionResult.jsx). Consumes the backend's typed `CaptionResponse` directly: caption text plus model version, decode strategy, latency, and the request ID echoed from the `x-request-id` header.
```
┌──────────────┐  drag/drop  ┌─────────────┐  validate  ┌──────────────┐
│  UploadZone  │ ──────────▶ │  App state  │ ─────────▶ │ ImagePreview │
└──────────────┘             └─────────────┘            └──────────────┘
                                    │ click "Generate"
                                    ▼
                           ┌─────────────────┐  multipart POST /v1/captions
                           │ services/api.js │ ──────────▶ FastAPI backend
                           └─────────────────┘
                                    │ typed CaptionResponse / ApiError
                                    ▼
                         ┌──────────────────────┐
                         │   CaptionResult /    │
                         │     ErrorBanner      │
                         └──────────────────────┘
```
Frontend and backend are deployed independently. The SPA only knows the backend's origin via `VITE_API_BASE`; the backend only trusts SPAs whose origin appears in `serve.cors_allowed_origins`. Dev origins are pre-allowed in [`configs/base.yaml`](configs/base.yaml); production origins join the same list at deploy time (Phase 2C WS-F). No shared build, no shared runtime; only the typed Pydantic schemas in [`backend/app/schemas/caption.py`](backend/app/schemas/caption.py) cross the wire.
---
## ⚙️ Configuration system
Hyperparameters are not globals. They live in YAML validated by Pydantic v2:
```yaml
# configs/base.yaml: mirrors the IEEE notebook cell 6 verbatim
model:
embedding_dim: 512
units: 512
max_length: 40
vocabulary_size: 15000
decoder_num_heads: 8
decoder_dropout_inner: 0.3
decoder_dropout_outer: 0.5
decoder_attention_dropout: 0.1
train:
epochs: 10
batch_size: 64
early_stopping_patience: 3
seed: 42
data:
sample_size: 120000
train_val_split: 0.8
```
Three load-time guarantees:
1. **Type validation.** `batch_size: "64"` (string instead of int) raises a `ValidationError` pointing at the field, not a downstream tensor-shape error.
2. **No silent typos.** `extra="forbid"` rejects unknown keys. A typo in an ML hyperparameter that silently falls back to a default is the worst failure mode, and this setting eliminates it.
3. **Env overrides.** `CAPTIONING__TRAIN__BATCH_SIZE=32` overrides at any nesting depth, which is useful for CI smoke tests, ablations, and serve-time tuning without rebuilding images.
Schema in [`src/captioning/config/schema.py`](src/captioning/config/schema.py); loader in [`src/captioning/config/loader.py`](src/captioning/config/loader.py).
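
To make guarantees 2 and 3 concrete, here is a stdlib-only sketch of how a double-underscore env var can address a nested key while unknown keys are still rejected. It illustrates the idea only: the project itself relies on Pydantic v2 and pydantic-settings, and the helper `apply_env_overrides` and its JSON-based value coercion are assumptions rather than the code in `loader.py`.

```python
import json
from copy import deepcopy
from typing import Any, Dict, Mapping

PREFIX = "CAPTIONING__"  # prefix taken from the README; the parsing below is illustrative


def apply_env_overrides(
    config: Dict[str, Any], environ: Mapping[str, str], prefix: str = PREFIX
) -> Dict[str, Any]:
    """Return a copy of `config` with PREFIX-ed env vars applied to nested keys."""
    merged = deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        path = name[len(prefix):].lower().split("__")
        node = merged
        for key in path[:-1]:
            if not isinstance(node.get(key), dict):
                raise KeyError(f"unknown config section: {'.'.join(path)}")
            node = node[key]
        leaf = path[-1]
        if leaf not in node:
            # Same spirit as extra="forbid": a typo fails loudly instead of being ignored.
            raise KeyError(f"unknown config key: {'.'.join(path)}")
        try:
            node[leaf] = json.loads(raw)  # "32" -> 32, '["a"]' -> ["a"]
        except json.JSONDecodeError:
            node[leaf] = raw  # keep plain strings as-is
    return merged


if __name__ == "__main__":
    base = {"train": {"batch_size": 64, "epochs": 10}}
    print(apply_env_overrides(base, {"CAPTIONING__TRAIN__BATCH_SIZE": "32"}))
```

Unlike this sketch, a schema-driven loader also validates the overridden value's type against the field definition, which is what turns `batch_size: "64"` into a `ValidationError`.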
---
## 🧪 Testing & code quality
```bash
make test # pytest: 90/90 (unit + backend route tests + parity)
make lint # Ruff lint + format check
make typecheck # mypy strict on src/captioning + scripts
make pre-commit # All hooks across all files
make freeze-paper-notebook # Asserts notebook SHA-256 unchanged
```
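
The `freeze-paper-notebook` target above asserts that the notebook's SHA-256 is unchanged. A check of that kind can be sketched with the standard library alone. The lock-file format assumed here (first whitespace-separated token is the hex digest, as `sha256sum` writes it) and the function names are illustrative assumptions, not the contents of the project's Makefile.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large notebooks are not read in one go."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_frozen(notebook: Path, lock_file: Path) -> None:
    """Exit non-zero if the notebook's digest differs from the recorded one."""
    # Assumption: the lock file starts with the hex digest (sha256sum-style).
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    actual = sha256_of(notebook)
    if actual != expected:
        raise SystemExit(
            f"{notebook} has changed: expected {expected}, got {actual}"
        )
```

Raising `SystemExit` with a message makes the same function usable from a pre-commit hook or a CI step, since either only needs a non-zero exit code.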
| Layer | Tool | Status |
|---|---|---|
| Lint + format | [Ruff](https://docs.astral.sh/ruff/) (replaces black + isort + flake8) | ✅ clean |
| Type-check | [mypy](https://mypy.readthedocs.io/) with `pandas-stubs`, `types-PyYAML`, `types-requests` | ✅ 0 errors |
| Tests | pytest + pytest-cov + pytest-asyncio | ✅ 90 passing |
| Notebook hygiene | [`nbstripout`](https://github.com/kynan/nbstripout) (pre-commit) | ✅ outputs stripped on commit |
| Secret scanning | [`gitleaks`](https://github.com/gitleaks/gitleaks) (pre-commit) | ✅ enabled |
| Notebook integrity | SHA-256 freeze via [`make freeze-paper-notebook`](Makefile) | ✅ locked |
| Parity audit | [`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py) (4 stages) | ✅ all passing |
The parity audit re-implements four notebook stages inline (caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, decoder forward pass) and asserts the modular path produces byte-identical (or `tf.allclose`-identical) output. It is the contract that gates any behavioural improvement.
The backend test suite ([`backend/app/tests/`](backend/app/tests/)) introduced in Phase 2C WS-D uses a duck-typed `FakePredictorService` to exercise every status code in the `/v1/captions` contract (200 / 400 / 413 / 415 / 422 / 503), plus the `/healthz` readiness flip and `x-request-id` propagation, all without loading TensorFlow. The full backend slice runs in **0.3 seconds**.
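
The duck-typing idea behind `FakePredictorService` can be sketched as follows. Only the class name and the `(caption, latency_ms)` return shape come from this README; the method name `caption_image`, the `PredictorNotReady` error, and the `handle_caption` helper are hypothetical stand-ins for whatever the real service and routes expose.

```python
import asyncio
from typing import Tuple


class PredictorNotReady(RuntimeError):
    """Hypothetical error for a predictor that has not finished loading."""


class FakePredictorService:
    """Duck-typed stand-in: same method shape as a real service, no model behind it."""

    def __init__(self, caption: str = "a fake caption", ready: bool = True) -> None:
        self.caption = caption
        self.ready = ready

    async def caption_image(self, image_bytes: bytes) -> Tuple[str, float]:
        if not self.ready:
            raise PredictorNotReady("predictor still loading")
        return self.caption, 0.0  # (caption, latency_ms)


async def handle_caption(service, image_bytes: bytes) -> Tuple[int, dict]:
    """Hypothetical route body: it depends only on the service's method shape."""
    try:
        caption, latency_ms = await service.caption_image(image_bytes)
    except PredictorNotReady:
        return 503, {"detail": "model not loaded"}
    return 200, {"caption": caption, "latency_ms": latency_ms}


if __name__ == "__main__":
    print(asyncio.run(handle_caption(FakePredictorService(), b"img")))
    print(asyncio.run(handle_caption(FakePredictorService(ready=False), b"img")))
```

Because the handler never checks the service's type, swapping in the fake needs no mocking framework and no TensorFlow import, which is what keeps a route-level suite fast.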
---
## 🗺️ Roadmap
### Phase 0: Bootstrap ✅
- [x] **0A**: Repo scaffolding, `pyproject.toml`, Makefile, Conventional Commits
- [x] **0B**: Pre-commit hooks (Ruff, mypy, nbstripout, gitleaks, line-ending + TOML/YAML hygiene)
- [x] **0C**: Notebook freeze policy + `.paper-notebook.sha256` SHA-256 lock
- [x] **0D**: Pinned dependency surface (`requirements*.txt` + `pyproject.toml` extras: `hf`, `eval`, `mlflow`, `dev`)
### Phase 1: Modularisation ✅
- [x] **1A**: Notebook → installable `captioning` package (`src/` layout)
- [x] **1B**: Pydantic v2 strict config (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`) with YAML loader + env-var overrides
- [x] **1C**: Preprocessing modules (`caption.py`, `image.py`, `tokenizer.py`, `augmentation.py`) for shared train/serve preprocessing
- [x] **1D**: Data pipeline (`coco.py`, `splits.py`, `pipeline.py`) with seeded sampling
- [x] **1E**: Model factory (`encoder_cnn.py`, `transformer_encoder.py`, `embeddings.py`, `transformer_decoder.py`, `captioning_model.py`, `factory.py`)
- [x] **1F**: Training loop (`losses.py`, `callbacks.py`, `trainer.py`) with structured logging + history serialisation
- [x] **1G**: Greedy inference (`predictor.py`, `image_loader.py`, `greedy.py`) with lifespan-friendly `from_artifacts(...)` + `warmup()`
- [x] **1H**: Notebook parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)), 4 stages, byte/tensor-identical
- [x] **1I**: Unit test suite (parity, tokenizer, evaluation, splits, hashing, image preprocessing, caption preprocessing)
### Phase 1b: Training stabilization ✅
- [x] **1b-A**: Beam-search decoder ([`inference/beam.py`](src/captioning/inference/beam.py)) with length normalisation + n-gram repetition suppression, selectable per call/run
- [x] **1b-B**: CIDEr implementation ([`evaluation/cider.py`](src/captioning/evaluation/cider.py))
- [x] **1b-C**: METEOR implementation ([`evaluation/meteor.py`](src/captioning/evaluation/meteor.py))
- [x] **1b-D**: ROUGE-L implementation ([`evaluation/rouge.py`](src/captioning/evaluation/rouge.py))
- [x] **1b-E**: Benchmark runner ([`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)) emitting one `metrics.json` + `diagnostics.jsonl` per run
- [x] **1b-F**: Per-sample inspection tool ([`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)) reporting sentence-level BLEU/ROUGE, length, longest repeated-token run, failure flags
- [x] **1b-G**: Stabilization config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) with label smoothing, cosine LR, warmup, dropout-free validation, all ablatable
- [x] **1b-H**: Stabilized training runbook ([`docs/STABILIZED_TRAINING_RUNBOOK.md`](docs/STABILIZED_TRAINING_RUNBOOK.md))
- [x] **1b-I**: Fresh stabilized COCO-trained checkpoint uploaded to HF Hub [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) (tag `v2.0.0`)
- [x] **1b-J**: Headline numbers (BLEU-1..4, CIDEr, METEOR, ROUGE-L) published in [Model quality](#-model-quality--stabilized-training-results) and committed under [`results/`](results/)
### Phase 2A: FastAPI inference service ✅
- [x] **2A-1**: App factory + lifespan-managed `CaptionPredictor` singleton with `warmup()` on boot
- [x] **2A-2**: Thin `/healthz` and `POST /v1/captions` routes with full status-code contract (200 / 400 / 413 / 415 / 422 / 503)
- [x] **2A-3**: Pydantic v2 schemas (`CaptionResponse`, `HealthResponse`, `ErrorResponse`) with auto-generated Swagger + OpenAPI 3.1
- [x] **2A-4**: `PredictorService` with `anyio.to_thread.run_sync` offload so TF inference never blocks the event loop
- [x] **2A-5**: Structured logging (`structlog`) + `RequestContextMiddleware` propagating `x-request-id` across log lines
- [x] **2A-6**: `BackendSettings` separated from research `AppConfig` (different change cadences, different env prefixes)
- [x] **2A-7**: Bootstrap dev artefacts script so the API boots before training has produced real weights
### Phase 2B: Frontend SPA ✅
- [x] **2B-1**: React 19 + Vite 8 + Tailwind v4 scaffolding, flat ESLint config with `eslint-plugin-react-hooks` + `eslint-plugin-react-refresh`
- [x] **2B-2**: Drag/drop + click-to-browse upload zone with keyboard activation and client-side content-type + size validation
- [x] **2B-3**: `services/api.js` boundary: `VITE_API_BASE` env, `AbortController` timeouts (3 s health / 60 s caption), typed `ApiError` classification
- [x] **2B-4**: Polled `/healthz` status badge with three-state machine, window-focus refetch, and automatic recovery
- [x] **2B-5**: Typed `CaptionResponse` rendering (caption, model version, decode strategy, latency, request ID) with copy-to-clipboard
- [x] **2B-6**: Single `ErrorBanner` surface mapping every `ApiError.kind` to actionable copy
- [x] **2B-7**: CORS allow-list wired through backend YAML (`serve.cors_allowed_origins`), dev origins pre-allowed
### Phase 2C: Public deployment ✅ (complete)
- [x] **WS-A**: Backend containerisation with a `Dockerfile` (python:3.11-slim, non-root UID 1000, EXPOSE 7860, HEALTHCHECK on `/healthz`) + `.dockerignore` + corrected `.env.example` schema
- [x] **WS-A4**: Lifespan integration with HuggingFace Hub. Extended `BackendSettings` with `weights_hub_repo` / `weights_hub_revision` / `weights_hub_filename` / `weights_cache_dir`; new `app.services.weights_loader.resolve_weights` calls `huggingface_hub.snapshot_download` when configured, falls back to local paths otherwise (4 new unit tests, downloader injected for offline testing)
- [x] **WS-B**: Uploaded dev-scaffold weights + tokenizer to [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) on HuggingFace Hub, tagged `v1.0.0`, verified via `snapshot_download` (SHA-256 hashes match local artefacts byte-for-byte)
- [x] **WS-C**: First manual deploy to [`apoorvrajdev/image-captioning-api`](https://huggingface.co/spaces/apoorvrajdev/image-captioning-api) on HuggingFace Spaces (Docker SDK, cpu-basic, port 7860, single worker). Space variables wire `BACKEND_WEIGHTS_HUB_REPO` / `_REVISION` / `_FILENAME` + `BACKEND_WARMUP=true`; lifespan pulls weights from the Hub on cold start; `/healthz` returns `model_loaded: true` and `/v1/captions` verified end-to-end via Swagger UI
- [x] **WS-D**: **Backend test suite** ([`backend/app/tests/`](backend/app/tests/)) with 12 route tests covering the full `/healthz` + `/v1/captions` contract (200 / 400 / 413 / 415 / 422 / 503) using a duck-typed `FakePredictorService`; no TF loaded, full slice runs in 0.3 s
- [x] **WS-E**: Frontend deployed to Vercel. `frontend/` imported as a Vite project, `VITE_API_BASE` env var baked at build time, production alias [`image-captioning-system.vercel.app`](https://image-captioning-system.vercel.app) auto-redeployed on every push to `main` via Vercel's GitHub integration
- [x] **WS-F**: Production CORS. Deployed Vercel origin added to `serve.cors_allowed_origins` via the Space's `CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS` variable (JSON array, pydantic-settings parsed), so the policy is explicit in app config rather than relying on the HF reverse-proxy default
- [x] **WS-G**: GitHub Actions CI/CD:
  - [x] `ci.yml`: Python quality (ruff lint + format check, mypy), pytest matrix on 3.10/3.11/3.12, notebook SHA-256 freeze check, frontend lint + build, concurrency cancel-in-progress, pip + npm caching
  - [x] [`deploy-backend.yml`](.github/workflows/deploy-backend.yml): chained via `workflow_run` after CI, pushes `HEAD:main` to the HF Space remote using the `HF_TOKEN` repo secret; also supports `workflow_dispatch` for manual redeploys
  - [x] `deploy-frontend.yml` *(skipped: Vercel-native GitHub integration deploys on every push, no separate workflow needed)*
- [x] **WS-H**: "[Live Demo](#-live-demo)" section above + [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md) (full topology, prerequisites, weights upload, Space setup, Vercel setup, CORS, CI/CD, smoke tests, known quirks, rollback) + [`docs/CI.md`](docs/CI.md) (workflow reference)
### Phase 3: Multimodal baselines ⏳ (planned)
- [ ] **3A**: Side-by-side comparison harness for the original CNN + Transformer vs. BLIP-base vs. ViT-GPT2 vs. GIT-base-coco
- [ ] **3B**: Per-model BLEU / CIDEr / METEOR / ROUGE-L on a shared COCO slice with deterministic tokenisation
- [ ] **3C**: Per-model latency benchmarking (single-image, batch, CPU vs. GPU)
- [ ] **3D**: Comparison-result dashboard exposed through the existing SPA
### Phase 4: Observability ⏳ (planned)
- [ ] **4A**: Sentry error tracking on backend + frontend
- [ ] **4B**: Prometheus metrics (per-route latency histograms, predictor cache hits, lifespan boot duration)
- [ ] **4C**: DagsHub-hosted MLflow tracking link surfaced in the README
- [ ] **4D**: Architecture Decision Records (`docs/adr/`); every non-trivial choice (TF version pin, anyio offload, env-var prefix separation, etc.) gets a one-page ADR

Detailed phase notes live under [`docs/`](docs/): [restructure plan](docs/restructure-plan.md) · [Phase 0 notes](docs/PHASE_0_NOTES.md) · [Phase 1 notes](docs/PHASE_1_NOTES.md) · [Stabilized training runbook](docs/STABILIZED_TRAINING_RUNBOOK.md).
---
## 🎯 Engineering Decisions
> **Why preserve the notebook verbatim instead of refactoring it in place?**
> The notebook is the published research artefact and the only thing that can credibly produce the BLEU-4 ~24 baseline the IEEE paper claims. Editing it would silently destroy that reproducibility. The freeze + parity-audit pattern keeps the published result anchored while the modular package evolves; if the audit ever fails, the modular path has drifted from the paper and the diff is exactly where to start debugging.
> **Why pin `tensorflow-cpu==2.15.0`?**
> TF 2.16 ships Keras 3 as the default backend, and Keras 3 silently breaks `TextVectorization` save/load, the tokenizer round-trip the entire serving stack depends on. The pin is documented in [`requirements.txt`](requirements.txt) and protected by the env setup commands above. Phase 3's foundation-model baselines will live in optional dependency groups so they can install on a newer TF without unpinning the research pipeline.
> **Why two separate settings objects (`AppConfig` + `BackendSettings`)?**
> Research hyperparameters (`model.*`, `train.*`, `data.*`) and serving knobs (weights path, model version, warmup toggle, request-id header) change on different cadences and have different audiences. Folding them into one object would mean every backend env var lived in a research YAML, and every research-side schema change risked breaking a deploy. Two objects with two prefixes (`CAPTIONING__*` vs `BACKEND_*`) gives each surface its own change schedule.
> **Why `anyio.to_thread.run_sync` for inference instead of `async def predict`?**
> TensorFlow's `predict` call is synchronous and CPU-bound. Calling it directly from an async route handler would block the event loop and starve every other request. Offloading via `anyio.to_thread.run_sync` lets the event loop keep serving health checks and concurrent uploads while the model runs.
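
The effect described in this answer can be demonstrated without TensorFlow or AnyIO. The sketch below swaps in the standard library's `asyncio.to_thread` for `anyio.to_thread.run_sync` and a `time.sleep` for the model call, both stand-ins chosen for illustration; `blocking_predict` and `heartbeat` are hypothetical names.

```python
import asyncio
import time
from typing import List, Tuple


def blocking_predict(image_bytes: bytes) -> str:
    """Stand-in for a synchronous model call that would block the event loop."""
    time.sleep(0.2)
    return f"caption for {len(image_bytes)} bytes"


async def heartbeat(ticks: List[float], stop: asyncio.Event) -> None:
    """Stand-in for other traffic (health checks, concurrent uploads)."""
    while not stop.is_set():
        ticks.append(time.perf_counter())
        await asyncio.sleep(0.01)


async def main() -> Tuple[str, int]:
    ticks: List[float] = []
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(ticks, stop))
    # With AnyIO the equivalent call is: await anyio.to_thread.run_sync(blocking_predict, data)
    caption = await asyncio.to_thread(blocking_predict, b"\x00" * 1024)
    stop.set()
    await beat
    return caption, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The heartbeat keeps ticking while the worker thread sleeps. Calling `blocking_predict` directly inside `main` would instead hold the loop for the full duration, leaving the heartbeat with a single tick.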
> **Why is the bootstrap-weights script committed?**
> The serving stack (lifespan, predictor wiring, multipart upload, frontend integration) has to be verifiable before a real COCO-trained checkpoint exists. The bootstrap script makes the entire path runnable from a fresh clone, which is what lets reviewers actually evaluate the architectural work independently of the model-quality work. The captions are gibberish by design, and the README states that prominently to keep expectations honest.
> **Why `extra="forbid"` on every config schema?**
> ML projects fail catastrophically when a typo in a hyperparameter silently uses a default. `vocabularsy_size: 30000` should be a load-time error, not a quiet retraining run on the wrong vocabulary size. Strict configs are the cheapest possible insurance against the most expensive class of bug in this domain.
> **Why ship the metric suite and beam search *before* publishing new numbers?**
> Without deterministic tokenisation + a corpus-level runner + a non-greedy decoder, any "improved" number is unfalsifiable: it could be a real gain, a decoding artefact, or a tokenisation difference. The harness is the prerequisite to making the next training run mean something. Publishing the bar before the harness exists is how research projects accumulate numbers nobody can reproduce.
---
## 🔬 Experimental evaluation pipeline
The repository is evolving from a "research notebook reproduction" into a reproducible experimentation platform. Evaluation is no longer a single BLEU number printed at the end of training; it is a structured set of artefacts any future run, including the Phase 3 multimodal baselines, can be diffed against.
- **[`scripts/evaluate.py`](scripts/evaluate.py)**: single entrypoint for full corpus evaluation. Loads a checkpoint + tokenizer, runs decoding (greedy or beam) over the COCO validation slice, computes BLEU-1..4 / CIDEr / METEOR / ROUGE-L, and writes a versioned artefact set under `results/<run_id>/`.
- **[`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)**: per-sample diagnostic view. Prints N random predictions vs. references with sentence-level BLEU-4 / ROUGE-L, prediction length, longest repeated-token run, and failure flags (`empty` / `very_short` / `repetitive` / `under_length`). Used when the aggregate metric moves but the qualitative behaviour does not.
- **[`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)**: `RunMeta` and `write_run_artifacts(...)`, the contract every evaluation run honours. Phase 3 cross-model comparison code joins multiple `results/<run_id>/` directories without bespoke parsers per model.
- **Greedy vs. beam evaluation support**: the same evaluator accepts `--decode-strategy greedy|beam` plus beam-search controls (`--beam-width`, `--length-penalty`, `--no-repeat-ngram-size`), so a single command-line difference produces directly comparable artefact sets for the same checkpoint.
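
The per-sample diagnostics listed for `inspect_predictions.py` can be approximated in a few lines. This is a sketch under stated assumptions: the flag names come from the list above, but the thresholds (`very_short_max`, `repeat_run_min`, `under_length_ratio`) and the reading of `under_length` as "much shorter than the reference" are guesses for illustration, not the script's actual rules.

```python
from itertools import groupby
from typing import List, Sequence


def longest_repeated_run(tokens: Sequence[str]) -> int:
    """Length of the longest run of identical consecutive tokens (0 for no tokens)."""
    return max((sum(1 for _ in group) for _, group in groupby(tokens)), default=0)


def failure_flags(
    caption: str,
    reference_len: int,
    *,
    very_short_max: int = 2,          # hypothetical threshold
    repeat_run_min: int = 3,          # hypothetical threshold
    under_length_ratio: float = 0.5,  # hypothetical threshold
) -> List[str]:
    """Return the failure flags that apply to one predicted caption."""
    tokens = caption.split()
    if not tokens:
        return ["empty"]
    flags: List[str] = []
    if len(tokens) <= very_short_max:
        flags.append("very_short")
    if longest_repeated_run(tokens) >= repeat_run_min:
        flags.append("repetitive")
    if len(tokens) < under_length_ratio * reference_len:
        flags.append("under_length")
    return flags


if __name__ == "__main__":
    print(failure_flags("a man riding a a a a wave", reference_len=8))
```

Flags like these explain a metric shift that the aggregate number hides, for example a BLEU drop caused by a handful of degenerate, repetitive outputs rather than a uniform loss of quality.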
---
## ⚠️ Limitations
- The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines for side-by-side comparison.
- BLEU-4 (10.57 greedy / 10.39 beam) is below the IEEE notebook's reported ~24. The gap is attributable to frozen encoder features and a 10-epoch budget; fine-tuning the encoder or training longer would close it. See [Model quality](#-model-quality--stabilized-training-results) for the full metric table.
- Colour attribute errors (red vs. yellow), count mismatches (one vs. two), and generic fallback on unusual compositions are the dominant failure modes, visible in [`results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl`](results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl).
- Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
---
## 🧭 What I'd Build Next
Clear extension paths beyond the current scope, ordered by how much I'd learn building them:
- **Foundation-model fine-tuning**: fine-tune BLIP-2 or LLaVA on COCO and benchmark per-token cost vs. caption quality against the InceptionV3 + Transformer baseline.
- **Streaming generation**: server-sent events from `/v1/captions` so the SPA renders tokens as the decoder produces them, instead of waiting for the full sequence.
- **Batch inference endpoint**: a second route that accepts an array of images, runs them through one TF graph call, and amortises the per-request Python overhead, which is useful for any downstream pipeline that needs to caption a folder.
- **Visual Question Answering**: extend the same encoder + decoder pattern to `POST /v1/vqa` taking image + question, sharing the warmed CNN encoder.
- **VLM-backed comparison endpoint**: an opt-in route that runs the same image through Anthropic Claude vision or OpenAI Vision behind a feature flag, returns both captions, and surfaces a side-by-side card in the SPA. The framing is *"here's what a 2024 VLM does for the same input"*, not a replacement for the local model.
- **Online evaluation**: a background job that periodically scores the latest checkpoint against a held-out COCO slice and pushes BLEU / CIDEr / latency to a Grafana dashboard, so model regressions surface without a human running `scripts/evaluate.py`.
- **Active-learning loop**: surface low-confidence captions in the SPA, capture user corrections, and route them into a labelled corpus for the next training run.
---
## 📚 Lessons Being Learned
> The hardest engineering skill on a research → production conversion is not the code; it is the discipline of *not improving the model* while you fix the codebase around it. Every quality intervention you fold in mid-refactor makes the parity audit ambiguous: when the numbers change, you cannot tell whether the new metric harness, the new tokenisation, the new decoder, or the new training schedule was responsible. The four ablatable flags in [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml) exist specifically so each change can be diffed in isolation.
> Pydantic with `extra="forbid"` has caught more real bugs in this codebase than every other tool combined. A typo in a YAML key that silently uses a default is the single most expensive class of bug in ML, and the fix is one config option.
> The split between research config (`AppConfig`) and serving config (`BackendSettings`) felt over-engineered the day it was introduced and has paid for itself every week since. The two surfaces change on different cadences, ship on different schedules, and need different env-var prefixes for the deploy story to make sense. Conflating them would have meant every backend-only env var lived in a research YAML.
> Notebook freezing is the smallest possible piece of engineering that earns the largest amount of trust. A SHA-256 file plus a pre-commit hook plus one CI step is enough to guarantee the published research is exactly what reviewers think it is, three years from now.
---
## 📄 License & Contact
This project is released under the [MIT License](LICENSE).
**Built by [apoorvrajdev](https://github.com/apoorvrajdev)**. Reach me at [apoorvrajmgr@gmail.com](mailto:apoorvrajmgr@gmail.com).
Contribution + commit governance for this repo is codified in [`CLAUDE.md`](CLAUDE.md).
---
<p align="center">
<em>Built as a flagship portfolio project for ML and multimodal-AI engineering roles.</em>
</p>