Commit 5cb7968 by apoorvrajdev · 1 Parent(s): 540e3d5

docs(readme): replace dev-scaffold disclaimer with trained model metrics

Files changed (1): README.md (+48 -24)
@@ -44,7 +44,14 @@ short_description: InceptionV3 + Transformer image captioning inference API
  > ✅ **Deployed.** Phase 2C (public deployment) is complete. The research → modular conversion (Phase 1) and the full inference stack (Phase 2A backend + 2B frontend) ship as a live, publicly reachable system: a React 19 / Vite 8 SPA at [`image-captioning-system.vercel.app`](https://image-captioning-system.vercel.app) posts multipart uploads to `POST /v1/captions` against a Dockerised FastAPI service running on a HuggingFace Space at [`apoorvrajdev-image-captioning-api.hf.space`](https://apoorvrajdev-image-captioning-api.hf.space), which pulls its versioned weights from [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) on the Hub at lifespan startup via `snapshot_download`. The lifespan-managed `CaptionPredictor` is reused across every request with a warm graph and no per-call TF rebuilds. The IEEE notebook is preserved verbatim and protected by a SHA-256 freeze check, and a four-stage parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)) re-implements caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, and the decoder forward pass inline and asserts the modular path is byte-identical (or `tf.allclose`-identical) to the notebook. Phase 1b (training stabilization) shipped beam search, the full corpus metric suite (BLEU-1..4 / CIDEr / METEOR / ROUGE-L), a benchmark runner that emits one machine-readable artefact set per evaluation, and a stabilized training config that gates label smoothing / cosine LR / warmup / dropout-free validation behind ablatable flags.
Phase 2C shipped a hardened backend test suite (12 route tests covering the full 200 / 400 / 413 / 415 / 422 / 503 contract via a duck-typed fake predictor; the full slice runs in 0.3 s), a multi-stage Dockerfile, Hub-versioned weight loading with an injectable downloader for offline testing, explicit production CORS wired through Space variables, a four-job GitHub Actions CI pipeline (ruff + mypy, pytest matrix on 3.10/3.11/3.12, notebook SHA-256 freeze, frontend lint + build) plus a chained `deploy-backend.yml` that pushes `main` to the Space remote only after CI is green, and a full deployment runbook at [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). Next up: Phase 3 (multimodal baselines); see [Roadmap](#-roadmap).
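
The "Hub-versioned weight loading with an injectable downloader" mentioned above could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's actual loader: `resolve_weights_dir`, `fake_downloader`, the `Downloader` alias, and the `v1.0.0` revision name are hypothetical. Only `huggingface_hub.snapshot_download` and its `repo_id` / `revision` keyword arguments are real.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical type: anything that fetches a repo snapshot and returns
# the local directory it was written to.
Downloader = Callable[..., str]


def _hub_downloader(**kwargs) -> str:
    # Imported lazily so offline tests never need huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**kwargs)


def resolve_weights_dir(
    repo_id: str,
    revision: str,
    downloader: Optional[Downloader] = None,
) -> Path:
    """Fetch one pinned revision of the weights repo and return its local path."""
    fetch = downloader or _hub_downloader
    local_dir = fetch(repo_id=repo_id, revision=revision)
    return Path(local_dir)


def fake_downloader(**kwargs) -> str:
    # Offline stand-in for tests: builds a path and touches no network.
    return f"/tmp/fake-cache/{kwargs['repo_id']}/{kwargs['revision']}"


# Usage in a test: inject the fake so no Hub request is made.
weights_dir = resolve_weights_dir(
    "apoorvrajdev/captioning-inceptionv3-transformer",
    "v1.0.0",  # assumed revision name, for illustration only
    downloader=fake_downloader,
)
```

Passing the downloader as a parameter keeps the startup path identical in production and in tests; only the callable changes.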
 
- > ⚠️ **Caption quality disclaimer.** The weights committed under [`models/v1.0.0/`](models/v1.0.0/) are **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py): the architecture is wired correctly but every weight is randomly initialised. They exist to exercise the serving stack (lifespan, predictor wiring, multipart upload, frontend integration) before a real COCO-trained checkpoint is dropped in. Live captions therefore look like noise today; that is the *intended* state of the bootstrap path, not a regression. See [Current model quality status](#-current-model-quality-status) for what is being done about it.

  ---
 
@@ -58,7 +65,7 @@ short_description: InceptionV3 + Transformer image captioning inference API
  Deployment topology: GitHub `main` → CI on every push → on green, `deploy-backend.yml` pushes to a HuggingFace Space (Docker SDK, cpu-basic, port 7860, single uvicorn worker); Vercel's Git integration builds and promotes the SPA in parallel. Production CORS is wired through the Space's `CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS` variable, not a hardcoded config. Full topology + rollback procedure: [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). CI/CD workflows: [`docs/CI.md`](docs/CI.md).
 
- > ⚠️ The live caption you get will look like noise (e.g. `"plate city mountain [UNK]"`). That is the dev-scaffold weight story above: the infrastructure is what's being demonstrated end-to-end; the trained checkpoint is a future `v2.0.0` swap with **zero code changes** (just a Space variable bump from `v1.0.0` → `v2.0.0`).

  ---
 
@@ -203,33 +210,51 @@ The notebook is preserved verbatim as the canonical research artefact. Improveme
  ---

- ## ⚠️ Current model quality status

- The frontend, backend, and inference pipeline are operational end-to-end against the modular package, but **caption quality from the current modular pipeline is still below expectations**. The IEEE notebook reported BLEU-4 ~24; a freshly trained checkpoint produced by the modular trainer has not yet reproduced that figure on COCO. The serving stack is production-style and ready for a real checkpoint; what is missing is the checkpoint itself.

- Current engineering effort is focused on:

- - **Training stability**: diagnosing why early modular training runs collapse onto a small set of high-frequency captions instead of generalising.
- - **Evaluation correctness**: moving from a single BLEU score to a full corpus-level metric suite with deterministic tokenisation, so two runs against the same slice are mechanically comparable.
- - **Decoding improvements**: replacing greedy-only generation with beam search, repetition controls, and length normalisation.
- - **Reproducible benchmarking**: emitting one consistent artefact set per evaluation run so any two runs (or any two models) can be diffed without bespoke parsing per checkpoint.

- The weights currently committed under [`models/v1.0.0/`](models/v1.0.0/) are the **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py). Captions returned by the live API today will look like noise; that is the *intended* state of the bootstrap path, not a regression. Poor caption quality at this stage is expected until a properly COCO-trained checkpoint replaces those files.

- This gap is being addressed through the **stabilized training workflow** at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml), which gates convergence-stability primitives behind explicit, ablatable flags rather than rewriting the baseline.
 
- ### Accuracy investigation (ongoing)

- - **Greedy decoding limited caption quality and diversity.** Argmax-per-step routinely picked the locally-most-probable token regardless of how that affected the overall sequence likelihood, biasing outputs toward a small "safe captions" basin. Beam-search infrastructure now lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) and dispatches through `CaptionPredictor` alongside the existing greedy path; decode strategy is selectable per inference call and per evaluation run.
- - **BLEU-only evaluation hid behaviour the score did not reflect.** CIDEr, METEOR, and ROUGE-L are implemented under [`src/captioning/evaluation/`](src/captioning/evaluation/) and run through the same corpus-level runner that already produces BLEU-1..4. Every evaluation now emits the full metric set in a single `metrics.json`.
- - **Validation-time dropout parity quirks** inherited from the notebook (`compute_loss_and_acc` ignoring its `training` argument, so dropout stayed active during validation) were identified during the parity audit. They are now gated behind an explicit config flag (`train.honour_training_flag_in_test_step`) so notebook parity is preserved by default and the conventional dropout-free validation path is opt-in via [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
- - **Training stabilization experiments** are introduced as opt-in flags so they can be ablated cleanly rather than entangled with the baseline:
-   - label smoothing (`train.label_smoothing`),
-   - cosine LR schedule (`train.lr_schedule: cosine`),
-   - warmup steps (`train.warmup_steps`),
-   - dropout-free validation path (`train.honour_training_flag_in_test_step`).

- A complete experimental training config (not a thin override) lives at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml). It is byte-for-byte identical to [`configs/base.yaml`](configs/base.yaml) except for those four flags, so any quality delta between the two runs is attributable to those flags alone.

  ---
 
@@ -638,9 +663,8 @@ The repository is evolving from a "research notebook reproduction" into a reprod
  ## ⚖️ Limitations

  - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines for side-by-side comparison.
- - The modular pipeline has not yet reproduced the IEEE notebook's BLEU-4 ~24 on a freshly trained checkpoint; see [Current model quality status](#-current-model-quality-status). The bootstrap weights shipped under [`models/v1.0.0/`](models/v1.0.0/) are intentionally random and exist only for architectural smoke testing.
- - Beam search is implemented and selectable, but a head-to-head benchmark against greedy on a real checkpoint is part of in-progress Phase 1b validation, not a published result yet.
- - CIDEr / METEOR / ROUGE-L are implemented and emitted into `metrics.json` per run; finalised numbers from the modular pipeline are pending a stabilized COCO-trained checkpoint.
  - Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).

  These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
 
+ > 📊 **Trained checkpoint shipped.** A model was trained with the stabilized training config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) on COCO 2017 (95,918 train captions, 24,082 val captions, 10 epochs, Kaggle T4 ×2, cosine LR with 500-step warmup, label smoothing 0.1). Results on a 500-sample val2017 slice:
+ >
+ > | Decode strategy | BLEU-1 | BLEU-4 | ROUGE-L | METEOR | CIDEr |
+ > |---|---|---|---|---|---|
+ > | Greedy | 42.20 | 10.57 | 37.57 | 15.45 | 0.789 |
+ > | Beam (w=4, lp=0.7, rp=1.2) | 41.93 | 10.39 | 36.84 | 15.56 | **0.826** |
+ >
+ > Full artefacts: [`results/stabilized-greedy/`](results/stabilized-greedy/) and [`results/stabilized-beam-w4-lp07-rp12/`](results/stabilized-beam-w4-lp07-rp12/). The trained weights are hosted on the Hub at [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) and loaded by the backend at startup, so the live demo now produces real captions.
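
The `lp` and `rp` values in the beam row are decode hyperparameters whose exact formulas the text above does not spell out. The sketch below assumes two common conventions, which may differ from what `src/captioning/inference/beam.py` actually does: a length penalty that divides a hypothesis's summed log-probability by `length ** lp`, and a repetition penalty that multiplies the (non-positive) log-probability of already-generated tokens by `rp`. All function names and the toy numbers are hypothetical.

```python
from typing import Dict, List


def length_normalised_score(sum_logprob: float, length: int, lp: float) -> float:
    # Assumed convention: divide the summed log-probability by length**lp.
    # lp=0 leaves the raw sum; lp=1 gives the per-token average.
    return sum_logprob / (length ** lp)


def apply_repetition_penalty(
    logprobs: Dict[str, float], generated: List[str], rp: float
) -> Dict[str, float]:
    # Assumed convention: log-probabilities are <= 0, so multiplying by
    # rp > 1 makes already-generated tokens less attractive.
    seen = set(generated)
    return {tok: (value * rp if tok in seen else value) for tok, value in logprobs.items()}


# Toy hypotheses (made-up numbers, not model output).
shorter = {"tokens": ["a", "dog"], "sum_logprob": -2.0}
longer = {"tokens": ["a", "dog", "on", "a", "beach"], "sum_logprob": -3.5}

# Without normalisation the shorter hypothesis wins on raw log-probability.
raw_winner = max((shorter, longer), key=lambda h: h["sum_logprob"])

# With lp=0.7 the longer, more specific hypothesis is preferred.
norm_winner = max(
    (shorter, longer),
    key=lambda h: length_normalised_score(h["sum_logprob"], len(h["tokens"]), lp=0.7),
)

# "a" was already generated, so its log-probability is pushed down.
penalised = apply_repetition_penalty({"a": -0.5, "beach": -1.0}, ["a", "dog"], rp=1.2)
```

Under these assumptions, length normalisation counteracts beam search's bias toward short captions, and the repetition penalty discourages loops such as repeated articles or nouns.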
 
  ---
 
 
+ > 💡 The live demo produces real captions from a COCO-trained checkpoint (CIDEr 0.83). Examples: *"a bathroom with a toilet and a sink"*, *"a man riding skis down a snow covered slope"*. See [`results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl`](results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl) for 30 sample predictions vs. ground-truth references.
 
  ---
 
 
  ---

+ ## 📊 Model quality — stabilized training results

+ Training with the stabilized config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) converged on COCO 2017 in 10 epochs on Kaggle T4 ×2. Training loss dropped monotonically from 4.69 (epoch 1) to 3.33 (epoch 10); validation accuracy climbed from 0.43 to 0.48. No overfitting was observed: val_acc was still rising at epoch 10.

+ ### Corpus-level metrics (500-sample val2017 slice)

+ | Metric | Greedy | Beam (w=4, lp=0.7, rp=1.2) |
+ |---|---|---|
+ | BLEU-1 | 42.20 | 41.93 |
+ | BLEU-2 | 26.09 | 25.41 |
+ | BLEU-3 | 16.52 | 16.01 |
+ | BLEU-4 | 10.57 | 10.39 |
+ | ROUGE-L | 37.57 | 36.84 |
+ | METEOR | 15.45 | 15.56 |
+ | CIDEr | 0.789 | **0.826** |
+
+ Beam search trades a marginal n-gram overlap regression (BLEU and ROUGE-L dip slightly) for a roughly +5% relative CIDEr lift (0.789 → 0.826). CIDEr down-weights generic phrases and rewards image-specific vocabulary, making it the better quality signal for captioning. Full artefact sets (metrics, predictions, diagnostics, qualitative samples) are committed under [`results/`](results/).
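
The relative changes behind that trade-off can be recomputed directly from the table. The snippet below is a small worked check, with the metric values copied from the table above; the helper name `relative_change_pct` is just for illustration.

```python
# Values copied from the corpus-level metrics table (500-sample val2017 slice).
greedy = {"BLEU-1": 42.20, "BLEU-2": 26.09, "BLEU-3": 16.52, "BLEU-4": 10.57,
          "ROUGE-L": 37.57, "METEOR": 15.45, "CIDEr": 0.789}
beam = {"BLEU-1": 41.93, "BLEU-2": 25.41, "BLEU-3": 16.01, "BLEU-4": 10.39,
        "ROUGE-L": 36.84, "METEOR": 15.56, "CIDEr": 0.826}


def relative_change_pct(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return 100.0 * (candidate - baseline) / baseline


# Beam vs. greedy, per metric, as a signed percentage.
deltas = {name: relative_change_pct(greedy[name], beam[name]) for name in greedy}

for name, pct in deltas.items():
    print(f"{name:8s} {pct:+.2f}%")
```

CIDEr comes out at about +4.7% and BLEU-4 at about -1.7%, which is where the "roughly +5%" and "marginal regression" characterisations come from.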
 
+ ### Qualitative highlights

+ The model produces fluent, semantically grounded captions with correct object identification across diverse scenes. Sample predictions vs. COCO references (beam decode):

+ | Image | Predicted | Reference | BLEU-4 |
+ |---|---|---|---|
+ | 000000129379 | a woman sitting on a bench talking on a cell phone | a woman sitting on a cement wall talking on a cell phone | 64.1 |
+ | 000000360371 | a white toilet sitting in a bathroom next to a sink | a toilet sitting in a bathroom next to a scale | 69.9 |
+ | 000000402020 | a sandwich on a plate on a table | a sandwich on a plate and full wine glass are under blurry lights | 74.2 |
+ | 000000082881 | a man riding skis down a snow covered slope | two people ski over a snow covered slope | 29.8 |
+ | 000000252596 | a person riding a skateboard down a street | a person skateboards down a street that has greenery on either side | 15.7 |

+ Known failure modes: colour attribute errors (red vs. yellow), count mismatches (one vs. two), generic fallback on unusual compositions. These are expected limitations of a frozen-InceptionV3 encoder and addressable in Phase 3 with modern vision backbones.

+ ### Training configuration
+
+ | Parameter | Value |
+ |---|---|
+ | Encoder | InceptionV3 (frozen, ImageNet weights) |
+ | Decoder | Multi-head Transformer (4 heads, 512-dim) |
+ | Data | COCO 2017, 95,918 train / 24,082 val captions |
+ | Epochs | 10 (no early stopping triggered) |
+ | Batch size | 64 |
+ | LR schedule | Cosine decay, peak 0.001, 500-step warmup |
+ | Label smoothing | 0.1 |
+ | Platform | Kaggle T4 ×2, TF 2.19, tf-keras 2.19 (legacy Keras 2 shim) |
+ | Wall-clock | ~3.3 hours |
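
The "LR schedule" row can be read as the curve sketched below. This is a minimal illustration, assuming a linear warmup from zero to the peak over the first 500 steps followed by cosine decay to zero; the repository's actual schedule (warmup shape, final LR floor, total step count) may differ. The step budget of 1,499 steps per epoch is an estimate from ceil(95,918 / 64) and assumes one optimizer step per batch of 64 captions.

```python
import math

PEAK_LR = 1e-3        # "peak 0.001" from the table
WARMUP_STEPS = 500    # "500-step warmup" from the table
STEPS_PER_EPOCH = math.ceil(95_918 / 64)  # assumption: one step per 64-caption batch
TOTAL_STEPS = STEPS_PER_EPOCH * 10        # assumption: 10 full epochs


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 up to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to 0 over the post-warmup steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Sample the curve at a few points of interest.
samples = {step: lr_at(step) for step in (0, 250, 500, TOTAL_STEPS // 2, TOTAL_STEPS)}
```

The warmup keeps early updates small while the randomly initialised decoder settles, and the cosine tail lowers the rate smoothly rather than in discrete drops.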
 
  ---
 
 
  ## ⚖️ Limitations

  - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines for side-by-side comparison.
+ - BLEU-4 (10.57 greedy / 10.39 beam) is below the IEEE notebook's reported ~24. The gap is likely attributable to frozen encoder features and a 10-epoch budget; fine-tuning the encoder or training longer should narrow it. See [Model quality](#-model-quality--stabilized-training-results) for the full metric table.
+ - Colour attribute errors (red vs. yellow), count mismatches (one vs. two), and generic fallback on unusual compositions are the dominant failure modes, visible in [`results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl`](results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl).
  - Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).

  These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).