Commit 5cb7968
Parent(s): 540e3d5
docs(readme): replace dev-scaffold disclaimer with trained model metrics
README.md
CHANGED
|
@@ -44,7 +44,14 @@ short_description: InceptionV3 + Transformer image captioning inference API
|
|
| 44 |
|
| 45 |
> ✅ **Deployed.** Phase 2C (public deployment) is complete. The research → modular conversion (Phase 1) and the full inference stack (Phase 2A backend + 2B frontend) ship as a live, publicly reachable system: a React 19 / Vite 8 SPA at [`image-captioning-system.vercel.app`](https://image-captioning-system.vercel.app) posts multipart uploads to `POST /v1/captions` against a Dockerised FastAPI service running on a HuggingFace Space at [`apoorvrajdev-image-captioning-api.hf.space`](https://apoorvrajdev-image-captioning-api.hf.space), which pulls its versioned weights from [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) on the Hub at lifespan startup via `snapshot_download`. The lifespan-managed `CaptionPredictor` is reused across every request with a warm graph and no per-call TF rebuilds. The IEEE notebook is preserved verbatim and protected by a SHA-256 freeze check, and a four-stage parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)) re-implements caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, and the decoder forward pass inline and asserts the modular path is byte-identical (or `tf.allclose`-identical) to the notebook. Phase 1b (training stabilization) shipped beam search, the full corpus metric suite (BLEU-1..4 / CIDEr / METEOR / ROUGE-L), a benchmark runner that emits one machine-readable artefact set per evaluation, and a stabilized training config that gates label smoothing / cosine LR / warmup / dropout-free validation behind ablatable flags. Phase 2C shipped a hardened backend test suite (12 route tests covering the full 200 / 400 / 413 / 415 / 422 / 503 contract via a duck-typed fake predictor, full slice runs in 0.3 s), a multi-stage Dockerfile, Hub-versioned weight loading with an injectable downloader for offline testing, explicit production CORS wired through Space variables, a four-job GitHub Actions CI pipeline (ruff + mypy, pytest matrix on 3.10/3.11/3.12, notebook SHA-256 freeze, frontend lint + build) plus a chained `deploy-backend.yml` that pushes `main` to the Space remote only after CI is green, and a full deployment runbook at [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). Next up: Phase 3 (multimodal baselines); see [Roadmap](#-roadmap).
|
| 46 |
|
| 47 |
-
>
|
| 48 |
|
| 49 |
---
|
| 50 |
|
|
@@ -58,7 +65,7 @@ short_description: InceptionV3 + Transformer image captioning inference API
|
|
| 58 |
|
| 59 |
Deployment topology: GitHub `main` → CI on every push → on green, `deploy-backend.yml` pushes to a HuggingFace Space (Docker SDK, cpu-basic, port 7860, single uvicorn worker); Vercel's Git integration builds and promotes the SPA in parallel. Production CORS is wired through the Space's `CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS` variable, not a hardcoded config. Full topology + rollback procedure: [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). CI/CD workflows: [`docs/CI.md`](docs/CI.md).
|
| 60 |
|
| 61 |
-
>
|
| 62 |
|
| 63 |
---
|
| 64 |
|
|
@@ -203,33 +210,51 @@ The notebook is preserved verbatim as the canonical research artefact. Improveme
|
|
| 203 |
|
| 204 |
---
|
| 205 |
|
| 206 |
-
##
|
| 207 |
|
| 208 |
-
The
|
| 209 |
|
| 210 |
-
|
| 211 |
|
| 212 |
-
|
| 213 |
-
-
|
| 214 |
-
|
| 215 |
-
-
|
| 216 |
|
| 217 |
-
|
| 218 |
|
| 219 |
-
|
| 220 |
|
| 221 |
-
|
| 222 |
|
| 223 |
-
|
| 224 |
-
- **BLEU-only evaluation hid behaviour the score did not reflect.** CIDEr, METEOR, and ROUGE-L are implemented under [`src/captioning/evaluation/`](src/captioning/evaluation/) and run through the same corpus-level runner that already produces BLEU-1..4. Every evaluation now emits the full metric set in a single `metrics.json`.
|
| 225 |
-
- **Validation-time dropout parity quirks** inherited from the notebook (`compute_loss_and_acc` ignoring its `training` argument, so dropout stayed active during validation) were identified during the parity audit. They are now gated behind an explicit config flag (`train.honour_training_flag_in_test_step`) so notebook parity is preserved by default and the conventional dropout-free validation path is opt-in via [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
|
| 226 |
-
- **Training stabilization experiments** are introduced as opt-in flags so they can be ablated cleanly rather than entangled with the baseline:
|
| 227 |
-
- label smoothing (`train.label_smoothing`),
|
| 228 |
-
- cosine LR schedule (`train.lr_schedule: cosine`),
|
| 229 |
-
- warmup steps (`train.warmup_steps`),
|
| 230 |
-
- dropout-free validation path (`train.honour_training_flag_in_test_step`).
|
| 231 |
|
| 232 |
-
|
| 233 |
|
| 234 |
---
|
| 235 |
|
|
@@ -638,9 +663,8 @@ The repository is evolving from a "research notebook reproduction" into a reprod
|
|
| 638 |
## ⚠️ Limitations
|
| 639 |
|
| 640 |
- The model produces generic captions on cluttered or rare-object scenes. This is a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines for side-by-side comparison.
|
| 641 |
-
-
|
| 642 |
-
-
|
| 643 |
-
- CIDEr / METEOR / ROUGE-L are implemented and emitted into `metrics.json` per run; finalised numbers from the modular pipeline are pending a stabilized COCO-trained checkpoint.
|
| 644 |
- Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
|
| 645 |
|
| 646 |
These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
|
|
|
|
| 44 |
|
| 45 |
> ✅ **Deployed.** Phase 2C (public deployment) is complete. The research → modular conversion (Phase 1) and the full inference stack (Phase 2A backend + 2B frontend) ship as a live, publicly reachable system: a React 19 / Vite 8 SPA at [`image-captioning-system.vercel.app`](https://image-captioning-system.vercel.app) posts multipart uploads to `POST /v1/captions` against a Dockerised FastAPI service running on a HuggingFace Space at [`apoorvrajdev-image-captioning-api.hf.space`](https://apoorvrajdev-image-captioning-api.hf.space), which pulls its versioned weights from [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) on the Hub at lifespan startup via `snapshot_download`. The lifespan-managed `CaptionPredictor` is reused across every request with a warm graph and no per-call TF rebuilds. The IEEE notebook is preserved verbatim and protected by a SHA-256 freeze check, and a four-stage parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)) re-implements caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, and the decoder forward pass inline and asserts the modular path is byte-identical (or `tf.allclose`-identical) to the notebook. Phase 1b (training stabilization) shipped beam search, the full corpus metric suite (BLEU-1..4 / CIDEr / METEOR / ROUGE-L), a benchmark runner that emits one machine-readable artefact set per evaluation, and a stabilized training config that gates label smoothing / cosine LR / warmup / dropout-free validation behind ablatable flags. Phase 2C shipped a hardened backend test suite (12 route tests covering the full 200 / 400 / 413 / 415 / 422 / 503 contract via a duck-typed fake predictor, full slice runs in 0.3 s), a multi-stage Dockerfile, Hub-versioned weight loading with an injectable downloader for offline testing, explicit production CORS wired through Space variables, a four-job GitHub Actions CI pipeline (ruff + mypy, pytest matrix on 3.10/3.11/3.12, notebook SHA-256 freeze, frontend lint + build) plus a chained `deploy-backend.yml` that pushes `main` to the Space remote only after CI is green, and a full deployment runbook at [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). Next up: Phase 3 (multimodal baselines); see [Roadmap](#-roadmap).
|
| 46 |
|
| 47 |
+
> **Trained checkpoint shipped.** The stabilized training config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) was trained on COCO 2017 (95,918 train captions, 24,082 val captions, 10 epochs, Kaggle T4 ×2, cosine LR with 500-step warmup, label smoothing 0.1). Results on a 500-sample val2017 slice:
|
| 48 |
+
>
|
| 49 |
+
> | Decode strategy | BLEU-1 | BLEU-4 | ROUGE-L | METEOR | CIDEr |
|
| 50 |
+
> |---|---|---|---|---|---|
|
| 51 |
+
> | Greedy | 42.20 | 10.57 | 37.57 | 15.45 | 0.789 |
|
| 52 |
+
> | Beam (w=4, lp=0.7, rp=1.2) | 41.93 | 10.39 | 36.84 | 15.56 | **0.826** |
|
| 53 |
+
>
|
| 54 |
+
> Full artefacts: [`results/stabilized-greedy/`](results/stabilized-greedy/) and [`results/stabilized-beam-w4-lp07-rp12/`](results/stabilized-beam-w4-lp07-rp12/). The trained weights are hosted on the Hub at [`apoorvrajdev/captioning-inceptionv3-transformer`](https://huggingface.co/apoorvrajdev/captioning-inceptionv3-transformer) and loaded by the backend at startup; the live demo now produces real captions.
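The beam row in the table above is labelled `lp=0.7`. Assuming `lp` is a length-penalty exponent applied to the summed token log-probabilities (the README does not spell out the formula, and `rp` is left out here), the following minimal sketch shows why such a penalty can change which hypothesis wins. The function names and the toy log-probabilities are hypothetical and are not taken from the repository.

```python
def raw_score(token_logprobs):
    """Plain sum of token log-probabilities; tends to favour shorter captions."""
    return sum(token_logprobs)


def length_normalized_score(token_logprobs, lp=0.7):
    """Assumed scoring rule: sum of log-probs divided by length ** lp."""
    return sum(token_logprobs) / (len(token_logprobs) ** lp)


# Toy hypotheses: a short caption with less confident tokens,
# and a longer caption with more confident tokens.
short_hyp = [-1.0] * 5
long_hyp = [-0.7] * 8

for name, hyp in (("short", short_hyp), ("long", long_hyp)):
    print(name, round(raw_score(hyp), 3), round(length_normalized_score(hyp), 3))
```

Under the raw sum the short hypothesis ranks first; with the assumed normalisation the longer, more confident hypothesis overtakes it.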
|
| 55 |
|
| 56 |
---
|
| 57 |
|
|
|
|
| 65 |
|
| 66 |
Deployment topology: GitHub `main` → CI on every push → on green, `deploy-backend.yml` pushes to a HuggingFace Space (Docker SDK, cpu-basic, port 7860, single uvicorn worker); Vercel's Git integration builds and promotes the SPA in parallel. Production CORS is wired through the Space's `CAPTIONING__SERVE__CORS_ALLOWED_ORIGINS` variable, not a hardcoded config. Full topology + rollback procedure: [`docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md`](docs/PHASE_2C_DEPLOYMENT_RUNBOOK.md). CI/CD workflows: [`docs/CI.md`](docs/CI.md).
|
| 67 |
|
| 68 |
+
> 💡 The live demo produces real captions from a COCO-trained checkpoint (CIDEr 0.83). Example: *"a bathroom with a toilet and a sink"*, *"a man riding skis down a snow covered slope"*. See [`results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl`](results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl) for 30 sample predictions vs. ground-truth references.
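For readers who want to exercise the deployed endpoint directly, here is a minimal standard-library sketch that builds a multipart `POST /v1/captions` request against the Space URL quoted in this README. The form field name (`file`), the filename, and the response format are assumptions, since the README does not document the request schema; check the backend's route definition before relying on them.

```python
import urllib.request
import uuid

API_URL = "https://apoorvrajdev-image-captioning-api.hf.space/v1/captions"


def build_caption_request(image_bytes, filename="photo.jpg",
                          field_name="file", content_type="image/jpeg"):
    """Build (but do not send) a multipart/form-data upload request.

    field_name="file" is a guess at the backend's expected form field.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + image_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes only; read a real JPEG from disk in practice.
req = build_caption_request(b"\xff\xd8\xff\xe0 placeholder")

# Sending requires network access and a real image:
# with urllib.request.urlopen(req, timeout=60) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the sketch testable offline, in the same spirit as the injectable downloader mentioned for the backend.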
|
| 69 |
|
| 70 |
---
|
| 71 |
|
|
|
|
| 210 |
|
| 211 |
---
|
| 212 |
|
| 213 |
+
## 📊 Model quality – stabilized training results
|
| 214 |
|
| 215 |
+
The stabilized training config ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)) converged on COCO 2017 in 10 epochs on Kaggle T4 ×2. Training loss dropped monotonically from 4.69 (epoch 1) to 3.33 (epoch 10); validation accuracy climbed from 0.43 to 0.48. No overfitting was observed: val_acc was still rising at epoch 10.
|
| 216 |
|
| 217 |
+
### Corpus-level metrics (500-sample val2017 slice)
|
| 218 |
|
| 219 |
+
| Metric | Greedy | Beam (w=4, lp=0.7, rp=1.2) |
|
| 220 |
+
|---|---|---|
|
| 221 |
+
| BLEU-1 | 42.20 | 41.93 |
|
| 222 |
+
| BLEU-2 | 26.09 | 25.41 |
|
| 223 |
+
| BLEU-3 | 16.52 | 16.01 |
|
| 224 |
+
| BLEU-4 | 10.57 | 10.39 |
|
| 225 |
+
| ROUGE-L | 37.57 | 36.84 |
|
| 226 |
+
| METEOR | 15.45 | 15.56 |
|
| 227 |
+
| CIDEr | 0.789 | **0.826** |
|
| 228 |
+
|
| 229 |
+
Beam search trades a marginal regression in BLEU and ROUGE-L for a roughly 5% relative CIDEr lift (0.789 to 0.826). CIDEr down-weights generic phrases and rewards image-specific vocabulary, making it the better quality signal for captioning. Full artefact sets (metrics, predictions, diagnostics, qualitative samples) are committed under [`results/`](results/).
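The trade-off described above can be checked directly from the table. The short sketch below copies the reported scores and computes the relative change of beam search over greedy decoding for each metric; the dictionary and function names are illustrative only.

```python
# Corpus-level scores copied from the metric table above.
GREEDY = {"BLEU-1": 42.20, "BLEU-2": 26.09, "BLEU-3": 16.52, "BLEU-4": 10.57,
          "ROUGE-L": 37.57, "METEOR": 15.45, "CIDEr": 0.789}
BEAM = {"BLEU-1": 41.93, "BLEU-2": 25.41, "BLEU-3": 16.01, "BLEU-4": 10.39,
        "ROUGE-L": 36.84, "METEOR": 15.56, "CIDEr": 0.826}


def relative_change(baseline, candidate):
    """Percent change of candidate relative to baseline, per metric."""
    return {name: 100.0 * (candidate[name] - baseline[name]) / baseline[name]
            for name in baseline}


deltas = relative_change(GREEDY, BEAM)
for name, pct in deltas.items():
    print(f"{name:8s} {pct:+.2f}%")
```

BLEU-1..4 and ROUGE-L come out slightly negative, while METEOR and CIDEr come out positive, with CIDEr showing the largest relative move.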
|
| 230 |
|
| 231 |
+
### Qualitative highlights
|
| 232 |
|
| 233 |
+
The model produces fluent, semantically grounded captions with correct object identification across diverse scenes. Sample predictions vs. COCO references (beam decode):
|
| 234 |
|
| 235 |
+
| Image | Predicted | Reference | BLEU-4 |
|
| 236 |
+
|---|---|---|---|
|
| 237 |
+
| 000000129379 | a woman sitting on a bench talking on a cell phone | a woman sitting on a cement wall talking on a cell phone | 64.1 |
|
| 238 |
+
| 000000360371 | a white toilet sitting in a bathroom next to a sink | a toilet sitting in a bathroom next to a scale | 69.9 |
|
| 239 |
+
| 000000402020 | a sandwich on a plate on a table | a sandwich on a plate and full wine glass are under blurry lights | 74.2 |
|
| 240 |
+
| 000000082881 | a man riding skis down a snow covered slope | two people ski over a snow covered slope | 29.8 |
|
| 241 |
+
| 000000252596 | a person riding a skateboard down a street | a person skateboards down a street that has greenery on either side | 15.7 |
|
| 242 |
|
| 243 |
+
Known failure modes: colour attribute errors (red vs. yellow), count mismatches (one vs. two), generic fallback on unusual compositions. These are expected limitations of a frozen-InceptionV3 encoder and addressable in Phase 3 with modern vision backbones.
|
| 244 |
|
| 245 |
+
### Training configuration
|
| 246 |
+
|
| 247 |
+
| Parameter | Value |
|
| 248 |
+
|---|---|
|
| 249 |
+
| Encoder | InceptionV3 (frozen, ImageNet weights) |
|
| 250 |
+
| Decoder | Multi-head Transformer (4 heads, 512-dim) |
|
| 251 |
+
| Data | COCO 2017, 95,918 train / 24,082 val captions |
|
| 252 |
+
| Epochs | 10 (no early stopping triggered) |
|
| 253 |
+
| Batch size | 64 |
|
| 254 |
+
| LR schedule | Cosine decay, peak 0.001, 500-step warmup |
|
| 255 |
+
| Label smoothing | 0.1 |
|
| 256 |
+
| Platform | Kaggle T4 ×2, TF 2.19, tf-keras 2.19 (legacy Keras 2 shim) |
|
| 257 |
+
| Wall-clock | ~3.3 hours |
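To make the LR schedule row concrete, here is a minimal pure-Python sketch of cosine decay with a 500-step warmup and a peak of 0.001. It assumes linear warmup from zero, decay to zero by the final step, and one training sample per caption (so about 1,499 steps per epoch at batch size 64). None of these details are confirmed by the table, and the actual schedule in `configs/train/stabilized.yaml` may differ.

```python
import math

PEAK_LR = 1e-3
WARMUP_STEPS = 500
# Assumption: each of the 95,918 train captions is one sample, batch size 64.
STEPS_PER_EPOCH = math.ceil(95_918 / 64)
TOTAL_STEPS = 10 * STEPS_PER_EPOCH


def lr_at(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimizer step (assumed schedule shape)."""
    if step < warmup:
        # Linear warmup from 0 up to the peak learning rate.
        return peak * step / warmup
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.6f}")
```

With these assumptions the warmup covers roughly the first third of epoch 1, so almost all of training runs on the decaying part of the curve.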
|
| 258 |
|
| 259 |
---
|
| 260 |
|
|
|
|
| 663 |
## ⚠️ Limitations
|
| 664 |
|
| 665 |
- The model produces generic captions on cluttered or rare-object scenes. This is a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines for side-by-side comparison.
|
| 666 |
+
- BLEU-4 (10.57 greedy / 10.39 beam) is below the IEEE notebook's reported ~24. The gap is attributable to frozen encoder features and a 10-epoch budget; fine-tuning the encoder or training longer would close it. See [Model quality](#-model-quality--stabilized-training-results) for the full metric table.
|
| 667 |
+
- Colour attribute errors (red vs. yellow), count mismatches (one vs. two), and generic fallback on unusual compositions are the dominant failure modes, visible in [`results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl`](results/stabilized-beam-w4-lp07-rp12/qualitative.jsonl).
|
|
|
|
| 668 |
- Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
|
| 669 |
|
| 670 |
These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
|