Commit · a0f0210
Parent(s): 91a1214
docs(readme): document stabilization phase and evaluation pipeline
README.md
CHANGED
|
@@ -118,6 +118,39 @@ Outputs above are from the IEEE notebook; the modular pipeline reproduces these
|
|
| 118 |
|
| 119 |
---
|
| 120 |
| 121 |
## Project structure
|
| 122 |
|
| 123 |
```
|
|
@@ -133,8 +166,9 @@ image-captioning-system/
|
|
| 133 |
│   ├── models/        encoder_cnn.py · transformer_encoder.py · embeddings.py
|
| 134 |
│   │                  transformer_decoder.py · captioning_model.py · factory.py
|
| 135 |
│   ├── training/      losses.py · callbacks.py · trainer.py
|
| 136 |
-
│   ├── inference/     image_loader.py · greedy.py · predictor.py
|
| 137 |
-
│   ├── evaluation/    bleu.py
|
| 138 |
│   └── utils/         logging.py · seed.py · hashing.py
|
| 139 |
│
|
| 140 |
├── backend/           # Phase 2A: FastAPI inference service
|
|
@@ -170,10 +204,13 @@ image-captioning-system/
|
|
| 170 |
│
|
| 171 |
├── configs/
|
| 172 |
│   ├── base.yaml      # IEEE hyperparameters (cell 6 mirror)
|
| 173 |
-
│   └── train/
|
| 174 |
│
|
| 175 |
├── scripts/
|
| 176 |
│   ├── train.py · evaluate.py · predict.py
|
| 177 |
│   ├── bootstrap_dev_artifacts.py    # Smoke-test artefacts so the API can boot pre-training
|
| 178 |
│   └── notebook_module_audit.py      # Parity gate vs. notebook
|
| 179 |
│
|
|
@@ -513,17 +550,40 @@ This is what separates this repository from a notebook conversion:
|
|
| 513 |
## Limitations
|
| 514 |
|
| 515 |
- The model produces generic captions on cluttered or rare-object scenes. This is a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
|
| 516 |
-
-
|
| 517 |
- Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
|
| 518 |
-
- BLEU is the only metric in v1; CIDEr / METEOR / ROUGE-L slot into the same runner interface in Phase 1b.
|
| 519 |
|
| 520 |
These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
|
| 521 |
|
| 522 |
---
|
| 523 |
| 524 |
## Roadmap
|
| 525 |
|
| 526 |
-
- **Phase 1b**: beam search, CIDEr / METEOR / ROUGE-L,
|
| 527 |
- **Phase 2A** ✅: FastAPI backend, lifespan-managed predictor singleton, multipart inference endpoint, structured logging + request IDs, Pydantic schemas, Swagger/OpenAPI docs, health/readiness probe.
|
| 528 |
- **Phase 2B** ✅: React 19 + Vite 8 + Tailwind v4 SPA, drag/drop upload UX, live API integration against `POST /v1/captions`, env-driven `VITE_API_BASE`, `AbortController` timeouts, typed `ApiError` classification, polled health badge with auto-recovery, CORS allow-list wired through the backend YAML config.
|
| 529 |
- **Phase 2C**: Deployment integration with a HuggingFace Spaces backend, Vercel-hosted frontend, production CORS allow-list, and GitHub Actions CI/CD across both packages.
|
|
@@ -548,6 +608,12 @@ Detailed plan: [`docs/restructure-plan.md`](docs/restructure-plan.md).
|
|
| 548 |
- Responsive Tailwind v4 inference interface: single-column layout under the `lg` breakpoint, sticky header with live status, modular component split under [`frontend/src/components/`](frontend/src/components/).
|
| 549 |
- Typed API communication: SPA consumes the same Pydantic `CaptionResponse` shape the backend emits; caption, `model_version`, `decode_strategy`, `latency_ms`, and `request_id` render directly from the wire payload.
|
| 550 |
- Production-style frontend architecture: dedicated [`services/api.js`](frontend/src/services/api.js) boundary, env-driven `VITE_API_BASE` with safe fallback, lint-clean flat ESLint config, static-asset build via `npm run build`.
|
| 551 |
|
| 552 |
---
|
| 553 |
|
|
|
|
| 118 |
|
| 119 |
---
|
| 120 |
|
| 121 |
+
## Current model quality status
|
| 122 |
+
|
| 123 |
+
The frontend, backend, and inference pipeline are operational end-to-end against the modular package, but **caption quality from the current modular pipeline is still below expectations**. The IEEE notebook reported BLEU-4 ~24; a freshly trained checkpoint produced by the modular trainer has not yet reproduced that figure on COCO. The serving stack is production-style and ready for a real checkpoint; what is missing is the checkpoint itself.
|
| 124 |
+
|
| 125 |
+
Current engineering effort is focused on:
|
| 126 |
+
|
| 127 |
+
- **Training stability**: diagnosing why early modular training runs collapse onto a small set of high-frequency captions instead of generalising.
|
| 128 |
+
- **Evaluation correctness**: moving from a single BLEU score to a full corpus-level metric suite with deterministic tokenisation, so two runs against the same slice are mechanically comparable.
|
| 129 |
+
- **Decoding improvements**: replacing greedy-only generation with beam search, repetition controls, and length normalisation.
|
| 130 |
+
- **Reproducible benchmarking**: emitting one consistent artefact set per evaluation run so any two runs (or any two models) can be diffed without bespoke parsing per checkpoint.
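
The decoding item above can be illustrated with a toy example. The sketch below is not the repository's `beam.py`; it is a minimal, self-contained beam search over a hypothetical `toy_step` function and a made-up probability table. It shows how a wider beam can recover a sequence that argmax decoding misses, with length normalisation and no-repeat n-gram blocking as optional controls. Every name and default here (`beam_search`, `repeats_ngram`, `TABLE`, the `<end>` token) is an assumption for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: token prefix -> log-probabilities of the next token.
StepFn = Callable[[Sequence[str]], Dict[str, float]]
EOS = "<end>"


def repeats_ngram(tokens: Sequence[str], n: int) -> bool:
    """True if the final n-gram of `tokens` already occurred earlier."""
    if n <= 0 or len(tokens) < n + 1:
        return False
    last = tuple(tokens[-n:])
    return any(tuple(tokens[i:i + n]) == last for i in range(len(tokens) - n))


def beam_search(step_fn: StepFn, beam_width: int = 3, max_len: int = 10,
                length_penalty: float = 1.0,
                no_repeat_ngram_size: int = 0) -> List[str]:
    beams: List[Tuple[List[str], float]] = [([], 0.0)]
    finished: List[Tuple[List[str], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                extended = tokens + [tok]
                # Drop continuations that would repeat an earlier n-gram.
                if tok != EOS and repeats_ngram(extended, no_repeat_ngram_size):
                    continue
                candidates.append((extended, score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses that hit max_len without EOS
    if not finished:
        return []

    def normalised(item: Tuple[List[str], float]) -> float:
        tokens, score = item
        return score / (len(tokens) ** length_penalty)

    best_tokens, _ = max(finished, key=normalised)
    return [t for t in best_tokens if t != EOS]


# Made-up distribution: "a" is the best first token, but "b z" is the
# most probable full sequence (0.4 * 0.99 > 0.6 * 0.5).
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.99, "w": 0.01},
}


def toy_step(prefix: Sequence[str]) -> Dict[str, float]:
    probs = TABLE.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy_caption = beam_search(toy_step, beam_width=1)  # equivalent to argmax
beam_caption = beam_search(toy_step, beam_width=2)
```

With `beam_width=1` the search commits to `a` and can never reach the higher-probability sequence; with `beam_width=2` it keeps `b` alive long enough to find it.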
|
| 131 |
+
|
| 132 |
+
The weights currently committed under [`models/v1.0.0/`](models/v1.0.0/) are the **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py): the architecture is wired correctly, but every weight is randomly initialised. They exist to exercise the serving stack (lifespan, predictor wiring, multipart upload, frontend integration) before a real COCO-trained checkpoint is dropped in. Captions returned by the live API today will therefore look like noise; that is the *intended* state of the bootstrap path, not a regression, and it will persist until a properly COCO-trained checkpoint replaces those files.
|
| 133 |
+
|
| 134 |
+
This gap is being addressed through the **stabilized training workflow** introduced at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml), which gates the convergence-stability primitives behind explicit, ablatable flags rather than rewriting the baseline.
|
| 135 |
+
|
| 136 |
+
### Accuracy investigation (ongoing)
|
| 137 |
+
|
| 138 |
+
The shift from "notebook reproduction" to "modular pipeline that *also* trains well" surfaced several concrete findings, each addressed in code rather than in commentary:
|
| 139 |
+
|
| 140 |
+
- **Greedy decoding limited caption quality and diversity.** Argmax-per-step decoding routinely picked the locally-most-probable token regardless of how that affected the overall sequence likelihood, biasing outputs toward a small "safe captions" basin. Beam-search infrastructure now lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) and dispatches through `CaptionPredictor` alongside the existing greedy path; decode strategy is selectable per inference call and per evaluation run.
|
| 141 |
+
- **BLEU-only evaluation hid behaviour the score did not reflect.** CIDEr, METEOR, and ROUGE-L are implemented under [`src/captioning/evaluation/`](src/captioning/evaluation/) (`cider.py`, `meteor.py`, `rouge.py`) and run through the same corpus-level runner that already produces BLEU-1..4. Every evaluation now emits the full metric set in a single `metrics.json`.
|
| 142 |
+
- **Validation-time dropout parity quirks** inherited from the notebook (`compute_loss_and_acc` ignoring its `training` argument, so dropout stayed active during validation) were identified during the parity audit. They are now gated behind an explicit config flag (`train.honour_training_flag_in_test_step`) so notebook parity is preserved by default and the conventional dropout-free validation path is opt-in via [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
|
| 143 |
+
- **Training stabilization experiments** were introduced as opt-in flags so they can be ablated cleanly rather than entangled with the baseline:
|
| 144 |
+
- label smoothing (`train.label_smoothing`),
|
| 145 |
+
- cosine LR schedule (`train.lr_schedule: cosine`),
|
| 146 |
+
- warmup steps (`train.warmup_steps`),
|
| 147 |
+
- dropout-free validation path (`train.honour_training_flag_in_test_step`).
|
| 148 |
+
- A complete experimental training config (not a thin override) lives at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml). It is byte-for-byte identical to [`configs/base.yaml`](configs/base.yaml) except for the four flags above, so any quality delta between the two runs is attributable to those flags alone.
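
To make the `train.lr_schedule: cosine` and `train.warmup_steps` flags above concrete, here is a minimal sketch of a warmup-then-cosine learning-rate schedule. It is an illustration under stated assumptions, not the trainer's actual code: the function name `lr_at_step`, the linear warmup shape, the `min_lr` floor, and all numeric values are hypothetical.

```python
import math


def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to `base_lr`, then cosine decay down to `min_lr`."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the first updates use a small learning rate.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Hypothetical run: 100 steps total, 10 of them warmup.
schedule = [lr_at_step(s, base_lr=1e-3, warmup_steps=10, total_steps=100)
            for s in range(101)]
```

The schedule rises for the first `warmup_steps` updates, peaks at `base_lr`, and then decays smoothly, which avoids large early updates while the decoder's embeddings are still close to their initial values.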
|
| 149 |
+
|
| 150 |
+
These changes are aimed at convergence stability and caption generalisation **before** Phase 3 model upgrades. Comparing the original CNN + Transformer against modern multimodal baselines is only meaningful once the original is trained to the strongest version of itself the architecture can support.
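
Of the stabilization flags, label smoothing (`train.label_smoothing`) is the one most directly aimed at the collapse onto a few high-confidence captions. The sketch below shows the standard formulation in plain Python, assuming the common variant in which the one-hot target is mixed with a uniform distribution over the vocabulary; the function name and the example logits are hypothetical, and the repository's `losses.py` may differ.

```python
import math
from typing import Sequence


def smoothed_cross_entropy(logits: Sequence[float], target: int,
                           smoothing: float = 0.1) -> float:
    """Cross-entropy against a target mixed with a uniform distribution."""
    vocab = len(logits)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Target mass: (1 - smoothing) on the true token, the rest spread evenly.
    off_weight = smoothing / vocab
    on_weight = 1.0 - smoothing + off_weight
    return -sum((on_weight if i == target else off_weight) * lp
                for i, lp in enumerate(log_probs))


# A very confident, correct prediction: almost free without smoothing,
# but penalised once some target mass sits on the other tokens.
confident = [10.0, 0.0, 0.0]
hard_loss = smoothed_cross_entropy(confident, target=0, smoothing=0.0)
soft_loss = smoothed_cross_entropy(confident, target=0, smoothing=0.1)
```

Because the smoothed loss keeps growing as the model becomes more confident, it discourages the saturated output distributions that make a decoder fall back on the same few captions.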
|
| 151 |
+
|
| 152 |
+
---
|
| 153 |
+
|
| 154 |
## Project structure
|
| 155 |
|
| 156 |
```
|
|
|
|
| 166 |
│   ├── models/        encoder_cnn.py · transformer_encoder.py · embeddings.py
|
| 167 |
│   │                  transformer_decoder.py · captioning_model.py · factory.py
|
| 168 |
│   ├── training/      losses.py · callbacks.py · trainer.py
|
| 169 |
+
│   ├── inference/     image_loader.py · greedy.py · beam.py · predictor.py
|
| 170 |
+
│   ├── evaluation/    bleu.py · cider.py · meteor.py · rouge.py
|
| 171 |
+
│   │                  runner.py · benchmark.py · inspection.py · tokenization.py
|
| 172 |
│   └── utils/         logging.py · seed.py · hashing.py
|
| 173 |
│
|
| 174 |
├── backend/           # Phase 2A: FastAPI inference service
|
|
|
|
| 204 |
│
|
| 205 |
├── configs/
|
| 206 |
│   ├── base.yaml      # IEEE hyperparameters (cell 6 mirror)
|
| 207 |
+
│   └── train/
|
| 208 |
+
│       ├── debug.yaml         # CI smoke override
|
| 209 |
+
│       └── stabilized.yaml    # Phase 1b stability experiment (label smoothing, cosine LR, warmup)
|
| 210 |
│
|
| 211 |
├── scripts/
|
| 212 |
│   ├── train.py · evaluate.py · predict.py
|
| 213 |
+
│   ├── inspect_predictions.py        # Per-sample diagnostics + diagnostics.jsonl writer
|
| 214 |
│   ├── bootstrap_dev_artifacts.py    # Smoke-test artefacts so the API can boot pre-training
|
| 215 |
│   └── notebook_module_audit.py      # Parity gate vs. notebook
|
| 216 |
│
|
|
|
|
| 550 |
## Limitations
|
| 551 |
|
| 552 |
- The model produces generic captions on cluttered or rare-object scenes. This is a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
|
| 553 |
+
- The modular pipeline has not yet reproduced the IEEE notebook's BLEU-4 ~24 on a freshly trained checkpoint; see [Current model quality status](#current-model-quality-status). The bootstrap weights shipped under [`models/v1.0.0/`](models/v1.0.0/) are intentionally random and exist only for architectural smoke testing.
|
| 554 |
+
- Beam search is implemented ([`inference/beam.py`](src/captioning/inference/beam.py)) and selectable per call/run, but a head-to-head benchmark against greedy on a real checkpoint is part of the in-progress Phase 1b validation, not a published result yet.
|
| 555 |
+
- CIDEr / METEOR / ROUGE-L are implemented ([`evaluation/`](src/captioning/evaluation/)) and emitted into `metrics.json` per run; finalised numbers from the modular pipeline are pending a stabilized COCO-trained checkpoint.
|
| 556 |
- Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
|
|
|
|
| 557 |
|
| 558 |
These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
|
| 559 |
|
| 560 |
---
|
| 561 |
|
| 562 |
+
## Experimental evaluation pipeline
|
| 563 |
+
|
| 564 |
+
The repository is evolving from a "research notebook reproduction" into a reproducible experimentation platform. Evaluation is no longer a single BLEU number printed at the end of training; it is a structured set of artefacts that any future run, including the Phase 3 multimodal baselines, can be diffed against.
|
| 565 |
+
|
| 566 |
+
The pieces:
|
| 567 |
+
|
| 568 |
+
- **[`scripts/evaluate.py`](scripts/evaluate.py)**: single entrypoint for full corpus evaluation. Loads a checkpoint + tokenizer, runs decoding (greedy or beam) over the COCO validation slice, computes BLEU-1..4 / CIDEr / METEOR / ROUGE-L, and writes a versioned artefact set under `results/<run_id>/`.
|
| 569 |
+
- **[`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)**: per-sample diagnostic view. Prints N random predictions vs. references with sentence-level BLEU-4 / ROUGE-L, prediction length, longest repeated-token run, and a set of failure flags (`empty` / `very_short` / `repetitive` / `under_length`). Used when the aggregate metric moves but the qualitative behaviour does not.
|
| 570 |
+
- **Benchmark runner utilities**: [`src/captioning/evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py) defines `RunMeta` and `write_run_artifacts(...)`, the contract every evaluation run honours. Phase 3 cross-model comparison code joins multiple `results/<run_id>/` directories without bespoke parsers per model.
|
| 571 |
+
- **Greedy vs. beam evaluation support**: the same evaluator accepts `--decode-strategy greedy|beam` plus beam-search controls (`--beam-width`, `--length-penalty`, `--no-repeat-ngram-size`), so a single command-line difference produces directly comparable artefact sets for the same checkpoint. Beam-search implementation lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py).
|
| 572 |
+
- **`metrics.json` outputs**: every evaluation writes a typed metric report (BLEU-1..4, ROUGE-L, METEOR, CIDEr) plus run metadata in machine-readable form. The Phase 3 comparison plots will read these files directly, with no per-run hand-typing of numbers into spreadsheets.
|
| 573 |
+
- **`diagnostics.jsonl` inspection flow**: the same per-sample diagnostic rows that `scripts/inspect_predictions.py` prints to stdout are emitted as JSONL alongside the metrics. The downstream loader is whatever pandas / DuckDB query happens to be useful that day, instead of a bespoke parser per investigation.
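
The per-sample diagnostics described above can be sketched as follows. This is an illustrative reconstruction, not the code in `inspection.py`: the flag names come from this README, but the thresholds (`very_short_max`, `repeat_run_min`, `under_length_ratio`), the function names, the row layout, and the example caption are all assumptions.

```python
import json
from typing import Dict, List


def longest_repeat_run(tokens: List[str]) -> int:
    """Length of the longest run of one token repeated back-to-back."""
    best = run = 0
    prev = None
    for tok in tokens:
        run = run + 1 if tok == prev else 1
        best = max(best, run)
        prev = tok
    return best


def failure_flags(prediction: str, references: List[str],
                  very_short_max: int = 2, repeat_run_min: int = 3,
                  under_length_ratio: float = 0.5) -> Dict[str, bool]:
    """Flag degenerate captions; every threshold here is a hypothetical default."""
    tokens = prediction.lower().split()
    ref_lens = [len(ref.split()) for ref in references]
    mean_ref_len = sum(ref_lens) / len(ref_lens) if ref_lens else 0.0
    return {
        "empty": len(tokens) == 0,
        "very_short": 0 < len(tokens) <= very_short_max,
        "repetitive": longest_repeat_run(tokens) >= repeat_run_min,
        "under_length": len(tokens) < under_length_ratio * mean_ref_len,
    }


def diagnostic_row(prediction: str, references: List[str]) -> str:
    """Serialise one sample as a single JSONL line."""
    tokens = prediction.lower().split()
    return json.dumps({
        "prediction": prediction,
        "length": len(tokens),
        "longest_repeat_run": longest_repeat_run(tokens),
        "flags": failure_flags(prediction, references),
    })


# Made-up example of a collapsed caption.
row = diagnostic_row("a a a a", ["a man riding a horse on the beach"])
```

Writing one such line per sample keeps the file trivially loadable with `pandas.read_json(path, lines=True)` or an equivalent JSONL reader.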
|
| 574 |
+
|
| 575 |
+
### Current limitations
|
| 576 |
+
|
| 577 |
+
- **No fresh fully-trained stabilized checkpoint is committed yet.** The stabilized training workflow exists in code; the resulting weights file does not yet sit under [`models/v1.0.0/`](models/v1.0.0/).
|
| 578 |
+
- **Current repo weights are bootstrap/dev artefacts**; see [Current model quality status](#current-model-quality-status). They exist for serving-stack smoke tests, not for producing usable captions.
|
| 579 |
+
- **Benchmark numbers from the modular pipeline are not yet finalized.** The metric harness is in place; the matching checkpoint to publish numbers from is not.
|
| 580 |
+
- **Phase 3 multimodal baselines (BLIP / ViT-GPT2 / GIT) are planned** specifically because the original CNN + Transformer architecture has a quality ceiling that no amount of decoding tuning or schedule tweaking will lift past modern foundation-model baselines. Stabilization here is the floor; Phase 3 is the path past it.
|
| 581 |
+
|
| 582 |
+
---
|
| 583 |
+
|
| 584 |
## Roadmap
|
| 585 |
|
| 586 |
+
- **Phase 1b** (in progress): beam search ✅, CIDEr / METEOR / ROUGE-L ✅ ([`evaluation/cider.py`](src/captioning/evaluation/cider.py), [`meteor.py`](src/captioning/evaluation/meteor.py), [`rouge.py`](src/captioning/evaluation/rouge.py)), stabilized training workflow ✅ ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)), evaluation benchmark runner ✅ ([`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)), prediction inspection tooling ✅ ([`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)). Full retraining + benchmark validation on COCO is still in progress: the metric harness is in place, the matching checkpoint is not yet committed.
|
| 587 |
- **Phase 2A** ✅: FastAPI backend, lifespan-managed predictor singleton, multipart inference endpoint, structured logging + request IDs, Pydantic schemas, Swagger/OpenAPI docs, health/readiness probe.
|
| 588 |
- **Phase 2B** ✅: React 19 + Vite 8 + Tailwind v4 SPA, drag/drop upload UX, live API integration against `POST /v1/captions`, env-driven `VITE_API_BASE`, `AbortController` timeouts, typed `ApiError` classification, polled health badge with auto-recovery, CORS allow-list wired through the backend YAML config.
|
| 589 |
- **Phase 2C**: Deployment integration with a HuggingFace Spaces backend, Vercel-hosted frontend, production CORS allow-list, and GitHub Actions CI/CD across both packages.
|
|
|
|
| 608 |
- Responsive Tailwind v4 inference interface: single-column layout under the `lg` breakpoint, sticky header with live status, modular component split under [`frontend/src/components/`](frontend/src/components/).
|
| 609 |
- Typed API communication: SPA consumes the same Pydantic `CaptionResponse` shape the backend emits; caption, `model_version`, `decode_strategy`, `latency_ms`, and `request_id` render directly from the wire payload.
|
| 610 |
- Production-style frontend architecture: dedicated [`services/api.js`](frontend/src/services/api.js) boundary, env-driven `VITE_API_BASE` with safe fallback, lint-clean flat ESLint config, static-asset build via `npm run build`.
|
| 611 |
+
- Beam-search decoding: [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) dispatched through `CaptionPredictor` alongside greedy, with length penalty, repetition penalty, and no-repeat n-gram blocking.
|
| 612 |
+
- Multi-metric evaluation: corpus BLEU-1..4 plus CIDEr / METEOR / ROUGE-L under a single runner ([`src/captioning/evaluation/`](src/captioning/evaluation/)), emitted as `metrics.json` per run.
|
| 613 |
+
- Benchmark runner: versioned `results/<run_id>/` artefact contract via [`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py), designed so Phase 3 cross-model comparison can join runs without bespoke parsers.
|
| 614 |
+
- Prediction inspection tooling: [`scripts/inspect_predictions.py`](scripts/inspect_predictions.py) for per-sample sentence-level BLEU / ROUGE-L, length and repetition diagnostics, and failure-flag breakdown.
|
| 615 |
+
- Stabilized training configs: opt-in label smoothing, cosine LR schedule, warmup steps, and dropout-free validation behind explicit flags in [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
|
| 616 |
+
- Reproducible evaluation pipeline: `metrics.json` + `predictions.jsonl` + `diagnostics.jsonl` + `run_meta.json` + `report.md` per run, so any two runs can be diffed mechanically rather than re-typed into a spreadsheet.
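
As a rough illustration of what diffing two runs mechanically could look like, the sketch below compares the `metrics.json` files of two run directories. The schema of `metrics.json` is not specified in this README, so the code assumes a flat JSON object and simply ignores non-numeric entries; the helper names, the metric keys (`bleu_4`, `cider`), and the numbers are placeholders, not results.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_metrics(run_dir: Path) -> Dict[str, float]:
    """Read numeric entries from <run_dir>/metrics.json (flat layout assumed)."""
    data = json.loads((run_dir / "metrics.json").read_text(encoding="utf-8"))
    return {key: float(value) for key, value in data.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)}


def diff_metrics(baseline: Dict[str, float],
                 candidate: Dict[str, float]) -> Dict[str, float]:
    """Per-metric delta (candidate minus baseline) over the shared metrics."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


# Demo with placeholder numbers written to a temporary results/ tree.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp)
    dummy_runs = {
        "run_a": {"bleu_4": 0.25, "cider": 0.5, "decode_strategy": "greedy"},
        "run_b": {"bleu_4": 0.5, "cider": 0.75, "decode_strategy": "beam"},
    }
    for run_id, values in dummy_runs.items():
        (results / run_id).mkdir()
        (results / run_id / "metrics.json").write_text(
            json.dumps(values), encoding="utf-8")
    delta = diff_metrics(load_metrics(results / "run_a"),
                         load_metrics(results / "run_b"))
```

Restricting the comparison to keys present in both files means a run that lacks a metric is skipped for that metric rather than reported as a spurious drop to zero.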
|
| 617 |
|
| 618 |
---
|
| 619 |
|