apoorvrajdev committed on
Commit a0f0210 · 1 Parent(s): 91a1214

docs(readme): document stabilization phase and evaluation pipeline

Files changed (1)
  1. README.md +72 -6
README.md CHANGED
@@ -118,6 +118,39 @@ Outputs above are from the IEEE notebook; the modular pipeline reproduces these
118
 
119
  ---
120
 
121
  ## Project structure
122
 
123
  ```
@@ -133,8 +166,9 @@ image-captioning-system/
133
  │ ├── models/ encoder_cnn.py · transformer_encoder.py · embeddings.py
134
  │ │ transformer_decoder.py · captioning_model.py · factory.py
135
  │ ├── training/ losses.py · callbacks.py · trainer.py
136
- │ ├── inference/ image_loader.py · greedy.py · predictor.py
137
- │ ├── evaluation/ bleu.py
138
  │ └── utils/ logging.py · seed.py · hashing.py
139
  │
140
  ├── backend/ # Phase 2A - FastAPI inference service
@@ -170,10 +204,13 @@ image-captioning-system/
170
  │
171
  ├── configs/
172
  │ ├── base.yaml # IEEE hyperparameters (cell 6 mirror)
173
- │ └── train/debug.yaml # CI smoke override
174
  │
175
  ├── scripts/
176
  │ ├── train.py · evaluate.py · predict.py
177
  │ ├── bootstrap_dev_artifacts.py # Smoke-test artefacts so the API can boot pre-training
178
  │ └── notebook_module_audit.py # Parity gate vs. notebook
179
  │
@@ -513,17 +550,40 @@ This is what separates this repository from a notebook conversion:
513
  ## Limitations
514
 
515
  - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
516
- - Greedy decoding only; beam search is a Phase 1b addition.
517
  - Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
518
- - BLEU is the only metric in v1; CIDEr / METEOR / ROUGE-L slot into the same runner interface in Phase 1b.
519
 
520
  These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
521
 
522
  ---
523
 
524
  ## Roadmap
525
 
526
- - **Phase 1b**: beam search, CIDEr / METEOR / ROUGE-L, masked accuracy parity-fix, label smoothing, warmup + cosine LR schedule.
527
  - **Phase 2A** ✅: FastAPI backend, lifespan-managed predictor singleton, multipart inference endpoint, structured logging + request IDs, Pydantic schemas, Swagger/OpenAPI docs, health/readiness probe.
528
  - **Phase 2B** ✅: React 19 + Vite 8 + Tailwind v4 SPA, drag/drop upload UX, live API integration against `POST /v1/captions`, env-driven `VITE_API_BASE`, `AbortController` timeouts, typed `ApiError` classification, polled health badge with auto-recovery, CORS allow-list wired through the backend YAML config.
529
  - **Phase 2C**: Deployment integration: HuggingFace Spaces backend, Vercel-hosted frontend, production CORS allow-list, GitHub Actions CI/CD across both packages.
@@ -548,6 +608,12 @@ Detailed plan: [`docs/restructure-plan.md`](docs/restructure-plan.md).
548
  - Responsive Tailwind v4 inference interface: single-column layout under the `lg` breakpoint, sticky header with live status, modular component split under [`frontend/src/components/`](frontend/src/components/).
549
  - Typed API communication: SPA consumes the same Pydantic `CaptionResponse` shape the backend emits; caption, `model_version`, `decode_strategy`, `latency_ms`, and `request_id` render directly from the wire payload.
550
  - Production-style frontend architecture: dedicated [`services/api.js`](frontend/src/services/api.js) boundary, env-driven `VITE_API_BASE` with safe fallback, lint-clean flat ESLint config, static-asset build via `npm run build`.
551
 
552
  ---
553
 
 
118
 
119
  ---
120
 
121
+ ## Current model quality status
122
+
123
+ The frontend, backend, and inference pipeline are operational end-to-end against the modular package, but **caption quality from the current modular pipeline is still below expectations**. The IEEE notebook reported BLEU-4 ~24; a freshly trained checkpoint produced by the modular trainer has not yet reproduced that figure on COCO. The serving stack is production-style and ready for a real checkpoint; what is missing is the checkpoint itself.
124
+
125
+ Current engineering effort is focused on:
126
+
127
+ - **Training stability**: diagnosing why early modular training runs collapse onto a small set of high-frequency captions instead of generalising.
128
+ - **Evaluation correctness**: moving from a single BLEU score to a full corpus-level metric suite with deterministic tokenisation, so two runs against the same slice are mechanically comparable.
129
+ - **Decoding improvements**: replacing greedy-only generation with beam search, repetition controls, and length normalisation.
130
+ - **Reproducible benchmarking**: emitting one consistent artefact set per evaluation run so any two runs (or any two models) can be diffed without bespoke parsing per checkpoint.
131
+
132
+ The weights currently committed under [`models/v1.0.0/`](models/v1.0.0/) are the **bootstrap dev artefacts** produced by [`scripts/bootstrap_dev_artifacts.py`](scripts/bootstrap_dev_artifacts.py): the architecture is wired correctly, but every weight is randomly initialised. They exist to exercise the serving stack (lifespan, predictor wiring, multipart upload, frontend integration) before a real COCO-trained checkpoint is dropped in. Captions returned by the live API today will therefore look like noise; that is the *intended* state of the bootstrap path, not a regression, and it will persist until a properly COCO-trained checkpoint replaces those files.
133
+
134
+ This gap is being addressed through the **stabilized training workflow** introduced at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml), which gates the convergence-stability primitives behind explicit, ablatable flags rather than rewriting the baseline.
135
+
136
+ ### Accuracy investigation (ongoing)
137
+
138
+ The shift from "notebook reproduction" to "modular pipeline that *also* trains well" surfaced several concrete findings, each addressed in code rather than in commentary:
139
+
140
+ - **Greedy decoding limited caption quality and diversity.** Argmax-per-step decoding routinely picked the locally-most-probable token regardless of how that affected the overall sequence likelihood, biasing outputs toward a small "safe captions" basin. Beam-search infrastructure now lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) and dispatches through `CaptionPredictor` alongside the existing greedy path; decode strategy is selectable per inference call and per evaluation run.
141
+ - **BLEU-only evaluation hid behaviour the score did not reflect.** CIDEr, METEOR, and ROUGE-L are implemented under [`src/captioning/evaluation/`](src/captioning/evaluation/) (`cider.py`, `meteor.py`, `rouge.py`) and run through the same corpus-level runner that already produces BLEU-1..4. Every evaluation now emits the full metric set in a single `metrics.json`.
142
+ - **Validation-time dropout parity quirks** inherited from the notebook (`compute_loss_and_acc` ignoring its `training` argument, so dropout stayed active during validation) were identified during the parity audit. They are now gated behind an explicit config flag (`train.honour_training_flag_in_test_step`) so notebook parity is preserved by default and the conventional dropout-free validation path is opt-in via [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
143
+ - **Training stabilization experiments** were introduced as opt-in flags so they can be ablated cleanly rather than entangled with the baseline:
144
+ - label smoothing (`train.label_smoothing`),
145
+ - cosine LR schedule (`train.lr_schedule: cosine`),
146
+ - warmup steps (`train.warmup_steps`),
147
+ - dropout-free validation path (`train.honour_training_flag_in_test_step`).
148
+ - A complete experimental training config (not a thin override) lives at [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml). It matches [`configs/base.yaml`](configs/base.yaml) in every setting except the four flags above, so any quality delta between the two runs is attributable to those flags alone.
149
+
150
+ These changes are aimed at convergence stability and caption generalisation **before** Phase 3 model upgrades. Comparing the original CNN + Transformer against modern multimodal baselines is only meaningful once the original is trained to the strongest version of itself the architecture can support.
151
+
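To make the opt-in flags above more concrete, here is a minimal sketch of what a warmup plus cosine learning-rate schedule and a label-smoothed target could compute. It is not taken from the repository's trainer: the function names `warmup_cosine_lr` and `smoothed_target`, their parameters, and the choice of a linear warmup are assumptions made for illustration only.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-indexed optimiser step (hypothetical helper)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup budget already consumed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / decay_steps))
    # Cosine decay from base_lr down to min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def smoothed_target(true_index, vocab_size, smoothing):
    """One-hot target mixed with a uniform distribution over the vocabulary."""
    uniform_mass = smoothing / vocab_size
    target = [uniform_mass] * vocab_size
    target[true_index] += 1.0 - smoothing
    return target


if __name__ == "__main__":
    # Illustrative settings only; they are not the values in stabilized.yaml.
    for step in (0, 499, 500, 5000, 10000):
        print(step, warmup_cosine_lr(step, 1e-4, 500, 10000))
    print(smoothed_target(true_index=2, vocab_size=4, smoothing=0.1))
```

With smoothing set to 0 the target collapses back to a plain one-hot vector, and with `warmup_steps=0` the schedule is a pure cosine decay, which is what makes each piece easy to ablate on its own.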
152
+ ---
153
+
154
  ## Project structure
155
 
156
  ```
 
166
  │ ├── models/ encoder_cnn.py · transformer_encoder.py · embeddings.py
167
  │ │ transformer_decoder.py · captioning_model.py · factory.py
168
  │ ├── training/ losses.py · callbacks.py · trainer.py
169
+ │ ├── inference/ image_loader.py · greedy.py · beam.py · predictor.py
170
+ │ ├── evaluation/ bleu.py · cider.py · meteor.py · rouge.py
171
+ │ │ runner.py · benchmark.py · inspection.py · tokenization.py
172
  │ └── utils/ logging.py · seed.py · hashing.py
173
  │
174
  ├── backend/ # Phase 2A - FastAPI inference service
 
204
  │
205
  ├── configs/
206
  │ ├── base.yaml # IEEE hyperparameters (cell 6 mirror)
207
+ │ └── train/
208
+ │ ├── debug.yaml # CI smoke override
209
+ │ └── stabilized.yaml # Phase 1b stability experiment (label smoothing, cosine LR, warmup)
210
  │
211
  ├── scripts/
212
  │ ├── train.py · evaluate.py · predict.py
213
+ │ ├── inspect_predictions.py # Per-sample diagnostics + diagnostics.jsonl writer
214
  │ ├── bootstrap_dev_artifacts.py # Smoke-test artefacts so the API can boot pre-training
215
  │ └── notebook_module_audit.py # Parity gate vs. notebook
216
  │
 
550
  ## Limitations
551
 
552
  - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture, addressed in Phase 3 by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
553
+ - The modular pipeline has not yet reproduced the IEEE notebook's BLEU-4 ~24 on a freshly trained checkpoint; see [Current model quality status](#current-model-quality-status). The bootstrap weights shipped under [`models/v1.0.0/`](models/v1.0.0/) are intentionally random and exist only for architectural smoke testing.
554
+ - Beam search is implemented ([`inference/beam.py`](src/captioning/inference/beam.py)) and selectable per call/run, but a head-to-head benchmark against greedy on a real checkpoint is part of the in-progress Phase 1b validation, not a published result yet.
555
+ - CIDEr / METEOR / ROUGE-L are implemented ([`evaluation/`](src/captioning/evaluation/)) and emitted into `metrics.json` per run; finalised numbers from the modular pipeline are pending a stabilized COCO-trained checkpoint.
556
  - Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
 
557
 
558
  These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
559
 
560
  ---
561
 
562
+ ## Experimental evaluation pipeline
563
+
564
+ The repository is evolving from a "research notebook reproduction" into a reproducible experimentation platform. Evaluation is no longer a single BLEU number printed at the end of training; it is a structured set of artefacts that any future run, including the Phase 3 multimodal baselines, can be diffed against.
565
+
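As a rough sketch of what diffing two runs could look like, the snippet below compares the `metrics.json` files of two run directories. The actual schema of `metrics.json` is not shown in this README, so the code assumes a flat JSON object mapping metric names to numbers; the helper names `load_metrics` and `diff_runs`, the metric keys, and the demo values are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(run_dir):
    """Read metrics.json from a run directory and keep only numeric entries."""
    with open(Path(run_dir) / "metrics.json", encoding="utf-8") as handle:
        payload = json.load(handle)
    return {
        name: float(value)
        for name, value in payload.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def diff_runs(baseline_dir, candidate_dir):
    """Return {metric: candidate - baseline} for metrics present in both runs."""
    baseline = load_metrics(baseline_dir)
    candidate = load_metrics(candidate_dir)
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


if __name__ == "__main__":
    # Made-up placeholder numbers, written to a temp dir only to exercise the code.
    demo_runs = {
        "run_a": {"bleu_4": 0.10, "cider": 0.30},
        "run_b": {"bleu_4": 0.12, "cider": 0.29},
    }
    with tempfile.TemporaryDirectory() as root:
        for run_id, values in demo_runs.items():
            run_dir = Path(root) / run_id
            run_dir.mkdir()
            (run_dir / "metrics.json").write_text(json.dumps(values), encoding="utf-8")
        deltas = diff_runs(Path(root) / "run_a", Path(root) / "run_b")
        for name, delta in deltas.items():
            print(f"{name}: {delta:+.4f}")
```

Restricting the comparison to metrics present in both files keeps the diff meaningful when one run was evaluated with a smaller metric set.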
566
+ The pieces:
567
+
568
+ - **[`scripts/evaluate.py`](scripts/evaluate.py)**: single entrypoint for full corpus evaluation. Loads a checkpoint + tokenizer, runs decoding (greedy or beam) over the COCO validation slice, computes BLEU-1..4 / CIDEr / METEOR / ROUGE-L, and writes a versioned artefact set under `results/<run_id>/`.
569
+ - **[`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)**: per-sample diagnostic view. Prints N random predictions vs. references with sentence-level BLEU-4 / ROUGE-L, prediction length, longest repeated-token run, and a set of failure flags (`empty` / `very_short` / `repetitive` / `under_length`). Used when the aggregate metric moves but the qualitative behaviour does not.
570
+ - **Benchmark runner utilities**: [`src/captioning/evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py) defines `RunMeta` and `write_run_artifacts(...)`, the contract every evaluation run honours. Phase 3 cross-model comparison code is designed to join multiple `results/<run_id>/` directories without bespoke parsers per model.
571
+ - **Greedy vs. beam evaluation support**: the same evaluator accepts `--decode-strategy greedy|beam` plus beam-search controls (`--beam-width`, `--length-penalty`, `--no-repeat-ngram-size`), so a single command-line difference produces directly comparable artefact sets for the same checkpoint. Beam-search implementation lives at [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py).
572
+ - **`metrics.json` outputs**: every evaluation writes a typed metric report (BLEU-1..4, ROUGE-L, METEOR, CIDEr) plus run metadata in machine-readable form. The Phase 3 comparison plots will read these files directly; no per-run hand-typing of numbers into spreadsheets.
573
+ - **`diagnostics.jsonl` inspection flow**: the same per-sample diagnostic rows that `scripts/inspect_predictions.py` prints to stdout are emitted as JSONL alongside the metrics. The downstream loader is whatever pandas / DuckDB query happens to be useful that day, instead of a bespoke parser per investigation.
574
+
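The beam-search controls listed above (`--beam-width`, `--length-penalty`, `--no-repeat-ngram-size`) can be illustrated with a small self-contained sketch. This is not the code in `src/captioning/inference/beam.py`: the toy probability table, the function names, and the particular length normalisation (summed log-probability divided by length raised to the penalty) are assumptions chosen to keep the example short.

```python
import math

# Toy next-token probabilities keyed by prefix; any other prefix ends the caption.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"c": 0.4, "d": 0.35, "</s>": 0.25},
    ("b",): {"e": 0.9, "</s>": 0.1},
}


def toy_next_log_probs(tokens):
    """Stand-in for a decoder step: log-probabilities of each possible next token."""
    probs = TOY_MODEL.get(tuple(tokens), {"</s>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


def violates_no_repeat(tokens, candidate, ngram_size):
    """True if appending `candidate` would repeat an n-gram already in `tokens`."""
    if ngram_size <= 0 or len(tokens) + 1 < ngram_size:
        return False
    new_ngram = tuple(tokens[len(tokens) - ngram_size + 1:]) + (candidate,)
    seen = {tuple(tokens[i:i + ngram_size]) for i in range(len(tokens) - ngram_size + 1)}
    return new_ngram in seen


def beam_search(next_log_probs, eos="</s>", beam_width=3, max_len=10,
                length_penalty=1.0, no_repeat_ngram_size=0):
    """Return the best token list (without EOS) under a length-normalised score."""
    beams = [([], 0.0)]  # (tokens so far, summed log-probability)
    finished = []        # (tokens without EOS, summed log-probability, scored length)
    for _ in range(max_len):
        candidates = []
        for tokens, log_prob in beams:
            for token, token_log_prob in next_log_probs(tokens).items():
                if token != eos and violates_no_repeat(tokens, token, no_repeat_ngram_size):
                    continue  # blocked: would repeat an existing n-gram
                candidates.append((tokens + [token], log_prob + token_log_prob))
        # Keep only the beam_width most probable expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, log_prob in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens[:-1], log_prob, len(tokens)))
            else:
                beams.append((tokens, log_prob))
        if not beams:
            break
    # Hypotheses cut off at max_len still compete with the finished ones.
    finished.extend((tokens, log_prob, len(tokens)) for tokens, log_prob in beams)
    if not finished:
        return []
    best = max(finished, key=lambda item: item[1] / (max(1, item[2]) ** length_penalty))
    return best[0]


if __name__ == "__main__":
    print("greedy:", beam_search(toy_next_log_probs, beam_width=1))
    print("beam-2:", beam_search(toy_next_log_probs, beam_width=2))
```

In this toy table, a beam width of 1 behaves like greedy decoding and commits to the locally best first token, while a width of 2 keeps the second-best first token alive long enough to find a more probable full sequence. Repetition penalties are left out of the sketch.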
575
+ ### Current limitations
576
+
577
+ - **No fresh fully-trained stabilized checkpoint is committed yet.** The stabilized training workflow exists in code; the resulting weights file does not yet sit under [`models/v1.0.0/`](models/v1.0.0/).
578
+ - **Current repo weights are bootstrap/dev artefacts**; see [Current model quality status](#current-model-quality-status). They exist for serving-stack smoke tests, not for producing usable captions.
579
+ - **Benchmark numbers from the modular pipeline are not yet finalized.** The metric harness is in place; the matching checkpoint to publish numbers from is not.
580
+ - **Phase 3 multimodal baselines (BLIP / ViT-GPT2 / GIT) are planned** specifically because the original CNN + Transformer architecture has a quality ceiling that no amount of decoding tuning or schedule tweaking will lift past modern foundation-model baselines. Stabilization here is the floor; Phase 3 is the path past it.
581
+
582
+ ---
583
+
584
  ## Roadmap
585
 
586
+ - **Phase 1b** (in progress): beam search ✅, CIDEr / METEOR / ROUGE-L ✅ ([`evaluation/cider.py`](src/captioning/evaluation/cider.py), [`meteor.py`](src/captioning/evaluation/meteor.py), [`rouge.py`](src/captioning/evaluation/rouge.py)), stabilized training workflow ✅ ([`configs/train/stabilized.yaml`](configs/train/stabilized.yaml)), evaluation benchmark runner ✅ ([`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py)), prediction inspection tooling ✅ ([`scripts/inspect_predictions.py`](scripts/inspect_predictions.py)). Full retraining + benchmark validation on COCO is still in progress: the metric harness is in place, but the matching checkpoint is not yet committed.
587
  - **Phase 2A** ✅: FastAPI backend, lifespan-managed predictor singleton, multipart inference endpoint, structured logging + request IDs, Pydantic schemas, Swagger/OpenAPI docs, health/readiness probe.
588
  - **Phase 2B** ✅: React 19 + Vite 8 + Tailwind v4 SPA, drag/drop upload UX, live API integration against `POST /v1/captions`, env-driven `VITE_API_BASE`, `AbortController` timeouts, typed `ApiError` classification, polled health badge with auto-recovery, CORS allow-list wired through the backend YAML config.
589
  - **Phase 2C**: Deployment integration: HuggingFace Spaces backend, Vercel-hosted frontend, production CORS allow-list, GitHub Actions CI/CD across both packages.
 
608
  - Responsive Tailwind v4 inference interface: single-column layout under the `lg` breakpoint, sticky header with live status, modular component split under [`frontend/src/components/`](frontend/src/components/).
609
  - Typed API communication: SPA consumes the same Pydantic `CaptionResponse` shape the backend emits; caption, `model_version`, `decode_strategy`, `latency_ms`, and `request_id` render directly from the wire payload.
610
  - Production-style frontend architecture: dedicated [`services/api.js`](frontend/src/services/api.js) boundary, env-driven `VITE_API_BASE` with safe fallback, lint-clean flat ESLint config, static-asset build via `npm run build`.
611
+ - Beam-search decoding: [`src/captioning/inference/beam.py`](src/captioning/inference/beam.py) dispatched through `CaptionPredictor` alongside greedy, with length penalty, repetition penalty, and no-repeat n-gram blocking.
612
+ - Multi-metric evaluation: corpus BLEU-1..4 plus CIDEr / METEOR / ROUGE-L under a single runner ([`src/captioning/evaluation/`](src/captioning/evaluation/)), emitted as `metrics.json` per run.
613
+ - Benchmark runner: versioned `results/<run_id>/` artefact contract via [`evaluation/benchmark.py`](src/captioning/evaluation/benchmark.py), designed so Phase 3 cross-model comparison can join runs without bespoke parsers.
614
+ - Prediction inspection tooling: [`scripts/inspect_predictions.py`](scripts/inspect_predictions.py) for per-sample sentence-level BLEU / ROUGE-L, length and repetition diagnostics, and failure-flag breakdown.
615
+ - Stabilized training configs: opt-in label smoothing, cosine LR schedule, warmup steps, and dropout-free validation behind explicit flags in [`configs/train/stabilized.yaml`](configs/train/stabilized.yaml).
616
+ - Reproducible evaluation pipeline: `metrics.json` + `predictions.jsonl` + `diagnostics.jsonl` + `run_meta.json` + `report.md` per run, so any two runs can be diffed mechanically rather than re-typed into a spreadsheet.
617
 
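The per-sample diagnostics mentioned above (prediction length, longest repeated-token run, and the `empty` / `very_short` / `repetitive` / `under_length` flags) could be computed along the following lines. This is a sketch, not the logic in `scripts/inspect_predictions.py`: the whitespace tokenisation, the threshold values, the row field names, and the helper names `longest_repeated_run` and `diagnose` are assumptions for illustration.

```python
import json


def longest_repeated_run(tokens):
    """Length of the longest run of identical consecutive tokens."""
    longest = 0
    current = 0
    previous = object()  # sentinel that compares unequal to every token
    for token in tokens:
        current = current + 1 if token == previous else 1
        longest = max(longest, current)
        previous = token
    return longest


def diagnose(prediction, references, very_short_max=2, repeat_run_min=3,
             under_length_ratio=0.5):
    """Build one diagnostic row; all thresholds are assumed, not the repo's."""
    tokens = prediction.lower().split()
    reference_lengths = [len(reference.lower().split()) for reference in references]
    mean_reference_length = sum(reference_lengths) / max(1, len(reference_lengths))
    run = longest_repeated_run(tokens)

    flags = []
    if not tokens:
        flags.append("empty")
    elif len(tokens) <= very_short_max:
        flags.append("very_short")
    if run >= repeat_run_min:
        flags.append("repetitive")
    if tokens and len(tokens) < under_length_ratio * mean_reference_length:
        flags.append("under_length")

    return {
        "prediction": prediction,
        "length": len(tokens),
        "longest_repeat_run": run,
        "flags": flags,
    }


if __name__ == "__main__":
    # Invented captions, used only to show the JSONL row shape.
    row = diagnose("a a a a dog", ["a dog runs on the grass"])
    print(json.dumps(row))
```

Emitting each row with `json.dumps` on its own line yields a JSONL file that can be loaded later with any line-oriented JSON reader.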
618
  ---
619