Spaces:
Configuration error
Configuration error
| # Phase 1 — Modularisation (closeout) | |
| > Phase 1 lifts every line of code out of the IEEE notebook into a proper | |
| > Python package, behind a parity validation gate. No behaviour changes — | |
| > the same hyperparameters, the same TF ops, the same losses, the same | |
| > generation algorithm. What changes is *structure*: testable, reusable, and | |
| > ready for FastAPI to import directly in Phase 2. | |
| ## Updated folder structure | |
| ``` | |
| src/captioning/ | |
| ├── __init__.py # Public API + version | |
| ├── py.typed # PEP 561 marker — package ships type hints | |
| │ | |
| ├── config/ # Typed configuration (Pydantic v2) | |
| │ ├── __init__.py | |
| │ ├── schema.py # AppConfig, ModelConfig, TrainConfig, DataConfig, ServeConfig | |
| │ └── loader.py # load_config(yaml_path) -> AppConfig | |
| │ | |
| ├── preprocessing/ # Pure, stateless transforms (TRAIN ↔ SERVE shared) | |
| │ ├── __init__.py | |
| │ ├── caption.py # preprocess_caption — notebook cell 3 | |
| │ ├── image.py # preprocess_image_tensor + load_and_preprocess_image | |
| │ ├── tokenizer.py # CaptionTokenizer (wraps TextVectorization) | |
| │ └── augmentation.py # default_image_augmentation — notebook cell 15 | |
| │ | |
| ├── data/ # Stateful: I/O + dataset construction | |
| │ ├── __init__.py | |
| │ ├── coco.py # load_coco_annotations — notebook cell 2 | |
| │ ├── splits.py # make_image_level_splits — notebook cell 11 | |
| │ └── pipeline.py # build_train/val_pipeline — notebook cells 13-14 | |
| │ | |
| ├── models/ # Architecture (TF/Keras layers + top-level model) | |
| │ ├── __init__.py | |
| │ ├── encoder_cnn.py # InceptionV3 backbone — notebook cell 16 | |
| │ ├── transformer_encoder.py # 1-layer encoder — notebook cell 17 | |
| │ ├── embeddings.py # token + positional — notebook cell 18 | |
| │ ├── transformer_decoder.py # multi-head causal decoder — notebook cell 19 | |
| │ ├── captioning_model.py # ImageCaptioningModel — notebook cell 20 | |
| │ └── factory.py # build_caption_model(config, vocab_size) — notebook cell 21 | |
| │ | |
| ├── training/ # Loss, callbacks, orchestration | |
| │ ├── __init__.py | |
| │ ├── losses.py # masked_sparse_categorical_crossentropy — notebook cell 22 | |
| │ ├── callbacks.py # EarlyStopping (+ Phase 1b ModelCheckpoint, CSVLogger) | |
| │ └── trainer.py # Trainer.fit — notebook cell 23 | |
| │ | |
| ├── inference/ # Generation + FastAPI-friendly singleton | |
| │ ├── __init__.py | |
| │ ├── image_loader.py # load_image_from_path — notebook cell 25 | |
| │ ├── greedy.py # generate_caption_greedy — notebook cell 25 | |
| │ └── predictor.py # CaptionPredictor (Phase 2 FastAPI imports this) | |
| │ | |
| ├── evaluation/ # Caption-quality metrics | |
| │ ├── __init__.py | |
| │ └── bleu.py # corpus BLEU-4 via sacrebleu (Phase 1b adds CIDEr/METEOR/ROUGE) | |
| │ | |
| └── utils/ # Cross-cutting helpers | |
| ├── __init__.py | |
| ├── logging.py # structlog (JSON in prod, pretty in dev) | |
| ├── seed.py # set_global_seed | |
| └── hashing.py # sha256_file (paper-notebook freeze) | |
| configs/ | |
| ├── base.yaml # Mirrors notebook cell 6 hyperparams | |
| └── train/debug.yaml # CI smoke override (1 epoch, batch 8) | |
| scripts/ | |
| ├── __init__.py | |
| ├── train.py # python -m scripts.train --config configs/base.yaml | |
| ├── evaluate.py # BLEU-4 on val split, optional Markdown report | |
| ├── predict.py # CLI single-image inference | |
| └── notebook_module_audit.py # **Parity gate** — must pass before Phase 1b changes anything | |
| tests/ | |
| ├── __init__.py | |
| ├── conftest.py # autouse seed fixture, tiny corpus fixture | |
| └── unit/ | |
| ├── __init__.py | |
| ├── test_caption_preprocessing.py # 7 parametrised cases vs notebook baseline | |
| ├── test_config.py # default values, validation, env override, YAML loading | |
| ├── test_evaluation.py # BLEU smoke (perfect=100, ragged refs) | |
| ├── test_hashing.py # streaming SHA-256 | |
| ├── test_image_preprocessing.py # output shape + InceptionV3 range | |
| ├── test_splits.py # image-level disjointness, seed reproducibility | |
| └── test_tokenizer.py # fit/save/load round-trip | |
| .paper-notebook.sha256 # Locked notebook hash for `make freeze-paper-notebook` | |
| ``` | |
| ## Migration summary (notebook → modules) | |
| | Notebook cell | Lines extracted to | Behavioural change | | |
| |---|---|---| | |
| | 0 (imports) | spread across modules | none | | |
| | 1 (`BASE_PATH`) | `configs/base.yaml::data.base_path` | none | | |
| | 2 (load COCO) | `data/coco.py::load_coco_annotations` | + path-existence check (early failure); + seedable sampling (was non-deterministic) | | |
| | 3 (caption preprocess) | `preprocessing/caption.py::preprocess_caption` | none — pre-compiled regex for marginal speed | | |
| | 4 (apply preprocess) | done inside `load_coco_annotations` | none | | |
| | 6 (hyperparams) | `config/schema.py` + `configs/base.yaml` | typed and validated; env-overridable | | |
| | 7-9 (tokenizer fit + save) | `preprocessing/tokenizer.py::CaptionTokenizer.fit/.save` | + JSON sidecar for inspection; pickle preserved for compat | | |
| | 10 (StringLookup) | `preprocessing/tokenizer.py::CaptionTokenizer._build_lookups` | none | | |
| | 11 (image-level split) | `data/splits.py::make_image_level_splits` | + seedable; + uses `random.Random(seed)` to avoid mutating module-global RNG | | |
| | 13 (load_data) | `data/pipeline.py::_make_load_data_fn` + `preprocessing/image.py` | none | | |
| | 14 (tf.data) | `data/pipeline.py::build_{train,val}_pipeline` | none — val shuffle preserved for parity | | |
| | 15 (augmentation) | `preprocessing/augmentation.py::default_image_augmentation` | none | | |
| | 16 (CNN_Encoder) | `models/encoder_cnn.py::build_cnn_encoder` | none | | |
| | 17 (TransformerEncoderLayer) | `models/transformer_encoder.py` | none | | |
| | 18 (Embeddings) | `models/embeddings.py` | none | | |
| | 19 (TransformerDecoderLayer) | `models/transformer_decoder.py` | globals → constructor args (`vocab_size`, `max_len`); same defaults | | |
| | 20 (ImageCaptioningModel) | `models/captioning_model.py` | none — `training=True` quirk preserved (commented) | | |
| | 21 (wiring) | `models/factory.py::build_caption_model` | none | | |
| | 22 (compile) | `training/losses.py` + `training/callbacks.py` + `Trainer.compile` | none | | |
| | 23 (fit) | `training/trainer.py::Trainer.fit` | + writes `history.json` if output_dir given | | |
| | 25 (inference) | `inference/{image_loader,greedy,predictor}.py` | globals → arguments (`model`, `tokenizer`, `max_length`) | | |
| | 30 (save_weights) | `scripts/train.py` final step | none | | |
| **No silent behaviour rewrites.** The two intentional, additive changes are | |
| (a) seeds threaded through where the notebook had un-seeded randomness, and | |
| (b) optional output-directory persistence in the `Trainer`. Both are gated | |
| on caller arguments — passing `seed=None` or `output_dir=None` reproduces | |
| notebook behaviour exactly. | |
| ### Behavioural quirks preserved on purpose | |
| These are documented in code comments referencing this section. | |
| 1. **`compute_loss_and_acc` always passes `training=True`** | |
| ([captioning_model.py](../src/captioning/models/captioning_model.py)). | |
| The notebook's `test_step` calls this with `training=False` but the call | |
| ignores the argument and hardcodes `training=True` to the encoder/decoder. | |
| Result: dropout is active during validation in the IEEE results. We | |
| preserve this for parity. Phase 1b will fix it in a clearly-marked commit | |
| *after* the parity gate is green. | |
| 2. **Validation pipeline is shuffled** | |
| ([data/pipeline.py](../src/captioning/data/pipeline.py)). | |
| `build_val_pipeline` mirrors notebook cell 14 and includes `.shuffle()`, | |
| which is technically pointless for validation. Phase 1b removes it. | |
| 3. **Vocabulary closure timing**. | |
| The notebook's `TransformerDecoderLayer.__init__` reads | |
| `tokenizer.vocabulary_size()` from module scope. We require it to be | |
| passed in. Functionally identical when callers pass the right value; | |
| structurally cleaner. | |
| ## Parity validation status | |
| The `scripts/notebook_module_audit.py` script implements **four parity | |
| checks** comparing the modular path against re-implemented notebook cells: | |
| | Stage | Check | Tolerance | | |
| |---|---|---| | |
| | 1 | Caption preprocessing — string equality on 7 edge cases | exact | | |
| | 2 | Tokenizer vocabulary — set + ordering equality on a 20-caption corpus + encoding equality on a held-out caption | exact | | |
| | 3 | Image preprocessing — `tf.allclose` between `Resizing → preprocess_input` two ways | atol=1e-5 | | |
| | 4 | Decoder forward pass — shape + determinism at `training=False` | atol=1e-6 | | |
| **Status:** ⚠️ **Audit is wired up but has not been executed yet.** The | |
| project venv (`.venv/`) is on Python 3.13, which is outside the package | |
| requirement `>=3.10,<3.13`. TensorFlow 2.15 has no 3.13 wheels, so the | |
| runtime deps cannot install in this venv. The user must recreate the venv | |
| on Python 3.10 or 3.11 before the parity gate can run end-to-end. | |
| **Static-only verification done so far:** every Python file passes | |
| `py_compile.compile(..., doraise=True)`. | |
| A *full* BLEU/caption parity test (the kind that runs the IEEE notebook | |
| end-to-end and compares against a checkpoint loaded by the modular path) | |
| requires a trained `model.h5` checkpoint, which doesn't exist in this repo | |
| yet. Once Phase 2 publishes one to HuggingFace Hub, the audit will be | |
| extended with a fifth stage that loads the same weights both ways and | |
| asserts caption equality on a fixed image set. | |
| ## Technical debt remaining | |
| | # | Debt | Where | Phase that addresses it | | |
| |---|---|---|---| | |
| | 1 | `compute_loss_and_acc` ignores `training` parameter | [models/captioning_model.py](../src/captioning/models/captioning_model.py) | 1b | | |
| | 2 | Val pipeline shuffles unnecessarily | [data/pipeline.py](../src/captioning/data/pipeline.py) | 1b | | |
| | 3 | Beam search not implemented (greedy only) | [inference/predictor.py](../src/captioning/inference/predictor.py) | 1b | | |
| | 4 | LR fixed at Adam default; no warmup/cosine | [training/trainer.py](../src/captioning/training/trainer.py) | 1b | | |
| | 5 | Only BLEU; no CIDEr/METEOR/ROUGE | [evaluation/](../src/captioning/evaluation/) | 1b | | |
| | 6 | No GitHub Actions yet (CI runs nothing) | `.github/workflows/` | 2 | | |
| | 7 | No FastAPI app yet | [backend/](../backend/) | 2 | | |
| | 8 | venv on Python 3.13 (incompatible with TF 2.15) | `.venv/` | **immediate — see Recommended next commits** | | |
| | 9 | `models/factory.py` lazily builds modules; class-creation pattern is odd | `models/*.py` (`_build_*_class()` factories) | leaving as-is — it keeps TF out of the import path for unrelated callers | | |
| | 10 | No notebook-vs-trained-checkpoint caption parity test | `scripts/notebook_module_audit.py` | 2 (after first HF Hub upload) | | |
| ## Readiness assessment for Phase 2 (FastAPI integration) | |
| | Phase 2 requirement | Status | | |
| |---|---| | |
| | `CaptionPredictor` is a self-contained class | ✅ — [predictor.py](../src/captioning/inference/predictor.py), `from_artifacts()` is the entry point | | |
| | Model load is decoupled from request handling | ✅ — `from_artifacts()` does the load; `predict_*()` methods are pure functions of inputs | | |
| | Image preprocessing matches training byte-for-byte | ✅ — both paths share `preprocessing.image.preprocess_image_tensor` | | |
| | Tokenizer reload from disk works | ✅ — `CaptionTokenizer.load(directory, vocab_size, max_length)` with vocab.pkl + JSON sidecar | | |
| | Config validated at boot | ✅ — Pydantic `AppConfig` raises clearly on missing/typo'd fields | | |
| | Structured logging | ✅ — `utils.logging` emits JSON in production | | |
| | Warmup hook for first-request latency | ✅ — `predictor.warmup()` runs one dummy inference | | |
| | Singleton-friendly | ✅ — caller holds the instance; FastAPI `lifespan` will own one | | |
| | **Blocker for Phase 2:** trained `model.h5` available somewhere | ❌ — must train (or import from Kaggle notebook) before backend can serve a real caption | | |
| **Verdict: package is structurally ready for Phase 2.** The remaining | |
| gating item is producing or importing a `model.h5` checkpoint. Two paths: | |
| 1. **Re-train locally** — `python -m scripts.train --config configs/base.yaml` | |
| (requires COCO downloaded into `data/coco2017/`; ~12-18 hrs on CPU). | |
| 2. **Import from Kaggle** — the existing IEEE notebook on Kaggle can be re-run | |
| to produce `model.h5` + `vocab_coco.file`, then uploaded to HuggingFace | |
| Hub. This is the recommended path because it preserves the published BLEU. | |
| ## Recommended next commits | |
| Order matters: each commit should be reviewable in isolation. Break Phase 1 | |
| into the following sequence (one logical change per commit): | |
| ``` | |
| 1. chore(venv): document Python 3.10 requirement; add setup script | |
| 2. feat(utils): structured logging, seed, sha256 helpers | |
| 3. feat(config): Pydantic v2 schema + YAML loader | |
| 4. feat(preprocessing): caption + image transforms + CaptionTokenizer wrapper | |
| 5. feat(data): COCO loader, image-level splits, tf.data pipelines | |
| 6. feat(models): CNN encoder, Transformer encoder/decoder, captioning model, factory | |
| 7. feat(training): loss + callbacks + Trainer.fit | |
| 8. feat(inference): greedy generation + CaptionPredictor singleton | |
| 9. feat(evaluation): corpus BLEU-4 via sacrebleu | |
| 10. feat(scripts): train, evaluate, predict CLI entry points | |
| 11. test: unit tests for pure functions and TF-dependent smoke checks | |
| 12. feat(parity): notebook-module audit script gating Phase 1b changes | |
| 13. chore(notebook): lock paper-notebook hash for freeze CI check | |
| 14. docs: Phase 1 closeout (this file) | |
| ``` | |
| A single feature-branch PR (`feat/phase-1-modularisation`) collapsing all of | |
| the above is also acceptable — recruiter-grade reviewers will want to see | |
| the migration table, parity audit, and tests in one place. | |
| ### Suggested commit messages (verbatim) | |
| ``` | |
| chore(venv): pin Python to 3.10 and document setup | |
| The Phase 0 venv was created on Python 3.13, which has no | |
| tensorflow-cpu==2.15.0 wheels and falls outside the package | |
| requirement (>=3.10,<3.13). Recreate with: | |
| py -3.10 -m venv .venv | |
| .venv\Scripts\activate | |
| pip install -r requirements-dev.txt -r requirements-eval.txt | |
| pip install -e ".[hf,mlflow]" | |
| ``` | |
| ``` | |
| feat(captioning): extract IEEE notebook into modular package | |
| Lifts every line of notebooks/01_ieee_inceptionv3_transformer.ipynb into | |
| src/captioning/ behind a parity validation gate. Mirrors the notebook's | |
| behaviour byte-for-byte at fixed seeds; intentional additive improvements | |
| (seeded sampling, output-dir persistence, JSON vocab sidecar) are gated on | |
| caller arguments and disabled by default. | |
| Sub-packages: | |
| config/ Pydantic v2 schema + YAML loader | |
| preprocessing/ caption + image transforms + CaptionTokenizer wrapper | |
| data/ COCO loader + image-level splits + tf.data pipelines | |
| models/ CNN encoder + Transformer encoder/decoder + factory | |
| training/ loss + callbacks + Trainer | |
| inference/ greedy generation + CaptionPredictor singleton | |
| evaluation/ corpus BLEU-4 via sacrebleu | |
| utils/ structured logging + seed + sha256 | |
| Adds CLI entry points (scripts/{train,evaluate,predict}.py), a parity | |
| audit (scripts/notebook_module_audit.py), and a unit test suite covering | |
| all pure-Python paths. The Predictor exposes from_artifacts() and | |
| warmup() so Phase 2's FastAPI lifespan can wire it in unchanged. | |
| ``` | |
| ``` | |
| test(captioning): unit tests for pure modules + tokenizer round-trip | |
| Covers caption preprocessing (parametrised vs notebook baseline), | |
| config schema (defaults, validation, env override, YAML loading), | |
| image-level splits (disjointness, seed reproducibility, int truncation), | |
| hashing (stream vs one-shot equality), evaluation (perfect=100, ragged | |
| refs, length mismatch raises), tokenizer (fit/save/load round-trip, | |
| unfitted-error contract), image preprocessing (shape + range). | |
| TF-dependent tests use pytest.importorskip; pure-Python tests need no | |
| ML deps and are CI-runnable in <5s. | |
| ``` | |
| ``` | |
| feat(parity): notebook-module audit gating Phase 1b changes | |
| Four-stage parity check: caption preprocessing (exact), tokenizer | |
| vocabulary (set + ordering + encoding equality), image preprocessing | |
| (tf.allclose, atol=1e-5), decoder forward pass (shape + determinism at | |
| training=False). Each stage re-implements the relevant notebook cell | |
| inline so the ground truth is colocated with the test. Synthetic inputs | |
| let the audit run in seconds without needing the real COCO dataset. | |
| Run: python -m scripts.notebook_module_audit | |
| ``` | |
| ``` | |
| chore(notebook): lock paper-notebook hash for freeze CI check | |
| Adds .paper-notebook.sha256 with the SHA-256 of | |
| notebooks/01_ieee_inceptionv3_transformer.ipynb at the time of Phase 1 | |
| modularisation. The `make freeze-paper-notebook` target asserts this | |
| hash on every CI run; any byte change to the notebook fails the check. | |
| Phase 4 wires this into a required GitHub Actions status check on main. | |
| ``` | |
| ``` | |
| docs: Phase 1 closeout (modularisation complete) | |
| Migration table (notebook cell → module), parity validation status, | |
| preserved behavioural quirks, technical debt remaining, readiness | |
| assessment for Phase 2 FastAPI integration. Documents the venv setup | |
| gap (Python 3.13 vs project requirement 3.10/3.11) as the single | |
| remaining blocker before the parity audit can execute end-to-end. | |
| ``` | |
| ## Verification checklist (run before tagging Phase 1) | |
| ```powershell | |
| # 1. Recreate the venv with a supported Python (3.10 or 3.11). | |
| py -3.10 -m venv .venv | |
| .venv\Scripts\activate | |
| pip install -r requirements-dev.txt -r requirements-eval.txt | |
| pip install -e ".[hf,mlflow]" | |
| # 2. Run static checks. | |
| ruff check src/captioning scripts tests | |
| ruff format --check src/captioning scripts tests | |
| mypy src/captioning scripts | |
| # 3. Run unit tests. | |
| pytest tests/ -v | |
| # 4. Run the parity audit (the gate). | |
| python -m scripts.notebook_module_audit | |
| # 5. Verify the paper notebook is byte-stable. | |
| make freeze-paper-notebook | |
| ``` | |
| All five must pass green before merging Phase 1 and starting Phase 2. | |