Spaces:

apoorvrajdev
/

image-captioning-api

Configuration error

App Files Files Community

image-captioning-api / docs /PHASE_1_NOTES.md

apoorvrajdev

feat: finalize Phase 1 modular ML architecture

3a2e5f0 27 days ago

preview code

raw

history blame contribute delete

18.7 kB

	# Phase 1 — Modularisation (closeout)

	> Phase 1 lifts every line of code out of the IEEE notebook into a proper
	> Python package, behind a parity validation gate. No behaviour changes —
	> the same hyperparameters, the same TF ops, the same losses, the same
	> generation algorithm. What changes is structure: testable, reusable, and
	> ready for FastAPI to import directly in Phase 2.

	## Updated folder structure

	```
	src/captioning/
	├── __init__.py # Public API + version
	├── py.typed # PEP 561 marker — package ships type hints
	│
	├── config/ # Typed configuration (Pydantic v2)
	│ ├── __init__.py
	│ ├── schema.py # AppConfig, ModelConfig, TrainConfig, DataConfig, ServeConfig
	│ └── loader.py # load_config(yaml_path) -> AppConfig
	│
	├── preprocessing/ # Pure, stateless transforms (TRAIN ↔ SERVE shared)
	│ ├── __init__.py
	│ ├── caption.py # preprocess_caption — notebook cell 3
	│ ├── image.py # preprocess_image_tensor + load_and_preprocess_image
	│ ├── tokenizer.py # CaptionTokenizer (wraps TextVectorization)
	│ └── augmentation.py # default_image_augmentation — notebook cell 15
	│
	├── data/ # Stateful: I/O + dataset construction
	│ ├── __init__.py
	│ ├── coco.py # load_coco_annotations — notebook cell 2
	│ ├── splits.py # make_image_level_splits — notebook cell 11
	│ └── pipeline.py # build_train/val_pipeline — notebook cells 13-14
	│
	├── models/ # Architecture (TF/Keras layers + top-level model)
	│ ├── __init__.py
	│ ├── encoder_cnn.py # InceptionV3 backbone — notebook cell 16
	│ ├── transformer_encoder.py # 1-layer encoder — notebook cell 17
	│ ├── embeddings.py # token + positional — notebook cell 18
	│ ├── transformer_decoder.py # multi-head causal decoder — notebook cell 19
	│ ├── captioning_model.py # ImageCaptioningModel — notebook cell 20
	│ └── factory.py # build_caption_model(config, vocab_size) — notebook cell 21
	│
	├── training/ # Loss, callbacks, orchestration
	│ ├── __init__.py
	│ ├── losses.py # masked_sparse_categorical_crossentropy — notebook cell 22
	│ ├── callbacks.py # EarlyStopping (+ Phase 1b ModelCheckpoint, CSVLogger)
	│ └── trainer.py # Trainer.fit — notebook cell 23
	│
	├── inference/ # Generation + FastAPI-friendly singleton
	│ ├── __init__.py
	│ ├── image_loader.py # load_image_from_path — notebook cell 25
	│ ├── greedy.py # generate_caption_greedy — notebook cell 25
	│ └── predictor.py # CaptionPredictor (Phase 2 FastAPI imports this)
	│
	├── evaluation/ # Caption-quality metrics
	│ ├── __init__.py
	│ └── bleu.py # corpus BLEU-4 via sacrebleu (Phase 1b adds CIDEr/METEOR/ROUGE)
	│
	└── utils/ # Cross-cutting helpers
	├── __init__.py
	├── logging.py # structlog (JSON in prod, pretty in dev)
	├── seed.py # set_global_seed
	└── hashing.py # sha256_file (paper-notebook freeze)

	configs/
	├── base.yaml # Mirrors notebook cell 6 hyperparams
	└── train/debug.yaml # CI smoke override (1 epoch, batch 8)

	scripts/
	├── __init__.py
	├── train.py # python -m scripts.train --config configs/base.yaml
	├── evaluate.py # BLEU-4 on val split, optional Markdown report
	├── predict.py # CLI single-image inference
	└── notebook_module_audit.py # Parity gate — must pass before Phase 1b changes anything

	tests/
	├── __init__.py
	├── conftest.py # autouse seed fixture, tiny corpus fixture
	└── unit/
	├── __init__.py
	├── test_caption_preprocessing.py # 7 parametrised cases vs notebook baseline
	├── test_config.py # default values, validation, env override, YAML loading
	├── test_evaluation.py # BLEU smoke (perfect=100, ragged refs)
	├── test_hashing.py # streaming SHA-256
	├── test_image_preprocessing.py # output shape + InceptionV3 range
	├── test_splits.py # image-level disjointness, seed reproducibility
	└── test_tokenizer.py # fit/save/load round-trip

	.paper-notebook.sha256 # Locked notebook hash for `make freeze-paper-notebook`
	```

	## Migration summary (notebook → modules)

	\| Notebook cell \| Lines extracted to \| Behavioural change \|
	\|---\|---\|---\|
	\| 0 (imports) \| spread across modules \| none \|
	\| 1 (`BASE_PATH`) \| `configs/base.yaml::data.base_path` \| none \|
	\| 2 (load COCO) \| `data/coco.py::load_coco_annotations` \| + path-existence check (early failure); + seedable sampling (was non-deterministic) \|
	\| 3 (caption preprocess) \| `preprocessing/caption.py::preprocess_caption` \| none — pre-compiled regex for marginal speed \|
	\| 4 (apply preprocess) \| done inside `load_coco_annotations` \| none \|
	\| 6 (hyperparams) \| `config/schema.py` + `configs/base.yaml` \| typed and validated; env-overridable \|
	\| 7-9 (tokenizer fit + save) \| `preprocessing/tokenizer.py::CaptionTokenizer.fit/.save` \| + JSON sidecar for inspection; pickle preserved for compat \|
	\| 10 (StringLookup) \| `preprocessing/tokenizer.py::CaptionTokenizer._build_lookups` \| none \|
	\| 11 (image-level split) \| `data/splits.py::make_image_level_splits` \| + seedable; + uses `random.Random(seed)` to avoid mutating module-global RNG \|
	\| 13 (load_data) \| `data/pipeline.py::_make_load_data_fn` + `preprocessing/image.py` \| none \|
	\| 14 (tf.data) \| `data/pipeline.py::build_{train,val}_pipeline` \| none — val shuffle preserved for parity \|
	\| 15 (augmentation) \| `preprocessing/augmentation.py::default_image_augmentation` \| none \|
	\| 16 (CNN_Encoder) \| `models/encoder_cnn.py::build_cnn_encoder` \| none \|
	\| 17 (TransformerEncoderLayer) \| `models/transformer_encoder.py` \| none \|
	\| 18 (Embeddings) \| `models/embeddings.py` \| none \|
	\| 19 (TransformerDecoderLayer) \| `models/transformer_decoder.py` \| globals → constructor args (`vocab_size`, `max_len`); same defaults \|
	\| 20 (ImageCaptioningModel) \| `models/captioning_model.py` \| none — `training=True` quirk preserved (commented) \|
	\| 21 (wiring) \| `models/factory.py::build_caption_model` \| none \|
	\| 22 (compile) \| `training/losses.py` + `training/callbacks.py` + `Trainer.compile` \| none \|
	\| 23 (fit) \| `training/trainer.py::Trainer.fit` \| + writes `history.json` if output_dir given \|
	\| 25 (inference) \| `inference/{image_loader,greedy,predictor}.py` \| globals → arguments (`model`, `tokenizer`, `max_length`) \|
	\| 30 (save_weights) \| `scripts/train.py` final step \| none \|

	No silent behaviour rewrites. The two intentional, additive changes are
	(a) seeds threaded through where the notebook had un-seeded randomness, and
	(b) optional output-directory persistence in the `Trainer`. Both are gated
	on caller arguments — passing `seed=None` or `output_dir=None` reproduces
	notebook behaviour exactly.

	### Behavioural quirks preserved on purpose

	These are documented in code comments referencing this section.

	1. `compute_loss_and_acc` always passes `training=True`
	([captioning_model.py](../src/captioning/models/captioning_model.py)).
	The notebook's `test_step` calls this with `training=False` but the call
	ignores the argument and hardcodes `training=True` to the encoder/decoder.
	Result: dropout is active during validation in the IEEE results. We
	preserve this for parity. Phase 1b will fix it in a clearly-marked commit
	after the parity gate is green.

	2. Validation pipeline is shuffled
	([data/pipeline.py](../src/captioning/data/pipeline.py)).
	`build_val_pipeline` mirrors notebook cell 14 and includes `.shuffle()`,
	which is technically pointless for validation. Phase 1b removes it.

	3. Vocabulary closure timing.
	The notebook's `TransformerDecoderLayer.__init__` reads
	`tokenizer.vocabulary_size()` from module scope. We require it to be
	passed in. Functionally identical when callers pass the right value;
	structurally cleaner.

	## Parity validation status

	The `scripts/notebook_module_audit.py` script implements **four parity
	checks** comparing the modular path against re-implemented notebook cells:

	\| Stage \| Check \| Tolerance \|
	\|---\|---\|---\|
	\| 1 \| Caption preprocessing — string equality on 7 edge cases \| exact \|
	\| 2 \| Tokenizer vocabulary — set + ordering equality on a 20-caption corpus + encoding equality on a held-out caption \| exact \|
	\| 3 \| Image preprocessing — `tf.allclose` between `Resizing → preprocess_input` two ways \| atol=1e-5 \|
	\| 4 \| Decoder forward pass — shape + determinism at `training=False` \| atol=1e-6 \|

	Status: ⚠️ Audit is wired up but has not been executed yet. The
	project venv (`.venv/`) is on Python 3.13, which is outside the package
	requirement `>=3.10,<3.13`. TensorFlow 2.15 has no 3.13 wheels, so the
	runtime deps cannot install in this venv. The user must recreate the venv
	on Python 3.10 or 3.11 before the parity gate can run end-to-end.
	Static-only verification done so far: every Python file passes
	`py_compile.compile(..., doraise=True)`.

	A full BLEU/caption parity test (the kind that runs the IEEE notebook
	end-to-end and compares against a checkpoint loaded by the modular path)
	requires a trained `model.h5` checkpoint, which doesn't exist in this repo
	yet. Once Phase 2 publishes one to HuggingFace Hub, the audit will be
	extended with a fifth stage that loads the same weights both ways and
	asserts caption equality on a fixed image set.

	## Technical debt remaining

	\| # \| Debt \| Where \| Phase that addresses it \|
	\|---\|---\|---\|---\|
	\| 1 \| `compute_loss_and_acc` ignores `training` parameter \| [models/captioning_model.py](../src/captioning/models/captioning_model.py) \| 1b \|
	\| 2 \| Val pipeline shuffles unnecessarily \| [data/pipeline.py](../src/captioning/data/pipeline.py) \| 1b \|
	\| 3 \| Beam search not implemented (greedy only) \| [inference/predictor.py](../src/captioning/inference/predictor.py) \| 1b \|
	\| 4 \| LR fixed at Adam default; no warmup/cosine \| [training/trainer.py](../src/captioning/training/trainer.py) \| 1b \|
	\| 5 \| Only BLEU; no CIDEr/METEOR/ROUGE \| [evaluation/](../src/captioning/evaluation/) \| 1b \|
	\| 6 \| No GitHub Actions yet (CI runs nothing) \| `.github/workflows/` \| 2 \|
	\| 7 \| No FastAPI app yet \| [backend/](../backend/) \| 2 \|
	\| 8 \| venv on Python 3.13 (incompatible with TF 2.15) \| `.venv/` \| immediate — see Recommended next commits \|
	\| 9 \| `models/factory.py` lazily builds modules; class-creation pattern is odd \| `models/.py` (`_build__class()` factories) \| leaving as-is — it keeps TF out of the import path for unrelated callers \|
	\| 10 \| No notebook-vs-trained-checkpoint caption parity test \| `scripts/notebook_module_audit.py` \| 2 (after first HF Hub upload) \|

	## Readiness assessment for Phase 2 (FastAPI integration)

	\| Phase 2 requirement \| Status \|
	\|---\|---\|
	\| `CaptionPredictor` is a self-contained class \| ✅ — [predictor.py](../src/captioning/inference/predictor.py), `from_artifacts()` is the entry point \|
	\| Model load is decoupled from request handling \| ✅ — `from_artifacts()` does the load; `predict_*()` methods are pure functions of inputs \|
	\| Image preprocessing matches training byte-for-byte \| ✅ — both paths share `preprocessing.image.preprocess_image_tensor` \|
	\| Tokenizer reload from disk works \| ✅ — `CaptionTokenizer.load(directory, vocab_size, max_length)` with vocab.pkl + JSON sidecar \|
	\| Config validated at boot \| ✅ — Pydantic `AppConfig` raises clearly on missing/typo'd fields \|
	\| Structured logging \| ✅ — `utils.logging` emits JSON in production \|
	\| Warmup hook for first-request latency \| ✅ — `predictor.warmup()` runs one dummy inference \|
	\| Singleton-friendly \| ✅ — caller holds the instance; FastAPI `lifespan` will own one \|
	\| Blocker for Phase 2: trained `model.h5` available somewhere \| ❌ — must train (or import from Kaggle notebook) before backend can serve a real caption \|

	Verdict: package is structurally ready for Phase 2. The remaining
	gating item is producing or importing a `model.h5` checkpoint. Two paths:

	1. Re-train locally — `python -m scripts.train --config configs/base.yaml`
	(requires COCO downloaded into `data/coco2017/`; ~12-18 hrs on CPU).
	2. Import from Kaggle — the existing IEEE notebook on Kaggle can be re-run
	to produce `model.h5` + `vocab_coco.file`, then uploaded to HuggingFace
	Hub. This is the recommended path because it preserves the published BLEU.

	## Recommended next commits

	Order matters: each commit should be reviewable in isolation. Break Phase 1
	into the following sequence (one logical change per commit):

	```
	1. chore(venv): document Python 3.10 requirement; add setup script
	2. feat(utils): structured logging, seed, sha256 helpers
	3. feat(config): Pydantic v2 schema + YAML loader
	4. feat(preprocessing): caption + image transforms + CaptionTokenizer wrapper
	5. feat(data): COCO loader, image-level splits, tf.data pipelines
	6. feat(models): CNN encoder, Transformer encoder/decoder, captioning model, factory
	7. feat(training): loss + callbacks + Trainer.fit
	8. feat(inference): greedy generation + CaptionPredictor singleton
	9. feat(evaluation): corpus BLEU-4 via sacrebleu
	10. feat(scripts): train, evaluate, predict CLI entry points
	11. test: unit tests for pure functions and TF-dependent smoke checks
	12. feat(parity): notebook-module audit script gating Phase 1b changes
	13. chore(notebook): lock paper-notebook hash for freeze CI check
	14. docs: Phase 1 closeout (this file)
	```

	A single feature-branch PR (`feat/phase-1-modularisation`) collapsing all of
	the above is also acceptable — recruiter-grade reviewers will want to see
	the migration table, parity audit, and tests in one place.

	### Suggested commit messages (verbatim)

	```
	chore(venv): pin Python to 3.10 and document setup

	The Phase 0 venv was created on Python 3.13, which has no
	tensorflow-cpu==2.15.0 wheels and falls outside the package
	requirement (>=3.10,<3.13). Recreate with:

	py -3.10 -m venv .venv
	.venv\Scripts\activate
	pip install -r requirements-dev.txt -r requirements-eval.txt
	pip install -e ".[hf,mlflow]"
	```

	```
	feat(captioning): extract IEEE notebook into modular package

	Lifts every line of notebooks/01_ieee_inceptionv3_transformer.ipynb into
	src/captioning/ behind a parity validation gate. Mirrors the notebook's
	behaviour byte-for-byte at fixed seeds; intentional additive improvements
	(seeded sampling, output-dir persistence, JSON vocab sidecar) are gated on
	caller arguments and disabled by default.

	Sub-packages:
	config/ Pydantic v2 schema + YAML loader
	preprocessing/ caption + image transforms + CaptionTokenizer wrapper
	data/ COCO loader + image-level splits + tf.data pipelines
	models/ CNN encoder + Transformer encoder/decoder + factory
	training/ loss + callbacks + Trainer
	inference/ greedy generation + CaptionPredictor singleton
	evaluation/ corpus BLEU-4 via sacrebleu
	utils/ structured logging + seed + sha256

	Adds CLI entry points (scripts/{train,evaluate,predict}.py), a parity
	audit (scripts/notebook_module_audit.py), and a unit test suite covering
	all pure-Python paths. The Predictor exposes from_artifacts() and
	warmup() so Phase 2's FastAPI lifespan can wire it in unchanged.
	```

	```
	test(captioning): unit tests for pure modules + tokenizer round-trip

	Covers caption preprocessing (parametrised vs notebook baseline),
	config schema (defaults, validation, env override, YAML loading),
	image-level splits (disjointness, seed reproducibility, int truncation),
	hashing (stream vs one-shot equality), evaluation (perfect=100, ragged
	refs, length mismatch raises), tokenizer (fit/save/load round-trip,
	unfitted-error contract), image preprocessing (shape + range).

	TF-dependent tests use pytest.importorskip; pure-Python tests need no
	ML deps and are CI-runnable in <5s.
	```

	```
	feat(parity): notebook-module audit gating Phase 1b changes

	Four-stage parity check: caption preprocessing (exact), tokenizer
	vocabulary (set + ordering + encoding equality), image preprocessing
	(tf.allclose, atol=1e-5), decoder forward pass (shape + determinism at
	training=False). Each stage re-implements the relevant notebook cell
	inline so the ground truth is colocated with the test. Synthetic inputs
	let the audit run in seconds without needing the real COCO dataset.

	Run: python -m scripts.notebook_module_audit
	```

	```
	chore(notebook): lock paper-notebook hash for freeze CI check

	Adds .paper-notebook.sha256 with the SHA-256 of
	notebooks/01_ieee_inceptionv3_transformer.ipynb at the time of Phase 1
	modularisation. The `make freeze-paper-notebook` target asserts this
	hash on every CI run; any byte change to the notebook fails the check.
	Phase 4 wires this into a required GitHub Actions status check on main.
	```

	```
	docs: Phase 1 closeout (modularisation complete)

	Migration table (notebook cell → module), parity validation status,
	preserved behavioural quirks, technical debt remaining, readiness
	assessment for Phase 2 FastAPI integration. Documents the venv setup
	gap (Python 3.13 vs project requirement 3.10/3.11) as the single
	remaining blocker before the parity audit can execute end-to-end.
	```

	## Verification checklist (run before tagging Phase 1)

	```powershell
	# 1. Recreate the venv with a supported Python (3.10 or 3.11).
	py -3.10 -m venv .venv
	.venv\Scripts\activate
	pip install -r requirements-dev.txt -r requirements-eval.txt
	pip install -e ".[hf,mlflow]"

	# 2. Run static checks.
	ruff check src/captioning scripts tests
	ruff format --check src/captioning scripts tests
	mypy src/captioning scripts

	# 3. Run unit tests.
	pytest tests/ -v

	# 4. Run the parity audit (the gate).
	python -m scripts.notebook_module_audit

	# 5. Verify the paper notebook is byte-stable.
	make freeze-paper-notebook
	```

	All five must pass green before merging Phase 1 and starting Phase 2.