image-captioning-api / docs /restructure-plan.md

apoorvrajdev

feat: bootstrap production-grade ML repository tooling

b2594db 27 days ago

	# Production Restructuring Plan

	> Public, in-repo copy of the engineering plan that drives the transition from
	> a single-notebook research project into a deployable multimodal AI platform.
	> The original (with internal exploration notes) lives in the developer's
	> `~/.claude/plans/` directory; this version is the canonical public artefact.

	## Context

	This repository is the engineering home of an IEEE-published image-captioning
	research project. The published artefact is a single Jupyter notebook
	([`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb))
	implementing InceptionV3 (frozen) + custom Keras Transformer decoder
	trained on COCO 2017, reporting BLEU ~24.

	Goal: convert the repo into a recruiter-grade, production-style
	multimodal AI platform with a live free-tier demo, while **preserving the
	IEEE notebook byte-for-byte** as the canonical research artefact.

	Constraints:

	- Hosting budget: $0/month → HuggingFace Spaces (backend) + Vercel free
	(frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
	- Multimodal scope (v1): Tier 1 only — add three pretrained HuggingFace
	models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison
	demo. Tier 2/3/4 are listed under Future work only.

	---

	## 1. Folder Structure (target)

	```
	image-captioning-system/
	├── notebooks/
	│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
	├── src/captioning/        # Installable Python package
	│   ├── config/            # Pydantic settings + YAML loader
	│   ├── data/              # COCO loaders, preprocess, splits
	│   ├── tokenizer/         # CaptionTokenizer (Keras TextVectorization wrapper)
	│   ├── models/            # CNN encoder, Transformer decoder, factory
	│   ├── training/          # Trainer, losses, metrics, callbacks
	│   ├── inference/         # Greedy + beam search predictors
	│   ├── evaluation/        # BLEU, CIDEr, METEOR, ROUGE
	│   ├── io/                # Checkpoints, image decoding, HF Hub I/O
	│   └── utils/             # Logging, seeding, timing
	├── configs/               # YAML hyperparameters (validated by Pydantic)
	├── scripts/               # CLI entrypoints (train, eval, predict, upload)
	├── models/                # Local checkpoint registry (gitignored content)
	├── backend/               # FastAPI service (depends on src/captioning)
	├── frontend/              # Next.js 14 + TypeScript + Tailwind + shadcn/ui
	├── tests/                 # ML-core tests (unit + integration)
	├── docs/                  # Architecture, ADRs, results, deployment
	├── .github/workflows/     # CI, CD, model-eval
	├── docker-compose.yml     # Local dev: backend + frontend + mlflow
	├── pyproject.toml         # Single source of truth for the package
	└── Makefile               # Discoverable command index
	```

	Key architectural rules:

	- `src/captioning/` is the ML core; `backend/app/` imports from it. Never
	reverse the dependency.
	- The IEEE notebook is frozen — `make freeze-paper-notebook` is a CI
	check that fails on any byte change.
	- Model weights are never committed; they live in HuggingFace Hub
	(`yourname/captioning-weights`) and are downloaded at backend startup.
	- Configuration is YAML files validated by Pydantic v2 `BaseSettings` (which
	ships in the separate `pydantic-settings` package), not Hydra. Environment
	variables override YAML values via `CAPTIONING__TRAIN__BATCH_SIZE=32` syntax.
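
The environment-override rule can be illustrated with the standard library alone. The sketch below is not the package's loader: the helper name `apply_env_overrides`, the sample keys, and the assumption that `CAPTIONING__` is a prefix followed by `__`-separated nesting are all inferred from the single example variable above. In the real package, `pydantic-settings` would presumably handle this through its `env_prefix` and `env_nested_delimiter` settings.

```python
import copy
import json
from typing import Any, Mapping

PREFIX = "CAPTIONING__"  # assumed prefix, inferred from the example variable name
DELIMITER = "__"         # assumed nesting delimiter


def apply_env_overrides(config: dict[str, Any], environ: Mapping[str, str]) -> dict[str, Any]:
    """Return a copy of `config` with CAPTIONING__A__B=value overrides applied."""
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        path = name[len(PREFIX):].lower().split(DELIMITER)
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})  # walk (or create) nested sections
        try:
            node[path[-1]] = json.loads(raw)  # "32" -> 32, "true" -> True
        except json.JSONDecodeError:
            node[path[-1]] = raw              # keep plain strings unchanged
    return merged


# Hypothetical values standing in for a parsed configs/base.yaml.
yaml_values = {"train": {"batch_size": 64, "epochs": 10}}
overridden = apply_env_overrides(yaml_values, {"CAPTIONING__TRAIN__BATCH_SIZE": "32"})
print(overridden)
```

Passing `os.environ` as the second argument would apply whatever overrides are set in the current shell; the original dictionary is left untouched.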

	---

	## 2. Migration Strategy

	Approach: verbatim refactor first, improvements second. Reproducibility
	of the IEEE BLEU score is non-negotiable; behaviour parity must be proven
	before any improvement is made.

	### Phase 1a — "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)

	| Step | Notebook cell | Target module |
	|---|---|---|
	| 1 | Hyperparams | `configs/base.yaml` + `src/captioning/config/schema.py` |
	| 2 | Caption preprocess | `data/preprocess.py::preprocess_caption` |
	| 3 | COCO loader | `data/coco.py::load_coco_annotations` |
	| 4 | Tokenizer | `tokenizer/vectorizer.py::CaptionTokenizer` |
	| 5 | Splits | `data/splits.py::make_splits(seed=...)` |
	| 6 | Image preprocess | `data/preprocess.py::preprocess_image` |
	| 7 | tf.data pipeline | `data/pipeline.py::build_{train,val}_pipeline` |
	| 8 | Augmentation | `data/augmentation.py::default_augmentation` |
	| 9 | InceptionV3 encoder | `models/encoder_cnn.py` |
	| 10 | Transformer encoder | `models/transformer_encoder.py` |
	| 11 | Embeddings | `models/embeddings.py` |
	| 12 | Transformer decoder | `models/transformer_decoder.py` |
	| 13 | Captioning model | `models/captioning_model.py` |
	| 14 | Wiring | `models/factory.py::build_caption_model(config)` |
	| 15 | Loss + compile | `training/losses.py` + `training/trainer.py` |
	| 16 | Fit | `training/trainer.py::Trainer.fit` |
	| 17 | Inference | `inference/greedy.py`, `inference/predictor.py` |
	| 18 | Save weights | `io/checkpoints.py` + `scripts/train.py` |
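
Step 5 takes an explicit seed so that splits are reproducible. The sketch below only illustrates that idea: the `val_fraction` parameter, the return shape, and the sort-then-shuffle logic are assumptions. For parity, the real `make_splits` would have to replicate whatever the notebook cell does, not this.

```python
import random
from typing import Sequence


def make_splits(image_ids: Sequence[str], seed: int, val_fraction: float = 0.2):
    """Deterministically split image ids into (train, val) lists."""
    ids = sorted(image_ids)           # sort first so input order cannot change the result
    random.Random(seed).shuffle(ids)  # local RNG; the global random state is untouched
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]


# Hypothetical image ids; the same seed yields the same split regardless of input order.
ids = [f"img_{i:04d}" for i in range(10)]
train_a, val_a = make_splits(ids, seed=42)
train_b, val_b = make_splits(list(reversed(ids)), seed=42)
print(train_a == train_b, val_a == val_b)
```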

	### Parity validation gate

	`scripts/notebook_module_audit.py` runs both pipelines on a fixed 100-image
	fixture and asserts:

	- Tokenizer vocabulary identical (set equality).
	- Image preprocessing tensor-equal (`np.allclose`, atol=1e-5).
	- Model output logits equal at fixed weights (atol=1e-4).
	- Captions on 20 fixed images byte-equal between notebook and module path.
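
The four assertions above could be expressed roughly as follows. This is a standard-library sketch with hypothetical helper names, not the contents of `scripts/notebook_module_audit.py`. A real audit would likely call `np.allclose` on tensors; note that `np.allclose` also applies a default relative tolerance (`rtol=1e-5`), so the purely absolute check here is slightly stricter.

```python
import math
from typing import Iterable, Sequence


def vocab_equal(notebook_vocab: Iterable[str], module_vocab: Iterable[str]) -> bool:
    """Set equality: token order and duplicates are ignored."""
    return set(notebook_vocab) == set(module_vocab)


def all_close(a: Sequence[float], b: Sequence[float], atol: float) -> bool:
    """Element-wise |a - b| <= atol over two flat sequences of equal length."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=0.0, abs_tol=atol) for x, y in zip(a, b))


def captions_equal(notebook_caps: Sequence[str], module_caps: Sequence[str]) -> bool:
    """Byte-level equality of every caption pair, in order."""
    if len(notebook_caps) != len(module_caps):
        return False
    return all(x.encode("utf-8") == y.encode("utf-8")
               for x, y in zip(notebook_caps, module_caps))


# Toy stand-ins for real pipeline outputs.
assert vocab_equal(["a", "dog", "<start>"], ["<start>", "a", "dog"])
assert all_close([0.10, 0.20], [0.10 + 5e-6, 0.20], atol=1e-5)   # image tensors
assert all_close([1.50, -2.00], [1.50005, -2.00], atol=1e-4)     # logits
assert captions_equal(["a dog on a beach"], ["a dog on a beach"])
print("parity checks passed")
```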

	### Phase 1b — Quality improvements (only after parity is green)

	1. Masked accuracy metric (notebook tracks loss only).
	2. Beam search inference.
	3. Warmup + cosine LR schedule (replaces Adam's fixed learning rate).
	4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
	5. `vocab.json` sidecar alongside `vocab.pkl`.
	6. Label smoothing.

	---

	## 3. Implementation Roadmap

	| Phase | Deliverable | Effort | Recruiter signal |
	|---|---|---|---|
	| 0 | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
	| 1 | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via `docker compose up` |
	| 2 | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
	| 3 | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
	| 4 | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |

	### Future work (out of scope for v1)

	- Tier 2: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
	- Tier 3: Anthropic Claude vision endpoint as a "Frontier" tab.
	- Tier 4: VQA "Ask the image" extension reusing Tier 3 infra.
	- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.

	---

	## 4. Deployment Stack (free-tier)

	| Layer | Service | Why |
	|---|---|---|
	| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
	| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
	| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
	| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
	| Errors | Sentry free (5k errors/mo) | |
	| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
	| Domain | None (use `.hf.space` and `.vercel.app`) | $0 budget |

	---

	## 5. Trade-offs Decided

	| Decision | Alternative rejected | Reason |
	|---|---|---|
	| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
	| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
	| TanStack Query | Redux | Server state belongs in a server-state lib |
	| YAML + Pydantic | Hydra | Hydra is overkill for 1–3 active configs |
	| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
	| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
	| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
	| `tensorflow-cpu==2.15.0` pinned | Floating TF | TF 2.16 broke Keras 2 compat with notebook |
	| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
	| Multipart uploads | Base64 in JSON | 33% overhead, no streaming |
	| `--workers 1` uvicorn | Multi-worker | TF graph + InceptionV3 ×N OOMs |
	| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
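
The 33% figure in the multipart-versus-Base64 row follows from how Base64 works: every 3 input bytes become 4 output characters. A quick check, using random bytes as a stand-in for an image payload:

```python
import base64
import os

payload = os.urandom(3_000_000)      # stand-in for a ~3 MB JPEG
encoded = base64.b64encode(payload)  # 4 output bytes per 3 input bytes
overhead = len(encoded) / len(payload) - 1

print(f"raw={len(payload)} bytes, base64={len(encoded)} bytes, overhead={overhead:.1%}")
```

JSON string quoting and any line wrapping would add a little more on top of this ratio.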

	---

	## 6. Verification Plan

	Phase 1:

	- `pytest tests/ -v` → all green; coverage ≥ 70% on `src/captioning/`.
	- `python scripts/notebook_module_audit.py` → parity assertions all pass.
	- `docker compose up` → `curl -F "file=@sample.jpg" http://localhost:8000/v1/captions`
	returns valid caption JSON.
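
For scripted smoke tests, the `curl -F` call above can be approximated in Python without extra dependencies. This is a sketch: the helper names are hypothetical, the form field name `file` and the URL are taken from the `curl` command, and the shape of the JSON response is not assumed.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes, mime: str = "image/jpeg"):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def request_caption(image_path: str, url: str = "http://localhost:8000/v1/captions"):
    """POST an image to the captions endpoint and return the parsed JSON response."""
    with open(image_path, "rb") as fh:
        body, content_type = build_multipart("file", os.path.basename(image_path), fh.read())
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

With the stack running, `request_caption("sample.jpg")` would return whatever JSON the backend emits for that image.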

	Phase 2:

	- GitHub Actions `ci.yml` green on a PR.
	- HF Space URL serves `/v1/model/info`.
	- Vercel preview URL renders frontend; uploading a sample image returns a caption.

	Phase 3:

	- `GET /v1/models` returns 4 entries.
	- `POST /v1/compare` returns 4 captions; total latency < 15s on HF Space CPU.
	- `model-eval.yml` posts a BLEU comparison comment on a test PR.

	Phase 4:

	- `/metrics` exposes `caption_inference_seconds` histogram.
	- DagsHub MLflow link shows ≥ 1 logged run with metrics.
	- `make freeze-paper-notebook` fails when notebook bytes change; passes when restored.
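
A byte-level freeze check of the kind `make freeze-paper-notebook` describes can be built on a pinned SHA-256 digest. The sketch below shows the idea on a temporary file. The function names are hypothetical, and the plan does not say how or where the real target stores its reference hash.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(notebook: Path, expected_sha256: str) -> bool:
    """True only if the file's bytes still match the pinned digest."""
    return sha256_of(notebook) == expected_sha256.strip().lower()


# Demonstration on a throwaway file: even a single added space is detected.
with tempfile.TemporaryDirectory() as tmp:
    nb = Path(tmp) / "paper.ipynb"
    nb.write_bytes(b'{"cells": []}')
    pinned = sha256_of(nb)
    unchanged_ok = notebook_is_frozen(nb, pinned)
    nb.write_bytes(b'{"cells": [] }')
    changed_ok = notebook_is_frozen(nb, pinned)

print(unchanged_ok, changed_ok)
```

In CI, the pinned digest would be committed alongside the notebook and the job would exit non-zero when `notebook_is_frozen` returns `False`.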