# Production Restructuring Plan
> Public, in-repo copy of the engineering plan that drives the transition from
> a single-notebook research project into a deployable multimodal AI platform.
> The original (with internal exploration notes) lives in the developer's
> `~/.claude/plans/` directory; this version is the canonical public artefact.
## Context
This repository is the engineering home of an IEEE-published image-captioning
research project. The published artefact is a single Jupyter notebook
([`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb))
implementing **InceptionV3 (frozen) + custom Keras Transformer decoder**
trained on **COCO 2017**, reporting **BLEU ~24**.
**Goal**: convert the repo into a recruiter-grade, production-style
multimodal AI platform with a live free-tier demo, while **preserving the
IEEE notebook byte-for-byte** as the canonical research artefact.
**Constraints**:
- Hosting budget: **$0/month** → HuggingFace Spaces (backend) + Vercel free
  (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
- Multimodal scope (v1): **Tier 1 only**. Add three pretrained HuggingFace
  models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison
  demo. Tier 2/3/4 are listed under *Future work* only.
---
## 1. Folder Structure (target)
```
image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
├── src/captioning/           # Installable Python package
│   ├── config/               # Pydantic settings + YAML loader
│   ├── data/                 # COCO loaders, preprocess, splits
│   ├── tokenizer/            # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/               # CNN encoder, Transformer decoder, factory
│   ├── training/             # Trainer, losses, metrics, callbacks
│   ├── inference/            # Greedy + beam search predictors
│   ├── evaluation/           # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                   # Checkpoints, image decoding, HF Hub I/O
│   └── utils/                # Logging, seeding, timing
├── configs/                  # YAML hyperparameters (validated by Pydantic)
├── scripts/                  # CLI entrypoints (train, eval, predict, upload)
├── models/                   # Local checkpoint registry (gitignored content)
├── backend/                  # FastAPI service (depends on src/captioning)
├── frontend/                 # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                    # ML-core tests (unit + integration)
├── docs/                     # Architecture, ADRs, results, deployment
├── .github/workflows/        # CI, CD, model-eval
├── docker-compose.yml        # Local dev: backend + frontend + mlflow
├── pyproject.toml            # Single source of truth for the package
└── Makefile                  # Discoverable command index
```
**Key architectural rules**:
- `src/captioning/` is the ML core; `backend/app/` imports from it. Never
reverse the dependency.
- The IEEE notebook is **frozen**: `make freeze-paper-notebook` is a CI
  check that fails on any byte change.
- Model weights are **never committed**; they live in HuggingFace Hub
(`yourname/captioning-weights`) and are downloaded at backend startup.
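- Configuration lives in **YAML files validated by Pydantic v2 `BaseSettings`**
  (shipped in the separate `pydantic-settings` package since v2), not Hydra.
  Environment variables override YAML values via the
  `CAPTIONING__TRAIN__BATCH_SIZE=32` syntax.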
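
The override rule in the last bullet can be sketched with the standard library alone. This is only an illustration of the semantics, assuming a `CAPTIONING__` prefix and `__` as the nesting delimiter (both inferred from the example variable name); in `pydantic-settings` the same behaviour is normally configured through `env_prefix` and `env_nested_delimiter`. The function name `apply_env_overrides` and the type-coercion rule are hypothetical.

```python
import copy
import os
from typing import Any, Mapping

ENV_PREFIX = "CAPTIONING__"   # assumed from the example variable in this plan
NESTED_DELIMITER = "__"       # assumed nesting separator


def _coerce(raw: str, current: Any) -> Any:
    """Cast the env string to the type of the value it replaces (hypothetical rule)."""
    if isinstance(current, bool):          # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes"}
    if isinstance(current, int):
        return int(raw)
    if isinstance(current, float):
        return float(raw)
    return raw


def apply_env_overrides(
    config: dict[str, Any],
    environ: Mapping[str, str] = os.environ,
) -> dict[str, Any]:
    """Return a copy of `config` with CAPTIONING__SECTION__KEY overrides applied."""
    merged = copy.deepcopy(config)         # never mutate the YAML-derived dict
    for name, raw in environ.items():
        if not name.startswith(ENV_PREFIX):
            continue
        *parents, leaf = name[len(ENV_PREFIX):].lower().split(NESTED_DELIMITER)
        node = merged
        for key in parents:                # walk (or create) the nested sections
            child = node.get(key)
            if not isinstance(child, dict):
                child = node[key] = {}
            node = child
        node[leaf] = _coerce(raw, node.get(leaf))
    return merged
```

With a YAML-derived dict such as `{"train": {"batch_size": 64}}`, setting `CAPTIONING__TRAIN__BATCH_SIZE=32` yields `batch_size == 32` while the original dict stays untouched.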
---
## 2. Migration Strategy
**Approach: verbatim refactor first, improvements second.** Reproducibility
of the IEEE BLEU score is non-negotiable; behaviour parity must be proven
*before* any improvement is made.
### Phase 1a: "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)
| Step | Notebook cell | Target module |
|---|---|---|
| 1 | Hyperparams | `configs/base.yaml` + `src/captioning/config/schema.py` |
| 2 | Caption preprocess | `data/preprocess.py::preprocess_caption` |
| 3 | COCO loader | `data/coco.py::load_coco_annotations` |
| 4 | Tokenizer | `tokenizer/vectorizer.py::CaptionTokenizer` |
| 5 | Splits | `data/splits.py::make_splits(seed=...)` |
| 6 | Image preprocess | `data/preprocess.py::preprocess_image` |
| 7 | tf.data pipeline | `data/pipeline.py::build_{train,val}_pipeline` |
| 8 | Augmentation | `data/augmentation.py::default_augmentation` |
| 9 | InceptionV3 encoder | `models/encoder_cnn.py` |
| 10 | Transformer encoder | `models/transformer_encoder.py` |
| 11 | Embeddings | `models/embeddings.py` |
| 12 | Transformer decoder | `models/transformer_decoder.py` |
| 13 | Captioning model | `models/captioning_model.py` |
| 14 | Wiring | `models/factory.py::build_caption_model(config)` |
| 15 | Loss + compile | `training/losses.py` + `training/trainer.py` |
| 16 | Fit | `training/trainer.py::Trainer.fit` |
| 17 | Inference | `inference/greedy.py`, `inference/predictor.py` |
| 18 | Save weights | `io/checkpoints.py` + `scripts/train.py` |
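
Step 5 depends on the split being a pure function of the seed. The sketch below shows that property only; the notebook's actual split logic has to be ported verbatim for parity. The fractions, the return shape, and the use of `random.Random` are assumptions, and only the `make_splits(seed=...)` name comes from the table above.

```python
import random
from typing import Sequence


def make_splits(
    image_ids: Sequence[str],
    seed: int,
    val_fraction: float = 0.1,    # hypothetical default
    test_fraction: float = 0.1,   # hypothetical default
) -> dict[str, list[str]]:
    """Deterministically partition image IDs into train/val/test."""
    ids = sorted(image_ids)            # canonical order, so input ordering cannot leak in
    random.Random(seed).shuffle(ids)   # private RNG; global random state is untouched
    n_val = int(len(ids) * val_fraction)
    n_test = int(len(ids) * test_fraction)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }
```

Calling it twice with the same seed, even on differently ordered inputs, returns identical splits, which is what lets the parity audit compare like with like.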
### Parity validation gate
`scripts/notebook_module_audit.py` runs both pipelines on a fixed 100-image
fixture and asserts:
- Tokenizer vocabulary identical (set equality).
- Image preprocessing tensor-equal (`np.allclose`, atol=1e-5).
- Model output logits equal at fixed weights (atol=1e-4).
- Captions on 20 fixed images byte-equal between notebook and module path.
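
The four assertions can be sketched as follows. This is a stdlib stand-in, not the contents of `scripts/notebook_module_audit.py`: the dictionary layout (`vocab`, `image`, `logits`, `captions`) is hypothetical, tensors are flattened to plain lists, and `all_close` applies an absolute tolerance only.

```python
import math
from typing import Sequence


def all_close(a: Sequence[float], b: Sequence[float], atol: float) -> bool:
    """Element-wise |a - b| <= atol on two flat sequences of equal length."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=atol) for x, y in zip(a, b)
    )


def assert_parity(notebook: dict, module: dict) -> None:
    """Raise AssertionError on the first parity check that fails."""
    # 1. Same set of tokens (order is deliberately not compared here).
    assert set(notebook["vocab"]) == set(module["vocab"]), "vocabulary differs"
    # 2. Preprocessed image tensors agree within 1e-5.
    assert all_close(notebook["image"], module["image"], atol=1e-5), "image differs"
    # 3. Logits at fixed weights agree within 1e-4.
    assert all_close(notebook["logits"], module["logits"], atol=1e-4), "logits differ"
    # 4. Captions are byte-for-byte identical.
    ref = [c.encode("utf-8") for c in notebook["captions"]]
    new = [c.encode("utf-8") for c in module["captions"]]
    assert ref == new, "captions differ"
```

Two caveats when porting this to NumPy: `np.allclose` also applies a default `rtol=1e-5` on top of `atol` unless it is overridden, and set equality does not verify that tokens keep the same indices.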
### Phase 1b: Quality improvements (only after parity is green)
1. Masked accuracy metric (notebook tracks loss only).
2. Beam search inference.
3. Warmup + cosine LR schedule (replaces bare Adam).
4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
5. `vocab.json` sidecar alongside `vocab.pkl`.
6. Label smoothing.
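
As an illustration of item 3, a warmup + cosine schedule can be written as a pure function of the step. This is a framework-free sketch, assuming linear warmup to a peak rate followed by cosine decay to `min_lr`; the function name and all parameter names are hypothetical, and in the real trainer the same formula would sit inside a Keras learning-rate schedule.

```python
import math


def warmup_cosine_lr(
    step: int,
    peak_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step` (0-indexed) for linear warmup + cosine decay."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine goes from 1 (progress=0) to -1 (progress=1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate climbs linearly during warmup, sits at `peak_lr` when decay starts, and ends at `min_lr` once `step` reaches `total_steps`.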
---
## 3. Implementation Roadmap
| Phase | Deliverable | Effort | Recruiter signal |
|---|---|---|---|
| **0** | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| **1** | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via `docker compose up` |
| **2** | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| **3** | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| **4** | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |
### Future work (out of scope for v1)
- **Tier 2**: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
- **Tier 3**: Anthropic Claude vision endpoint as a "Frontier" tab.
- **Tier 4**: VQA "Ask the image" extension reusing Tier 3 infra.
- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.
---
## 4. Deployment Stack (free-tier)
| Layer | Service | Why |
|---|---|---|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use `*.hf.space` and `*.vercel.app`) | $0 budget |
---
## 5. Trade-offs Decided
| Decision | Alternative rejected | Reason |
|---|---|---|
| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1–3 active configs |
| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| `tensorflow-cpu==2.15.0` pinned | Floating TF | TF 2.16 switched the default to Keras 3, breaking the notebook's Keras 2 code |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | Base64 adds ~33% payload overhead and cannot be streamed |
| `--workers 1` uvicorn | Multi-worker | Each worker loads its own TF graph + InceptionV3; ×N copies OOM |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
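
The 33% figure in the multipart row follows from Base64 mapping every 3 input bytes to 4 output characters. A quick stdlib check, with the helper name `base64_overhead` introduced here purely for illustration:

```python
import base64


def base64_overhead(n_bytes: int) -> float:
    """Fractional size increase when n_bytes of binary data are Base64-encoded."""
    payload = bytes(n_bytes)                 # zero-filled stand-in for image bytes
    encoded = base64.b64encode(payload)      # 3 raw bytes -> 4 ASCII characters
    return len(encoded) / len(payload) - 1.0


# A 3 MB image grows to 4 MB once embedded in a JSON string.
overhead = base64_overhead(3_000_000)
```

JSON escaping and the need to buffer the whole string before decoding come on top of this, which is why the plan sends raw bytes as multipart form data instead.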
---
## 6. Verification Plan
**Phase 1**:
- `pytest tests/ -v` → all green; coverage ≥ 70% on `src/captioning/`.
- `python scripts/notebook_module_audit.py` → parity assertions all pass.
- `docker compose up` → `curl -F "file=@sample.jpg" http://localhost:8000/v1/captions`
  returns valid caption JSON.
**Phase 2**:
- GitHub Actions `ci.yml` green on a PR.
- HF Space URL serves `/v1/model/info`.
- Vercel preview URL renders frontend; uploading a sample image returns a caption.
**Phase 3**:
- `GET /v1/models` returns 4 entries.
- `POST /v1/compare` returns 4 captions; total latency < 15s on HF Space CPU.
- `model-eval.yml` posts a BLEU comparison comment on a test PR.
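
The `/v1/compare` latency check could be driven by logic like the following. It is a sketch under several assumptions: models run sequentially (consistent with the single-worker CPU setup), each model is reduced to a callable from image bytes to a caption, and the response shape (`captions`, `total_seconds`, `within_budget`) is hypothetical rather than the actual API contract.

```python
import time
from typing import Callable

Captioner = Callable[[bytes], str]   # hypothetical interface: image bytes -> caption


def compare_captioners(
    image: bytes,
    captioners: dict[str, Captioner],
    budget_s: float = 15.0,          # latency target from the Phase 3 checklist
) -> dict:
    """Run every captioner on one image and record per-model and total latency."""
    results = {}
    start = time.perf_counter()
    for name, caption_fn in captioners.items():
        t0 = time.perf_counter()
        text = caption_fn(image)
        results[name] = {"caption": text, "seconds": time.perf_counter() - t0}
    total = time.perf_counter() - start
    return {
        "captions": results,
        "total_seconds": total,
        "within_budget": total < budget_s,
    }
```

Recording per-model timings alongside the total makes it easy to see which of the four models dominates when the 15 s target is missed.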
**Phase 4**:
- `/metrics` exposes `caption_inference_seconds` histogram.
- DagsHub MLflow link shows ≥ 1 logged run with metrics.
- `make freeze-paper-notebook` fails when notebook bytes change; passes when restored.
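
One plausible way to implement the byte-level freeze check behind `make freeze-paper-notebook` is to compare the notebook's SHA-256 digest with a pinned value. This is a sketch, not the repository's actual Makefile target; the function names and the idea of storing the digest as a pinned string are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(path: Path, expected_sha256: str) -> bool:
    """True only if the file's bytes still hash to the pinned digest."""
    return sha256_of(path) == expected_sha256
```

A CI step would call `notebook_is_frozen` with the committed digest and exit non-zero on `False`. Restoring the original bytes makes the check pass again, matching the last bullet above.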