Production Restructuring Plan

Public, in-repo copy of the engineering plan that drives the transition from a single-notebook research project into a deployable multimodal AI platform. The original (with internal exploration notes) lives in the developer's ~/.claude/plans/ directory; this version is the canonical public artefact.

Context

This repository is the engineering home of an IEEE-published image-captioning research project. The published artefact is a single Jupyter notebook (notebooks/01_ieee_inceptionv3_transformer.ipynb) that pairs a frozen InceptionV3 encoder with a custom Keras Transformer decoder, trained on COCO 2017 and reporting a BLEU score of roughly 24.

Goal: convert the repo into a recruiter-grade, production-style multimodal AI platform with a live free-tier demo, while preserving the IEEE notebook byte-for-byte as the canonical research artefact.

Constraints:

  • Hosting budget: $0/month → HuggingFace Spaces (backend) + Vercel free (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
  • Multimodal scope (v1): Tier 1 only. Add three pretrained HuggingFace models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison demo. Tier 2/3/4 are listed under Future work only.

1. Folder Structure (target)

image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
├── src/captioning/                             # Installable Python package
│   ├── config/                                 # Pydantic settings + YAML loader
│   ├── data/                                   # COCO loaders, preprocess, splits
│   ├── tokenizer/                              # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/                                 # CNN encoder, Transformer decoder, factory
│   ├── training/                               # Trainer, losses, metrics, callbacks
│   ├── inference/                              # Greedy + beam search predictors
│   ├── evaluation/                             # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                                     # Checkpoints, image decoding, HF Hub I/O
│   └── utils/                                  # Logging, seeding, timing
├── configs/                                    # YAML hyperparameters (validated by Pydantic)
├── scripts/                                    # CLI entrypoints (train, eval, predict, upload)
├── models/                                     # Local checkpoint registry (gitignored content)
├── backend/                                    # FastAPI service (depends on src/captioning)
├── frontend/                                   # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                                      # ML-core tests (unit + integration)
├── docs/                                       # Architecture, ADRs, results, deployment
├── .github/workflows/                          # CI, CD, model-eval
├── docker-compose.yml                          # Local dev: backend + frontend + mlflow
├── pyproject.toml                              # Single source of truth for the package
└── Makefile                                    # Discoverable command index

Key architectural rules:

  • src/captioning/ is the ML core; backend/app/ imports from it. Never reverse the dependency.
  • The IEEE notebook is frozen: make freeze-paper-notebook is a CI check that fails on any byte change.
  • Model weights are never committed; they live in HuggingFace Hub (yourname/captioning-weights) and are downloaded at backend startup.
  • Configuration is YAML files validated by Pydantic v2 BaseSettings, not Hydra. Env vars override via CAPTIONING__TRAIN__BATCH_SIZE=32 syntax.
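The environment-override rule can be illustrated with a small stdlib-only sketch. It is not the project's loader: the plan uses Pydantic v2 BaseSettings, where pydantic-settings provides the `env_prefix` and `env_nested_delimiter` options for this purpose. The function name `apply_env_overrides`, the `CAPTIONING__` prefix handling and the base values (64, 10) are assumptions made for illustration only.

```python
import copy

PREFIX = "CAPTIONING__"  # assumed prefix, inferred from the example variable name
DELIMITER = "__"         # assumed delimiter between nesting levels


def apply_env_overrides(config: dict, environ: dict) -> dict:
    """Return a copy of `config` with PREFIX-ed variables applied as nested overrides."""
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue
        # CAPTIONING__TRAIN__BATCH_SIZE -> ["train", "batch_size"]
        path = name[len(PREFIX):].lower().split(DELIMITER)
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})
        leaf = path[-1]
        current = node.get(leaf)
        # Coerce to the type of the YAML value when it is a plain number.
        is_number = isinstance(current, (int, float)) and not isinstance(current, bool)
        node[leaf] = type(current)(raw) if is_number else raw
    return merged


# Hypothetical values standing in for a parsed configs/base.yaml.
base = {"train": {"batch_size": 64, "epochs": 10}}
cfg = apply_env_overrides(base, {"CAPTIONING__TRAIN__BATCH_SIZE": "32"})
print(cfg["train"]["batch_size"])  # 32
```

In the real package the type coercion and validation would come from the Pydantic schema rather than from the hand-written check above.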

2. Migration Strategy

Approach: verbatim refactor first, improvements second. Reproducibility of the IEEE BLEU score is non-negotiable; behaviour parity must be proven before any improvement is made.

Phase 1a: "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)

| Step | Notebook cell | Target module |
|------|---------------|---------------|
| 1 | Hyperparams | configs/base.yaml + src/captioning/config/schema.py |
| 2 | Caption preprocess | data/preprocess.py::preprocess_caption |
| 3 | COCO loader | data/coco.py::load_coco_annotations |
| 4 | Tokenizer | tokenizer/vectorizer.py::CaptionTokenizer |
| 5 | Splits | data/splits.py::make_splits(seed=...) |
| 6 | Image preprocess | data/preprocess.py::preprocess_image |
| 7 | tf.data pipeline | data/pipeline.py::build_{train,val}_pipeline |
| 8 | Augmentation | data/augmentation.py::default_augmentation |
| 9 | InceptionV3 encoder | models/encoder_cnn.py |
| 10 | Transformer encoder | models/transformer_encoder.py |
| 11 | Embeddings | models/embeddings.py |
| 12 | Transformer decoder | models/transformer_decoder.py |
| 13 | Captioning model | models/captioning_model.py |
| 14 | Wiring | models/factory.py::build_caption_model(config) |
| 15 | Loss + compile | training/losses.py + training/trainer.py |
| 16 | Fit | training/trainer.py::Trainer.fit |
| 17 | Inference | inference/greedy.py, inference/predictor.py |
| 18 | Save weights | io/checkpoints.py + scripts/train.py |
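Step 5 depends on the split being reproducible from a seed. A minimal sketch of one way to get that property is shown below; the signature beyond `seed`, the `val_fraction` parameter and the seed value used in the usage lines are hypothetical, and the notebook's actual split logic may differ.

```python
import random


def make_splits(image_ids, seed, val_fraction=0.1):
    """Deterministically split ids into (train, val) lists.

    `val_fraction` is an illustrative parameter, not taken from the notebook.
    """
    ids = sorted(image_ids)           # canonical order, so input ordering does not matter
    random.Random(seed).shuffle(ids)  # private RNG: global random state is untouched
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]


# Usage with made-up ids and an arbitrary seed.
ids = [f"img_{i:03d}" for i in range(100)]
train_a, val_a = make_splits(ids, seed=0)
train_b, val_b = make_splits(reversed(ids), seed=0)
print((train_a, val_a) == (train_b, val_b))  # True
```

Sorting before shuffling is the design choice that matters here: it makes the result depend only on the set of ids and the seed, not on the order in which the annotation file happens to list them.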

Parity validation gate

scripts/notebook_module_audit.py runs both pipelines on a fixed 100-image fixture and asserts:

  • Tokenizer vocabulary identical (set equality).
  • Image preprocessing tensor-equal (np.allclose, atol=1e-5).
  • Model output logits equal at fixed weights (atol=1e-4).
  • Captions on 20 fixed images byte-equal between notebook and module path.
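The four assertions above could be organised roughly as follows. This is a stdlib-only sketch, not the contents of scripts/notebook_module_audit.py: `all_close` is a stand-in that applies the element-wise rule documented for np.allclose (`|a - b| <= atol + rtol * |b|`) to flat lists, and the dictionary keys are hypothetical.

```python
def all_close(a, b, atol, rtol=1e-5):
    """Stand-in for np.allclose on flat sequences of floats."""
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )


def check_parity(notebook, module):
    """Raise AssertionError on the first mismatch between the two pipelines.

    Both arguments are dicts with hypothetical keys:
    "vocab", "pixels", "logits", "captions".
    """
    assert set(notebook["vocab"]) == set(module["vocab"]), "vocabulary differs"
    assert all_close(notebook["pixels"], module["pixels"], atol=1e-5), "image preprocessing differs"
    assert all_close(notebook["logits"], module["logits"], atol=1e-4), "logits differ"
    nb_bytes = [c.encode("utf-8") for c in notebook["captions"]]
    mod_bytes = [c.encode("utf-8") for c in module["captions"]]
    assert nb_bytes == mod_bytes, "captions differ"


# Made-up artefacts; a real audit would collect these from both code paths.
reference = {
    "vocab": ["<start>", "a", "dog", "<end>"],
    "pixels": [0.0, 0.5, 1.0],
    "logits": [1.25, -0.75, 3.0],
    "captions": ["a dog"],
}
check_parity(reference, dict(reference))  # passes silently
```

Ordering the checks from tokenizer to captions means the first failure points at the earliest stage that diverged, which is usually the cheapest place to start debugging.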

Phase 1b: Quality improvements (only after parity is green)

  1. Masked accuracy metric (notebook tracks loss only).
  2. Beam search inference.
  3. Warmup + cosine LR schedule (replaces bare Adam).
  4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
  5. vocab.json sidecar alongside vocab.pkl.
  6. Label smoothing.

3. Implementation Roadmap

| Phase | Deliverable | Effort | Recruiter signal |
|-------|-------------|--------|------------------|
| 0 | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| 1 | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via docker compose up |
| 2 | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| 3 | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| 4 | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |

Future work (out of scope for v1)

  • Tier 2: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
  • Tier 3: Anthropic Claude vision endpoint as a "Frontier" tab.
  • Tier 4: VQA "Ask the image" extension reusing Tier 3 infra.
  • Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.

4. Deployment Stack (free-tier)

| Layer | Service | Why |
|-------|---------|-----|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use `*.hf.space` and `*.vercel.app`) | $0 budget |

5. Trade-offs Decided

| Decision | Alternative rejected | Reason |
|----------|----------------------|--------|
| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1–3 active configs |
| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| tensorflow-cpu==2.15.0 pinned | Floating TF | TF 2.16 broke Keras 2 compat with notebook |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | 33% overhead, no streaming |
| --workers 1 uvicorn | Multi-worker | TF graph + InceptionV3 ×N OOMs |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
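The 33% figure in the Base64 row follows from the encoding itself: every 3 input bytes become 4 output characters. A short check, using a zero-filled buffer as a stand-in for an image file (the size is arbitrary):

```python
import base64
import math


def base64_size(n_bytes):
    """Length of standard padded Base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


payload = bytes(300_000)  # stand-in for a 300 kB JPEG
encoded = base64.b64encode(payload)
overhead = len(encoded) / len(payload) - 1

print(len(encoded))       # 400000
print(f"{overhead:.1%}")  # 33.3%
```

The measurement covers only the encoded image; JSON quoting and any other fields in the request body would add to it.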

6. Verification Plan

Phase 1:

  • pytest tests/ -v → all green; coverage ≥ 70% on src/captioning/.
  • python scripts/notebook_module_audit.py → parity assertions all pass.
  • docker compose up → curl -F "file=@sample.jpg" http://localhost:8000/v1/captions returns valid caption JSON.
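For scripted smoke tests, the curl call in the last check can be approximated from Python without third-party packages. The sketch below assumes the endpoint accepts a single multipart field named `file`, as the curl command suggests; the helper names are hypothetical and the shape of the response JSON is not specified in this plan, so it is returned unparsed beyond `json.loads`.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(field, filename, data, content_type="image/jpeg"):
    """Build a one-part multipart/form-data body; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def request_caption(url, image_path):
    """POST an image file to `url` and return the decoded JSON response."""
    with open(image_path, "rb") as fh:
        body, ctype = encode_multipart("file", os.path.basename(image_path), fh.read())
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": ctype}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

With the compose stack running, `request_caption("http://localhost:8000/v1/captions", "sample.jpg")` would mirror the curl command above.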

Phase 2:

  • GitHub Actions ci.yml green on a PR.
  • HF Space URL serves /v1/model/info.
  • Vercel preview URL renders frontend; uploading a sample image returns a caption.

Phase 3:

  • GET /v1/models returns 4 entries.
  • POST /v1/compare returns 4 captions; total latency < 15s on HF Space CPU.
  • model-eval.yml posts a BLEU comparison comment on a test PR.

Phase 4:

  • /metrics exposes caption_inference_seconds histogram.
  • DagsHub MLflow link shows ≥ 1 logged run with metrics.
  • make freeze-paper-notebook fails when notebook bytes change; passes when restored.
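A byte-level freeze check of the kind described in the last bullet can be built on a content hash. How the `freeze-paper-notebook` target is actually implemented is not stated in this plan; the sketch below assumes a SHA-256 digest recorded at freeze time and compared on every run, with hypothetical function names.

```python
import hashlib


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(notebook_path, expected_sha256):
    """True when the file's bytes still match the digest recorded at freeze time."""
    return sha256_of(notebook_path) == expected_sha256
```

A CI wrapper would exit non-zero when `notebook_is_frozen` returns False, which gives the fail-on-change, pass-on-restore behaviour the check asks for.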