Production Restructuring Plan
Public, in-repo copy of the engineering plan that drives the transition from a single-notebook research project into a deployable multimodal AI platform. The original (with internal exploration notes) lives in the developer's
~/.claude/plans/ directory; this version is the canonical public artefact.
Context
This repository is the engineering home of an IEEE-published image-captioning
research project. The published artefact is a single Jupyter notebook
(notebooks/01_ieee_inceptionv3_transformer.ipynb)
implementing InceptionV3 (frozen) + custom Keras Transformer decoder
trained on COCO 2017, reporting BLEU ~24.
Goal: convert the repo into a recruiter-grade, production-style multimodal AI platform with a live free-tier demo, while preserving the IEEE notebook byte-for-byte as the canonical research artefact.
Constraints:
- Hosting budget: $0/month, covered by HuggingFace Spaces (backend) + Vercel free (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
- Multimodal scope (v1): Tier 1 only. Add three pretrained HuggingFace models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison demo. Tier 2/3/4 are listed under Future work only.
1. Folder Structure (target)
image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb # FROZEN
├── src/captioning/ # Installable Python package
│   ├── config/ # Pydantic settings + YAML loader
│   ├── data/ # COCO loaders, preprocess, splits
│   ├── tokenizer/ # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/ # CNN encoder, Transformer decoder, factory
│   ├── training/ # Trainer, losses, metrics, callbacks
│   ├── inference/ # Greedy + beam search predictors
│   ├── evaluation/ # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/ # Checkpoints, image decoding, HF Hub I/O
│   └── utils/ # Logging, seeding, timing
├── configs/ # YAML hyperparameters (validated by Pydantic)
├── scripts/ # CLI entrypoints (train, eval, predict, upload)
├── models/ # Local checkpoint registry (gitignored content)
├── backend/ # FastAPI service (depends on src/captioning)
├── frontend/ # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/ # ML-core tests (unit + integration)
├── docs/ # Architecture, ADRs, results, deployment
├── .github/workflows/ # CI, CD, model-eval
├── docker-compose.yml # Local dev: backend + frontend + mlflow
├── pyproject.toml # Single source of truth for the package
└── Makefile # Discoverable command index
Key architectural rules:
- src/captioning/ is the ML core; backend/app/ imports from it. Never reverse the dependency.
- The IEEE notebook is frozen: make freeze-paper-notebook is a CI check that fails on any byte change.
- Model weights are never committed; they live in HuggingFace Hub (yourname/captioning-weights) and are downloaded at backend startup.
- Configuration is YAML files validated by Pydantic v2 BaseSettings, not Hydra. Env vars override via CAPTIONING__TRAIN__BATCH_SIZE=32 syntax.
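The env-var override convention in the last rule can be illustrated without Pydantic. The sketch below is a stdlib-only approximation of how a double-underscore name such as CAPTIONING__TRAIN__BATCH_SIZE might map onto a nested config loaded from YAML. The helper name `apply_env_overrides`, the default values, and the type-coercion rule are assumptions made for illustration; the real package would presumably delegate this to Pydantic settings.

```python
import copy


def apply_env_overrides(config, environ, prefix="CAPTIONING", delimiter="__"):
    """Return a copy of `config` with PREFIX__SECTION__KEY overrides applied.

    Hypothetical helper: it mimics the naming convention only and is not
    the Pydantic implementation.
    """
    merged = copy.deepcopy(config)
    marker = prefix + delimiter
    for name, raw in environ.items():
        if not name.startswith(marker):
            continue  # unrelated environment variable
        # CAPTIONING__TRAIN__BATCH_SIZE -> ["train", "batch_size"]
        path = [part.lower() for part in name[len(marker):].split(delimiter)]
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})
        leaf = path[-1]
        current = node.get(leaf)
        # Coerce the string to the type of the existing default, if any.
        if isinstance(current, bool):
            node[leaf] = raw.strip().lower() in {"1", "true", "yes"}
        elif isinstance(current, (int, float)):
            node[leaf] = type(current)(raw)
        else:
            node[leaf] = raw
    return merged


# Illustrative defaults standing in for a parsed configs/base.yaml.
yaml_defaults = {"train": {"batch_size": 64, "epochs": 10}}
env = {"CAPTIONING__TRAIN__BATCH_SIZE": "32", "PATH": "/usr/bin"}

effective = apply_env_overrides(yaml_defaults, env)
print(effective)  # {'train': {'batch_size': 32, 'epochs': 10}}
```

The defaults are copied before overriding, so the YAML-derived dictionary stays untouched and the effective config can be diffed against it when debugging a deployment.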
2. Migration Strategy
Approach: verbatim refactor first, improvements second. Reproducibility of the IEEE BLEU score is non-negotiable; behaviour parity must be proven before any improvement is made.
Phase 1a: "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)
| Step | Notebook cell | Target module |
|---|---|---|
| 1 | Hyperparams | configs/base.yaml + src/captioning/config/schema.py |
| 2 | Caption preprocess | data/preprocess.py::preprocess_caption |
| 3 | COCO loader | data/coco.py::load_coco_annotations |
| 4 | Tokenizer | tokenizer/vectorizer.py::CaptionTokenizer |
| 5 | Splits | data/splits.py::make_splits(seed=...) |
| 6 | Image preprocess | data/preprocess.py::preprocess_image |
| 7 | tf.data pipeline | data/pipeline.py::build_{train,val}_pipeline |
| 8 | Augmentation | data/augmentation.py::default_augmentation |
| 9 | InceptionV3 encoder | models/encoder_cnn.py |
| 10 | Transformer encoder | models/transformer_encoder.py |
| 11 | Embeddings | models/embeddings.py |
| 12 | Transformer decoder | models/transformer_decoder.py |
| 13 | Captioning model | models/captioning_model.py |
| 14 | Wiring | models/factory.py::build_caption_model(config) |
| 15 | Loss + compile | training/losses.py + training/trainer.py |
| 16 | Fit | training/trainer.py::Trainer.fit |
| 17 | Inference | inference/greedy.py, inference/predictor.py |
| 18 | Save weights | io/checkpoints.py + scripts/train.py |
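Step 5 depends on the split being a pure function of its seed, otherwise the module path and the notebook would train on different images. A minimal sketch of a seed-deterministic split follows; the function name `make_splits_sketch`, the `val_fraction` parameter, and the shuffle-then-slice procedure are assumptions for illustration. For parity, the real `make_splits` has to reproduce whatever procedure the notebook uses, which this sketch does not claim to do.

```python
import random


def make_splits_sketch(image_ids, seed, val_fraction=0.2):
    """Split ids into (train, val) deterministically for a given seed."""
    # Sort first so the result does not depend on the caller's ordering.
    ordered = sorted(image_ids)
    # A private Random instance avoids touching the global RNG state.
    random.Random(seed).shuffle(ordered)
    n_val = int(len(ordered) * val_fraction)
    return ordered[n_val:], ordered[:n_val]


ids = list(range(100))
train_a, val_a = make_splits_sketch(ids, seed=42)
train_b, val_b = make_splits_sketch(list(reversed(ids)), seed=42)

print(len(train_a), len(val_a))                # 80 20
print(train_a == train_b and val_a == val_b)   # True
```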
Parity validation gate
scripts/notebook_module_audit.py runs both pipelines on a fixed 100-image
fixture and asserts:
- Tokenizer vocabulary identical (set equality).
- Image preprocessing tensor-equal (np.allclose, atol=1e-5).
- Model output logits equal at fixed weights (atol=1e-4).
- Captions on 20 fixed images byte-equal between notebook and module path.
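The four assertions can be sketched as a single check that collects failures instead of stopping at the first one. This is a stdlib-only illustration, not the contents of scripts/notebook_module_audit.py: the dictionary layout, the `check_parity` and `all_close` names, and the toy values are assumptions, and `all_close` reimplements the tolerance rule that np.allclose documents (|a - b| <= atol + rtol * |b|) on flattened nested lists rather than on real tensors.

```python
def flatten(values):
    """Yield scalars from arbitrarily nested lists or tuples."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from flatten(value)
        else:
            yield value


def all_close(a, b, atol, rtol=1e-5):
    """Elementwise |a - b| <= atol + rtol * |b| over flattened inputs."""
    flat_a, flat_b = list(flatten(a)), list(flatten(b))
    if len(flat_a) != len(flat_b):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(flat_a, flat_b))


def check_parity(notebook, module):
    """Return the names of the parity checks that fail (empty list = green)."""
    failures = []
    if set(notebook["vocab"]) != set(module["vocab"]):
        failures.append("vocabulary")
    if not all_close(notebook["image"], module["image"], atol=1e-5):
        failures.append("image preprocessing")
    if not all_close(notebook["logits"], module["logits"], atol=1e-4):
        failures.append("logits")
    # Byte equality: compare the encoded captions, not normalised text.
    nb_bytes = [c.encode("utf-8") for c in notebook["captions"]]
    mod_bytes = [c.encode("utf-8") for c in module["captions"]]
    if nb_bytes != mod_bytes:
        failures.append("captions")
    return failures


# Toy outputs standing in for the two pipelines.
notebook_out = {"vocab": ["<start>", "a", "dog", "<end>"],
                "image": [[0.1, 0.2], [0.3, 0.4]],
                "logits": [1.0, -2.0],
                "captions": ["a dog"]}
module_out = {"vocab": ["a", "<end>", "dog", "<start>"],
              "image": [[0.1, 0.2000001], [0.3, 0.4]],
              "logits": [1.00001, -2.0],
              "captions": ["a dog"]}

clean = check_parity(notebook_out, module_out)
drifted = check_parity(notebook_out, {**module_out, "logits": [1.01, -2.0]})
print(clean)    # []
print(drifted)  # ['logits']
```

Reporting every failed check at once makes a red gate easier to diagnose, since a tokenizer mismatch usually drags the caption check down with it.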
Phase 1b: Quality improvements (only after parity is green)
- Masked accuracy metric (notebook tracks loss only).
- Beam search inference.
- Warmup + cosine LR schedule (replaces bare Adam).
- CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
- vocab.json sidecar alongside vocab.pkl.
- Label smoothing.
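As an illustration of the warmup + cosine item, the schedule can be written as a pure function of the step index. This is one common formulation (linear warmup to a peak, then half-cosine decay), not necessarily the variant the trainer will adopt; the function name and all numeric values below are assumptions.

```python
import math


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr exactly.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Illustrative hyperparameters only.
schedule = [warmup_cosine_lr(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
print(schedule[0], schedule[99], schedule[550], schedule[1000])
```

In a Keras trainer the same arithmetic would typically live inside a learning-rate schedule object passed to the optimizer; keeping it as a plain function first makes it straightforward to unit-test.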
3. Implementation Roadmap
| Phase | Deliverable | Effort | Recruiter signal |
|---|---|---|---|
| 0 | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| 1 | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via docker compose up |
| 2 | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| 3 | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| 4 | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |
Future work (out of scope for v1)
- Tier 2: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
- Tier 3: Anthropic Claude vision endpoint as a "Frontier" tab.
- Tier 4: VQA "Ask the image" extension reusing Tier 3 infra.
- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.
4. Deployment Stack (free-tier)
| Layer | Service | Why |
|---|---|---|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use *.hf.space and *.vercel.app) | $0 budget |
5. Trade-offs Decided
| Decision | Alternative rejected | Reason |
|---|---|---|
| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1-3 active configs |
| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| tensorflow-cpu==2.15.0 pinned | Floating TF | TF 2.16 broke Keras 2 compat with notebook |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | 33% overhead, no streaming |
| --workers 1 uvicorn | Multi-worker | TF graph + InceptionV3 ×N OOMs |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
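The 33% figure in the multipart row follows from how Base64 works: every 3 input bytes become 4 output characters. A quick check, using a zero-filled buffer as a stand-in for an uploaded image (the 300,000-byte size is arbitrary):

```python
import base64

raw = bytes(300_000)  # placeholder payload; a real upload would be JPEG bytes
encoded = base64.b64encode(raw)

overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} raw bytes -> {len(encoded)} base64 bytes ({overhead:.1%} overhead)")
# 300000 raw bytes -> 400000 base64 bytes (33.3% overhead)
```

The ratio is independent of the payload contents, and wrapping the encoded string in JSON adds quoting on top of it.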
6. Verification Plan
Phase 1:
- pytest tests/ -v → all green; coverage ≥ 70% on src/captioning/.
- python scripts/notebook_module_audit.py → parity assertions all pass.
- docker compose up → curl -F "file=@sample.jpg" http://localhost:8000/v1/captions returns valid caption JSON.
Phase 2:
- GitHub Actions ci.yml green on a PR.
- HF Space URL serves /v1/model/info.
- Vercel preview URL renders frontend; uploading a sample image returns a caption.
Phase 3:
- GET /v1/models returns 4 entries.
- POST /v1/compare returns 4 captions; total latency < 15s on HF Space CPU.
- model-eval.yml posts a BLEU comparison comment on a test PR.
Phase 4:
- /metrics exposes caption_inference_seconds histogram.
- DagsHub MLflow link shows ≥ 1 logged run with metrics.
- make freeze-paper-notebook fails when notebook bytes change; passes when restored.
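One plausible way to implement the byte-freeze check behind make freeze-paper-notebook is to pin a SHA-256 digest of the notebook and compare against it in CI. The plan does not say how the target is implemented, so the helper names and the pinned-digest approach below are assumptions; the sketch exercises the "fails on change, passes when restored" behaviour on a temporary file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(path, expected_sha256):
    """True when the file's bytes still match the pinned digest."""
    return sha256_of(path) == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "paper.ipynb"  # stand-in for the IEEE notebook
    notebook.write_bytes(b'{"cells": []}\n')
    pinned = sha256_of(notebook)          # digest recorded at freeze time

    unchanged = notebook_is_frozen(notebook, pinned)
    notebook.write_bytes(b'{"cells": []}\n\n')  # a single extra newline
    changed = notebook_is_frozen(notebook, pinned)
    notebook.write_bytes(b'{"cells": []}\n')    # restore original bytes
    restored = notebook_is_frozen(notebook, pinned)

print(unchanged, changed, restored)  # True False True
```

A CI job would exit non-zero when the check returns False, which also catches incidental rewrites such as an editor re-serialising the notebook JSON.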