# Production Restructuring Plan

> Public, in-repo copy of the engineering plan that drives the transition from
> a single-notebook research project into a deployable multimodal AI platform.
> The original (with internal exploration notes) lives in the developer's
> `~/.claude/plans/` directory; this version is the canonical public artefact.

## Context

This repository is the engineering home of an IEEE-published image-captioning
research project. The published artefact is a single Jupyter notebook
([`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb))
implementing **InceptionV3 (frozen) + custom Keras Transformer decoder**
trained on **COCO 2017**, reporting **BLEU ~24**.

**Goal**: convert the repo into a recruiter-grade, production-style
multimodal AI platform with a live free-tier demo, while **preserving the
IEEE notebook byte-for-byte** as the canonical research artefact.

**Constraints**:

- Hosting budget: **$0/month**, using HuggingFace Spaces (backend) + Vercel free
  (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
- Multimodal scope (v1): **Tier 1 only**, which adds three pretrained HuggingFace
  models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison
  demo. Tier 2/3/4 are listed under *Future work* only.

---

## 1. Folder Structure (target)

```
image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
├── src/captioning/                             # Installable Python package
│   ├── config/                                 # Pydantic settings + YAML loader
│   ├── data/                                   # COCO loaders, preprocess, splits
│   ├── tokenizer/                              # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/                                 # CNN encoder, Transformer decoder, factory
│   ├── training/                               # Trainer, losses, metrics, callbacks
│   ├── inference/                              # Greedy + beam search predictors
│   ├── evaluation/                             # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                                     # Checkpoints, image decoding, HF Hub I/O
│   └── utils/                                  # Logging, seeding, timing
├── configs/                                    # YAML hyperparameters (validated by Pydantic)
├── scripts/                                    # CLI entrypoints (train, eval, predict, upload)
├── models/                                     # Local checkpoint registry (gitignored content)
├── backend/                                    # FastAPI service (depends on src/captioning)
├── frontend/                                   # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                                      # ML-core tests (unit + integration)
├── docs/                                       # Architecture, ADRs, results, deployment
├── .github/workflows/                          # CI, CD, model-eval
├── docker-compose.yml                          # Local dev: backend + frontend + mlflow
├── pyproject.toml                              # Single source of truth for the package
└── Makefile                                    # Discoverable command index
```

**Key architectural rules**:

- `src/captioning/` is the ML core; `backend/app/` imports from it. Never
  reverse the dependency.
- The IEEE notebook is **frozen**: `make freeze-paper-notebook` is a CI
  check that fails on any byte change.
- Model weights are **never committed**; they live in HuggingFace Hub
  (`yourname/captioning-weights`) and are downloaded at backend startup.
- Configuration is **YAML files validated by Pydantic v2 `BaseSettings`**
  (shipped in the separate `pydantic-settings` package since v2), not
  Hydra. Env vars override via `CAPTIONING__TRAIN__BATCH_SIZE=32` syntax.
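
The environment-override rule in the last bullet can be illustrated with a standard-library sketch. It is not the Pydantic implementation; it only mimics how a variable such as `CAPTIONING__TRAIN__BATCH_SIZE` could map onto a nested config key. The function name `apply_env_overrides`, the prefix and delimiter defaults, the coercion rule, and the sample default values are all assumptions made for illustration.

```python
import copy
from typing import Any, Dict, Mapping


def apply_env_overrides(
    config: Dict[str, Any],
    environ: Mapping[str, str],
    prefix: str = "CAPTIONING__",  # assumed prefix, inferred from the example variable
    delimiter: str = "__",         # assumed nesting delimiter
) -> Dict[str, Any]:
    """Return a copy of `config` with matching environment variables applied."""
    merged = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable such as PATH
        path = name[len(prefix):].lower().split(delimiter)
        node = merged
        for part in path[:-1]:
            node = node.setdefault(part, {})
        leaf = path[-1]
        current = node.get(leaf)
        # Coerce only ints and floats to the type of the existing default;
        # everything else stays a string. Pydantic would validate more strictly.
        if isinstance(current, int) and not isinstance(current, bool):
            node[leaf] = int(raw)
        elif isinstance(current, float):
            node[leaf] = float(raw)
        else:
            node[leaf] = raw
    return merged


# Illustrative defaults only; these are not the notebook's hyperparameters.
yaml_defaults = {"train": {"batch_size": 64, "learning_rate": 1e-4}}
env = {"CAPTIONING__TRAIN__BATCH_SIZE": "32", "PATH": "/usr/bin"}
effective = apply_env_overrides(yaml_defaults, env)
print(effective["train"]["batch_size"])
```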

---

## 2. Migration Strategy

**Approach: verbatim refactor first, improvements second.** Reproducibility
of the IEEE BLEU score is non-negotiable; behaviour parity must be proven
*before* any improvement is made.

### Phase 1a: "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)

| Step | Notebook cell | Target module |
|---|---|---|
| 1 | Hyperparams | `configs/base.yaml` + `src/captioning/config/schema.py` |
| 2 | Caption preprocess | `data/preprocess.py::preprocess_caption` |
| 3 | COCO loader | `data/coco.py::load_coco_annotations` |
| 4 | Tokenizer | `tokenizer/vectorizer.py::CaptionTokenizer` |
| 5 | Splits | `data/splits.py::make_splits(seed=...)` |
| 6 | Image preprocess | `data/preprocess.py::preprocess_image` |
| 7 | tf.data pipeline | `data/pipeline.py::build_{train,val}_pipeline` |
| 8 | Augmentation | `data/augmentation.py::default_augmentation` |
| 9 | InceptionV3 encoder | `models/encoder_cnn.py` |
| 10 | Transformer encoder | `models/transformer_encoder.py` |
| 11 | Embeddings | `models/embeddings.py` |
| 12 | Transformer decoder | `models/transformer_decoder.py` |
| 13 | Captioning model | `models/captioning_model.py` |
| 14 | Wiring | `models/factory.py::build_caption_model(config)` |
| 15 | Loss + compile | `training/losses.py` + `training/trainer.py` |
| 16 | Fit | `training/trainer.py::Trainer.fit` |
| 17 | Inference | `inference/greedy.py`, `inference/predictor.py` |
| 18 | Save weights | `io/checkpoints.py` + `scripts/train.py` |

### Parity validation gate

`scripts/notebook_module_audit.py` runs both pipelines on a fixed 100-image
fixture and asserts:

- Tokenizer vocabulary identical (set equality).
- Image preprocessing tensor-equal (`np.allclose`, atol=1e-5).
- Model output logits equal at fixed weights (atol=1e-4).
- Captions on 20 fixed images byte-equal between notebook and module path.
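
The four assertions above could be organised roughly as follows. This is a minimal sketch, not the contents of `scripts/notebook_module_audit.py`: the `assert_parity` helper and the dictionary keys are hypothetical, and a pure-Python `allclose` stands in for `np.allclose` (same tolerance formula, `|a - b| <= atol + rtol * |b|`, flat sequences only) so the example runs without NumPy.

```python
from typing import Any, Dict, Sequence


def allclose(a: Sequence[float], b: Sequence[float],
             atol: float, rtol: float = 1e-5) -> bool:
    """Pure-Python stand-in for np.allclose on flat sequences."""
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )


def assert_parity(notebook: Dict[str, Any], module: Dict[str, Any]) -> None:
    """Raise AssertionError on the first parity check that fails."""
    # 1. Tokenizer vocabulary: set equality, order is ignored.
    assert set(notebook["vocab"]) == set(module["vocab"]), "vocabulary mismatch"
    # 2. Image preprocessing: element-wise closeness at atol=1e-5.
    assert allclose(notebook["pixels"], module["pixels"], atol=1e-5), \
        "image preprocessing mismatch"
    # 3. Model logits at fixed weights: atol=1e-4.
    assert allclose(notebook["logits"], module["logits"], atol=1e-4), \
        "logit mismatch"
    # 4. Captions: byte-for-byte equality.
    nb_bytes = [c.encode("utf-8") for c in notebook["captions"]]
    md_bytes = [c.encode("utf-8") for c in module["captions"]]
    assert nb_bytes == md_bytes, "caption mismatch"


# Toy outputs standing in for the two pipelines (hypothetical values).
notebook_out = {
    "vocab": ["<start>", "a", "dog", "<end>"],
    "pixels": [0.25, -0.5, 1.0],
    "logits": [0.5, -1.25],
    "captions": ["a dog"],
}
module_out = {
    "vocab": ["a", "dog", "<end>", "<start>"],  # same set, different order
    "pixels": [0.25, -0.5, 1.0],
    "logits": [0.50005, -1.25],                 # within atol=1e-4
    "captions": ["a dog"],
}
assert_parity(notebook_out, module_out)
```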

### Phase 1b: Quality improvements (only after parity is green)

1. Masked accuracy metric (notebook tracks loss only).
2. Beam search inference.
3. Warmup + cosine LR schedule (replaces bare Adam).
4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
5. `vocab.json` sidecar alongside `vocab.pkl`.
6. Label smoothing.
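
As an illustration of item 3, a warmup + cosine schedule can be written as a plain function of the step count. This is a framework-free sketch under assumed hyperparameters; the function name `warmup_cosine_lr` and the values for `base_lr`, `warmup_steps`, and `total_steps` are hypothetical and not taken from the notebook or the plan.

```python
import math


def warmup_cosine_lr(step: int, base_lr: float,
                     warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
BASE_LR, WARMUP, TOTAL = 1e-3, 1_000, 10_000
for step in (0, 500, 1_000, 5_500, 10_000):
    print(step, warmup_cosine_lr(step, BASE_LR, WARMUP, TOTAL))
```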

---

## 3. Implementation Roadmap

| Phase | Deliverable | Effort | Recruiter signal |
|---|---|---|---|
| **0** | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| **1** | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via `docker compose up` |
| **2** | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| **3** | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| **4** | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |

### Future work (out of scope for v1)

- **Tier 2**: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
- **Tier 3**: Anthropic Claude vision endpoint as a "Frontier" tab.
- **Tier 4**: VQA "Ask the image" extension reusing Tier 3 infra.
- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.

---

## 4. Deployment Stack (free-tier)

| Layer | Service | Why |
|---|---|---|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use `*.hf.space` and `*.vercel.app`) | $0 budget |

---

## 5. Trade-offs Decided

| Decision | Alternative rejected | Reason |
|---|---|---|
| FastAPI | Flask | Native async, generated OpenAPI docs, Pydantic validation, lifespan hooks |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1–3 active configs |
| MLflow on DagsHub | W&B | DagsHub is free for public projects; recruiters need no login |
| Keep TextVectorization | HF tokenizer in v1 | An HF tokenizer changes the vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| `tensorflow-cpu==2.15.0` pinned | Floating TF | TF 2.16 made Keras 3 the default, breaking Keras 2 compatibility with the notebook |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | Base64 adds ~33% size overhead and cannot be streamed |
| `--workers 1` uvicorn | Multi-worker | Each worker loads its own TF graph + InceptionV3; ×N copies OOM |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
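
The 33% figure in the multipart row follows from how Base64 works: every 3 input bytes become 4 output characters. A quick standard-library check, using an arbitrary 3000-byte payload as a stand-in for an image:

```python
import base64

raw = bytes(3000)                      # stand-in for image bytes
encoded = base64.b64encode(raw)        # 3 input bytes -> 4 output bytes
overhead = len(encoded) / len(raw) - 1

print(f"{len(raw)} raw bytes -> {len(encoded)} base64 bytes ({overhead:.0%} overhead)")
# prints: 3000 raw bytes -> 4000 base64 bytes (33% overhead)
```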

---

## 6. Verification Plan

**Phase 1**:

- `pytest tests/ -v` → all green; coverage ≥ 70% on `src/captioning/`.
- `python scripts/notebook_module_audit.py` → parity assertions all pass.
- `docker compose up` → `curl -F "file=@sample.jpg" http://localhost:8000/v1/captions`
  returns valid caption JSON.
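
For scripted checks, the `curl -F` call above can be approximated with the standard library alone. The sketch below builds the same kind of `multipart/form-data` request with a single `file` field; the helper names `encode_multipart` and `build_caption_request` are hypothetical, the image bytes are fake, and the shape of the JSON response is not assumed. Sending the request is left commented out because it needs the backend running.

```python
import urllib.request
import uuid
from typing import Tuple


def encode_multipart(field: str, filename: str, payload: bytes,
                     content_type: str = "image/jpeg") -> Tuple[bytes, str]:
    """Encode one file field; return (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_caption_request(url: str, image_bytes: bytes,
                          filename: str = "sample.jpg") -> urllib.request.Request:
    """Build a POST request equivalent to `curl -F "file=@sample.jpg" <url>`."""
    body, header = encode_multipart("file", filename, image_bytes)
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )


request = build_caption_request(
    "http://localhost:8000/v1/captions", b"\xff\xd8\xff fake jpeg bytes"
)
# To send it (requires `docker compose up` to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```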

**Phase 2**:

- GitHub Actions `ci.yml` green on a PR.
- HF Space URL serves `/v1/model/info`.
- Vercel preview URL renders frontend; uploading a sample image returns a caption.

**Phase 3**:

- `GET /v1/models` returns 4 entries.
- `POST /v1/compare` returns 4 captions; total latency < 15s on HF Space CPU.
- `model-eval.yml` posts a BLEU comparison comment on a test PR.
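
The Phase 3 latency check can be prototyped before any real model is wired in. The sketch below runs a registry of captioners one after another (consistent with a single-worker backend) and records per-model wall time. Everything here is an assumption for illustration: the `compare` helper, the registry keys, the result shape, and the stub captioners, which return fixed strings instead of running models.

```python
import time
from typing import Callable, Dict

Captioner = Callable[[bytes], str]


def compare(image: bytes, models: Dict[str, Captioner]) -> Dict[str, Dict[str, object]]:
    """Run each captioner sequentially; record its caption and wall time."""
    results: Dict[str, Dict[str, object]] = {}
    for name, captioner in models.items():
        start = time.perf_counter()
        caption = captioner(image)
        results[name] = {
            "caption": caption,
            "seconds": time.perf_counter() - start,
        }
    return results


# Stub captioners standing in for the IEEE model and the three HF baselines.
models: Dict[str, Captioner] = {
    "ieee-inceptionv3-transformer": lambda image: "stub caption 1",
    "blip-base": lambda image: "stub caption 2",
    "vit-gpt2": lambda image: "stub caption 3",
    "git-base-coco": lambda image: "stub caption 4",
}

results = compare(b"fake image bytes", models)
total_seconds = sum(float(r["seconds"]) for r in results.values())

# Mirrors the two Phase 3 checks: four captions, total latency under 15 s.
assert len(results) == 4
assert total_seconds < 15.0
```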

**Phase 4**:

- `/metrics` exposes `caption_inference_seconds` histogram.
- DagsHub MLflow link shows ≥ 1 logged run with metrics.
- `make freeze-paper-notebook` fails when notebook bytes change; passes when restored.
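
One way a byte-level freeze check like `make freeze-paper-notebook` could work is to pin a SHA-256 digest of the notebook and compare against it in CI. The plan does not say how the target is implemented, so the hashing approach and the helper names `sha256_of` and `notebook_is_frozen` are assumptions; the demo uses a temporary file rather than the real notebook.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file in 1 MiB chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(path: Path, expected_sha256: str) -> bool:
    """True only if the file's bytes still match the pinned digest."""
    return sha256_of(path) == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "paper.ipynb"     # stand-in for the IEEE notebook
    original = b'{"cells": []}'
    notebook.write_bytes(original)
    pinned = sha256_of(notebook)             # would be committed alongside the repo

    unchanged = notebook_is_frozen(notebook, pinned)

    notebook.write_bytes(b'{"cells": [] }')  # a single added byte
    after_edit = notebook_is_frozen(notebook, pinned)

    notebook.write_bytes(original)           # restore the original bytes
    after_restore = notebook_is_frozen(notebook, pinned)

print(unchanged, after_edit, after_restore)
```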