# Production Restructuring Plan
> Public, in-repo copy of the engineering plan that drives the transition from
> a single-notebook research project into a deployable multimodal AI platform.
> The original (with internal exploration notes) lives in the developer's
> `~/.claude/plans/` directory; this version is the canonical public artefact.
## Context
This repository is the engineering home of an IEEE-published image-captioning
research project. The published artefact is a single Jupyter notebook
([`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb))
implementing **InceptionV3 (frozen) + custom Keras Transformer decoder**
trained on **COCO 2017**, reporting **BLEU ~24**.
**Goal**: convert the repo into a recruiter-grade, production-style
multimodal AI platform with a live free-tier demo, while **preserving the
IEEE notebook byte-for-byte** as the canonical research artefact.
**Constraints**:
- Hosting budget: **$0/month**, using HuggingFace Spaces (backend) + Vercel free
(frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
- Multimodal scope (v1): **Tier 1 only**, adding three pretrained HuggingFace
models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison
demo. Tier 2/3/4 are listed under *Future work* only.
---
## 1. Folder Structure (target)
```
image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb  # FROZEN
├── src/captioning/         # Installable Python package
│   ├── config/             # Pydantic settings + YAML loader
│   ├── data/               # COCO loaders, preprocess, splits
│   ├── tokenizer/          # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/             # CNN encoder, Transformer decoder, factory
│   ├── training/           # Trainer, losses, metrics, callbacks
│   ├── inference/          # Greedy + beam search predictors
│   ├── evaluation/         # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                 # Checkpoints, image decoding, HF Hub I/O
│   └── utils/              # Logging, seeding, timing
├── configs/                # YAML hyperparameters (validated by Pydantic)
├── scripts/                # CLI entrypoints (train, eval, predict, upload)
├── models/                 # Local checkpoint registry (gitignored content)
├── backend/                # FastAPI service (depends on src/captioning)
├── frontend/               # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                  # ML-core tests (unit + integration)
├── docs/                   # Architecture, ADRs, results, deployment
├── .github/workflows/      # CI, CD, model-eval
├── docker-compose.yml      # Local dev: backend + frontend + mlflow
├── pyproject.toml          # Single source of truth for the package
└── Makefile                # Discoverable command index
```
**Key architectural rules**:
- `src/captioning/` is the ML core; `backend/app/` imports from it. Never
reverse the dependency.
- The IEEE notebook is **frozen**: `make freeze-paper-notebook` is a CI
check that fails on any byte change.
- Model weights are **never committed**; they live in HuggingFace Hub
(`yourname/captioning-weights`) and are downloaded at backend startup.
- Configuration is **YAML files validated by Pydantic v2 BaseSettings**, not
Hydra. Env vars override via `CAPTIONING__TRAIN__BATCH_SIZE=32` syntax.
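
The override convention in the last rule can be illustrated with a stdlib-only sketch. It is not the project's loader: the plan delegates this to Pydantic v2 settings, which support the same naming through an env prefix and a nested delimiter. The function name `apply_env_overrides`, the helper `_coerce`, and the sample keys are hypothetical and exist only to show how `CAPTIONING__TRAIN__BATCH_SIZE=32` would map onto a nested YAML value.

```python
import copy
from typing import Any, Mapping

PREFIX = "CAPTIONING__"  # assumed from the example variable name in the plan
DELIMITER = "__"         # assumed nesting delimiter


def _coerce(raw: str, current: Any) -> Any:
    """Cast the env string to the type of the existing config value, if any."""
    if isinstance(current, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes"}
    if isinstance(current, int):
        return int(raw)
    if isinstance(current, float):
        return float(raw)
    return raw


def apply_env_overrides(config: Mapping[str, Any], environ: Mapping[str, str]) -> dict:
    """Return a deep copy of `config` with CAPTIONING__A__B=value overrides applied."""
    merged = copy.deepcopy(dict(config))
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        path = name[len(PREFIX):].lower().split(DELIMITER)
        if not all(path):
            continue  # malformed name such as "CAPTIONING____X"
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})  # create missing sections on the way down
        node[path[-1]] = _coerce(raw, node.get(path[-1]))
    return merged


if __name__ == "__main__":
    yaml_values = {"train": {"batch_size": 64, "learning_rate": 1e-4}}  # hypothetical
    env = {"CAPTIONING__TRAIN__BATCH_SIZE": "32"}
    print(apply_env_overrides(yaml_values, env))
```

In the real package, validation and type coercion would come from the Pydantic schema rather than from the existing value's type, which is the main simplification here.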
---
## 2. Migration Strategy
**Approach: verbatim refactor first, improvements second.** Reproducibility
of the IEEE BLEU score is non-negotiable; behaviour parity must be proven
*before* any improvement is made.
### Phase 1a: "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)
| Step | Notebook cell | Target module |
|---|---|---|
| 1 | Hyperparams | `configs/base.yaml` + `src/captioning/config/schema.py` |
| 2 | Caption preprocess | `data/preprocess.py::preprocess_caption` |
| 3 | COCO loader | `data/coco.py::load_coco_annotations` |
| 4 | Tokenizer | `tokenizer/vectorizer.py::CaptionTokenizer` |
| 5 | Splits | `data/splits.py::make_splits(seed=...)` |
| 6 | Image preprocess | `data/preprocess.py::preprocess_image` |
| 7 | tf.data pipeline | `data/pipeline.py::build_{train,val}_pipeline` |
| 8 | Augmentation | `data/augmentation.py::default_augmentation` |
| 9 | InceptionV3 encoder | `models/encoder_cnn.py` |
| 10 | Transformer encoder | `models/transformer_encoder.py` |
| 11 | Embeddings | `models/embeddings.py` |
| 12 | Transformer decoder | `models/transformer_decoder.py` |
| 13 | Captioning model | `models/captioning_model.py` |
| 14 | Wiring | `models/factory.py::build_caption_model(config)` |
| 15 | Loss + compile | `training/losses.py` + `training/trainer.py` |
| 16 | Fit | `training/trainer.py::Trainer.fit` |
| 17 | Inference | `inference/greedy.py`, `inference/predictor.py` |
| 18 | Save weights | `io/checkpoints.py` + `scripts/train.py` |
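
Step 5 maps the notebook's split logic to `make_splits(seed=...)`. The plan does not show that logic, so the following is only a sketch of what a seeded, order-independent split could look like. The `val_fraction` parameter and the two-way train/val return shape are assumptions, and a true lift-and-shift would have to reproduce the notebook's own shuffling exactly to keep parity.

```python
import random
from typing import Sequence


def make_splits(
    image_ids: Sequence[str],
    seed: int,
    val_fraction: float = 0.2,  # hypothetical default, not taken from the notebook
) -> tuple[list[str], list[str]]:
    """Deterministically split image ids into (train, val) lists."""
    ids = sorted(image_ids)      # sort so the result does not depend on input order
    rng = random.Random(seed)    # local RNG: global random state is left untouched
    rng.shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]


if __name__ == "__main__":
    train, val = make_splits([f"img_{i:03d}" for i in range(10)], seed=42)
    print(len(train), len(val))
```

Using a dedicated `random.Random` instance keeps the split reproducible even when other code reseeds or consumes the global generator.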
### Parity validation gate
`scripts/notebook_module_audit.py` runs both pipelines on a fixed 100-image
fixture and asserts:
- Tokenizer vocabulary identical (set equality).
- Image preprocessing tensor-equal (`np.allclose`, atol=1e-5).
- Model output logits equal at fixed weights (atol=1e-4).
- Captions on 20 fixed images byte-equal between notebook and module path.
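
The assertions above could be expressed roughly as follows. This is a stdlib-only sketch, not the contents of `scripts/notebook_module_audit.py`: the helper names are hypothetical, and `math.isclose` with a purely absolute tolerance stands in for `np.allclose` (which also applies a relative tolerance by default) so that the example runs without NumPy.

```python
import math
from typing import Iterable, Sequence


def assert_same_vocab(notebook_vocab: Iterable[str], module_vocab: Iterable[str]) -> None:
    """Set equality: token order is deliberately ignored."""
    a, b = set(notebook_vocab), set(module_vocab)
    assert a == b, f"vocabulary mismatch, e.g. {sorted(a ^ b)[:10]}"


def assert_close(notebook_values: Sequence[float], module_values: Sequence[float], atol: float) -> None:
    """Element-wise absolute-tolerance check on flattened tensors."""
    assert len(notebook_values) == len(module_values), "length mismatch"
    for i, (x, y) in enumerate(zip(notebook_values, module_values)):
        assert math.isclose(x, y, rel_tol=0.0, abs_tol=atol), f"index {i}: {x} vs {y}"


def assert_same_captions(notebook_caps: Sequence[str], module_caps: Sequence[str]) -> None:
    """Byte-level equality of generated captions."""
    assert len(notebook_caps) == len(module_caps), "caption count mismatch"
    for i, (a, b) in enumerate(zip(notebook_caps, module_caps)):
        assert a.encode("utf-8") == b.encode("utf-8"), f"caption {i}: {a!r} vs {b!r}"


if __name__ == "__main__":
    assert_same_vocab(["<start>", "a", "dog"], ["dog", "a", "<start>"])
    assert_close([0.25, 0.5], [0.25, 0.5 + 5e-6], atol=1e-5)   # image preprocessing
    assert_close([1.2, -0.7], [1.2, -0.7 + 5e-5], atol=1e-4)   # logits at fixed weights
    assert_same_captions(["a dog on a beach"], ["a dog on a beach"])
    print("parity checks passed")
```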
### Phase 1b: Quality improvements (only after parity is green)
1. Masked accuracy metric (notebook tracks loss only).
2. Beam search inference.
3. Warmup + cosine LR schedule (replaces bare Adam).
4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
5. `vocab.json` sidecar alongside `vocab.pkl`.
6. Label smoothing.
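
As an illustration of item 3, a linear warmup followed by cosine decay can be written as a pure function of the training step. The plan does not specify the peak rate, warmup length, or floor, so every number below is a placeholder. In the Keras trainer this logic would presumably live inside a learning-rate schedule object rather than a free function.

```python
import math


def warmup_cosine_lr(
    step: int,
    peak_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay to `min_lr`."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)  # 0 at peak, 1 at the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    # Placeholder hyperparameters, chosen only to make the shape visible.
    for s in (0, 9, 10, 60, 110):
        print(s, warmup_cosine_lr(s, peak_lr=1e-3, warmup_steps=10, total_steps=110))
```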
---
## 3. Implementation Roadmap
| Phase | Deliverable | Effort | Recruiter signal |
|---|---|---|---|
| **0** | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| **1** | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via `docker compose up` |
| **2** | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| **3** | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| **4** | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |
### Future work (out of scope for v1)
- **Tier 2**: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
- **Tier 3**: Anthropic Claude vision endpoint as a "Frontier" tab.
- **Tier 4**: VQA "Ask the image" extension reusing Tier 3 infra.
- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.
---
## 4. Deployment Stack (free-tier)
| Layer | Service | Why |
|---|---|---|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use `*.hf.space` and `*.vercel.app`) | $0 budget |
---
## 5. Trade-offs Decided
| Decision | Alternative rejected | Reason |
|---|---|---|
| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1-3 active configs |
| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| `tensorflow-cpu==2.15.0` pinned | Floating TF | TF 2.16 broke Keras 2 compat with notebook |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | 33% overhead, no streaming |
| `--workers 1` uvicorn | Multi-worker | TF graph + InceptionV3 ×N OOMs |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
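
The 33% figure in the multipart row follows from Base64 mapping every 3 input bytes to 4 output characters. A quick check, using a random payload as a stand-in for an image (the 300,000-byte size is arbitrary):

```python
import base64
import os

payload = os.urandom(300_000)          # stand-in for raw image bytes
encoded = base64.b64encode(payload)    # what a JSON body would have to carry
overhead = len(encoded) / len(payload) - 1.0

print(f"{len(payload)} raw bytes -> {len(encoded)} base64 bytes ({overhead:.1%} larger)")
# prints: 300000 raw bytes -> 400000 base64 bytes (33.3% larger)
```

A multipart upload sends the raw bytes plus a small boundary header, so the payload on the wire stays close to the original file size.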
---
## 6. Verification Plan
**Phase 1**:
- `pytest tests/ -v` → all green; coverage ≥ 70% on `src/captioning/`.
- `python scripts/notebook_module_audit.py` → parity assertions all pass.
- After `docker compose up`, `curl -F "file=@sample.jpg" http://localhost:8000/v1/captions`
returns valid caption JSON.
**Phase 2**:
- GitHub Actions `ci.yml` green on a PR.
- HF Space URL serves `/v1/model/info`.
- Vercel preview URL renders frontend; uploading a sample image returns a caption.
**Phase 3**:
- `GET /v1/models` returns 4 entries.
- `POST /v1/compare` returns 4 captions; total latency < 15s on HF Space CPU.
- `model-eval.yml` posts a BLEU comparison comment on a test PR.
**Phase 4**:
- `/metrics` exposes `caption_inference_seconds` histogram.
- DagsHub MLflow link shows ≥ 1 logged run with metrics.
- `make freeze-paper-notebook` fails when notebook bytes change; passes when restored.
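
One way the freeze check in the last bullet could work is a pinned SHA-256 digest compared against the notebook's current bytes. The plan does not say how `make freeze-paper-notebook` is implemented, so this is a sketch under that assumption; `sha256_of` and `notebook_is_frozen` are hypothetical names, and where the pinned digest is stored is left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(path: Path, expected_sha256: str) -> bool:
    """True when the file's bytes still match the pinned digest."""
    return sha256_of(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python check_freeze.py <notebook path> <pinned sha256>
    notebook, pinned = Path(sys.argv[1]), sys.argv[2]
    if not notebook_is_frozen(notebook, pinned):
        sys.exit(f"{notebook} has changed; restore it before merging")
```

Because the comparison is on raw bytes, even re-saving the notebook with different whitespace or cell metadata would fail the check, which matches the "any byte change" rule.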