Spaces:

apoorvrajdev
/

image-captioning-api

Configuration error

File size: 9,681 Bytes

b2594db

# Production Restructuring Plan

> Public, in-repo copy of the engineering plan that drives the transition from
> a single-notebook research project into a deployable multimodal AI platform.
> The original (with internal exploration notes) lives in the developer's
> `~/.claude/plans/` directory; this version is the canonical public artefact.

## Context

This repository is the engineering home of an IEEE-published image-captioning
research project. The published artefact is a single Jupyter notebook
([`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb))
implementing **InceptionV3 (frozen) + custom Keras Transformer decoder**
trained on **COCO 2017**, reporting **BLEU ~24**.

**Goal**: convert the repo into a recruiter-grade, production-style
multimodal AI platform with a live free-tier demo, while **preserving the
IEEE notebook byte-for-byte** as the canonical research artefact.

**Constraints**:

- Hosting budget: **$0/month** → HuggingFace Spaces (backend) + Vercel free
  (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
- Multimodal scope (v1): **Tier 1 only** — add three pretrained HuggingFace
  models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison
  demo. Tier 2/3/4 are listed under *Future work* only.

---

## 1. Folder Structure (target)

```
image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
├── src/captioning/                             # Installable Python package
│   ├── config/                                 # Pydantic settings + YAML loader
│   ├── data/                                   # COCO loaders, preprocess, splits
│   ├── tokenizer/                              # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/                                 # CNN encoder, Transformer decoder, factory
│   ├── training/                               # Trainer, losses, metrics, callbacks
│   ├── inference/                              # Greedy + beam search predictors
│   ├── evaluation/                             # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                                     # Checkpoints, image decoding, HF Hub I/O
│   └── utils/                                  # Logging, seeding, timing
├── configs/                                    # YAML hyperparameters (validated by Pydantic)
├── scripts/                                    # CLI entrypoints (train, eval, predict, upload)
├── models/                                     # Local checkpoint registry (gitignored content)
├── backend/                                    # FastAPI service (depends on src/captioning)
├── frontend/                                   # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                                      # ML-core tests (unit + integration)
├── docs/                                       # Architecture, ADRs, results, deployment
├── .github/workflows/                          # CI, CD, model-eval
├── docker-compose.yml                          # Local dev: backend + frontend + mlflow
├── pyproject.toml                              # Single source of truth for the package
└── Makefile                                    # Discoverable command index
```

**Key architectural rules**:

- `src/captioning/` is the ML core; `backend/app/` imports from it. Never
  reverse the dependency.
- The IEEE notebook is **frozen** — `make freeze-paper-notebook` is a CI
  check that fails on any byte change.
- Model weights are **never committed**; they live in HuggingFace Hub
  (`yourname/captioning-weights`) and are downloaded at backend startup.
- Configuration is **YAML files validated by Pydantic v2 BaseSettings**, not
  Hydra. Env vars override via `CAPTIONING__TRAIN__BATCH_SIZE=32` syntax.

---

## 2. Migration Strategy

**Approach: verbatim refactor first, improvements second.** Reproducibility
of the IEEE BLEU score is non-negotiable; behaviour parity must be proven
*before* any improvement is made.

### Phase 1a — "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)

| Step | Notebook cell | Target module |
|---|---|---|
| 1 | Hyperparams | `configs/base.yaml` + `src/captioning/config/schema.py` |
| 2 | Caption preprocess | `data/preprocess.py::preprocess_caption` |
| 3 | COCO loader | `data/coco.py::load_coco_annotations` |
| 4 | Tokenizer | `tokenizer/vectorizer.py::CaptionTokenizer` |
| 5 | Splits | `data/splits.py::make_splits(seed=...)` |
| 6 | Image preprocess | `data/preprocess.py::preprocess_image` |
| 7 | tf.data pipeline | `data/pipeline.py::build_{train,val}_pipeline` |
| 8 | Augmentation | `data/augmentation.py::default_augmentation` |
| 9 | InceptionV3 encoder | `models/encoder_cnn.py` |
| 10 | Transformer encoder | `models/transformer_encoder.py` |
| 11 | Embeddings | `models/embeddings.py` |
| 12 | Transformer decoder | `models/transformer_decoder.py` |
| 13 | Captioning model | `models/captioning_model.py` |
| 14 | Wiring | `models/factory.py::build_caption_model(config)` |
| 15 | Loss + compile | `training/losses.py` + `training/trainer.py` |
| 16 | Fit | `training/trainer.py::Trainer.fit` |
| 17 | Inference | `inference/greedy.py`, `inference/predictor.py` |
| 18 | Save weights | `io/checkpoints.py` + `scripts/train.py` |

### Parity validation gate

`scripts/notebook_module_audit.py` runs both pipelines on a fixed 100-image
fixture and asserts:

- Tokenizer vocabulary identical (set equality).
- Image preprocessing tensor-equal (`np.allclose`, atol=1e-5).
- Model output logits equal at fixed weights (atol=1e-4).
- Captions on 20 fixed images byte-equal between notebook and module path.

### Phase 1b — Quality improvements (only after parity is green)

1. Masked accuracy metric (notebook tracks loss only).
2. Beam search inference.
3. Warmup + cosine LR schedule (replaces bare Adam).
4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
5. `vocab.json` sidecar alongside `vocab.pkl`.
6. Label smoothing.

---

## 3. Implementation Roadmap

| Phase | Deliverable | Effort | Recruiter signal |
|---|---|---|---|
| **0** | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| **1** | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via `docker compose up` |
| **2** | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| **3** | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| **4** | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |

### Future work (out of scope for v1)

- **Tier 2**: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
- **Tier 3**: Anthropic Claude vision endpoint as a "Frontier" tab.
- **Tier 4**: VQA "Ask the image" extension reusing Tier 3 infra.
- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.

---

## 4. Deployment Stack (free-tier)

| Layer | Service | Why |
|---|---|---|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use `*.hf.space` and `*.vercel.app`) | $0 budget |

---

## 5. Trade-offs Decided

| Decision | Alternative rejected | Reason |
|---|---|---|
| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1–3 active configs |
| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| `tensorflow-cpu==2.15.0` pinned | Floating TF | TF 2.16 broke Keras 2 compat with notebook |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | 33% overhead, no streaming |
| `--workers 1` uvicorn | Multi-worker | TF graph + InceptionV3 ×N OOMs |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |

---

## 6. Verification Plan

**Phase 1**:

- `pytest tests/ -v` → all green; coverage ≥ 70% on `src/captioning/`.
- `python scripts/notebook_module_audit.py` → parity assertions all pass.
- `docker compose up` → `curl -F "file=@sample.jpg" http://localhost:8000/v1/captions`
  returns valid caption JSON.

**Phase 2**:

- GitHub Actions `ci.yml` green on a PR.
- HF Space URL serves `/v1/model/info`.
- Vercel preview URL renders frontend; uploading a sample image returns a caption.

**Phase 3**:

- `GET /v1/models` returns 4 entries.
- `POST /v1/compare` returns 4 captions; total latency < 15s on HF Space CPU.
- `model-eval.yml` posts a BLEU comparison comment on a test PR.

**Phase 4**:

- `/metrics` exposes `caption_inference_seconds` histogram.
- DagsHub MLflow link shows ≥ 1 logged run with metrics.
- `make freeze-paper-notebook` fails when notebook bytes change; passes when restored.