# Production Restructuring Plan

> Public, in-repo copy of the engineering plan that drives the transition from
> a single-notebook research project into a deployable multimodal AI platform.
> The original (with internal exploration notes) lives in the developer's
> `~/.claude/plans/` directory; this version is the canonical public artefact.

## Context

This repository is the engineering home of an IEEE-published image-captioning
research project. The published artefact is a single Jupyter notebook
([`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb))
implementing **InceptionV3 (frozen) + custom Keras Transformer decoder**
trained on **COCO 2017**, reporting **BLEU ~24**.

**Goal**: convert the repo into a recruiter-grade, production-style
multimodal AI platform with a live free-tier demo, while **preserving the
IEEE notebook byte-for-byte** as the canonical research artefact.

**Constraints**:

- Hosting budget: **$0/month** → HuggingFace Spaces (backend) + Vercel free
  (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
- Multimodal scope (v1): **Tier 1 only**. Add three pretrained HuggingFace
  models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison
  demo. Tiers 2, 3, and 4 are listed under *Future work* only.

---

## 1. Folder Structure (target)

```
image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
├── src/captioning/                             # Installable Python package
│   ├── config/                                 # Pydantic settings + YAML loader
│   ├── data/                                   # COCO loaders, preprocess, splits
│   ├── tokenizer/                              # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/                                 # CNN encoder, Transformer decoder, factory
│   ├── training/                               # Trainer, losses, metrics, callbacks
│   ├── inference/                              # Greedy + beam search predictors
│   ├── evaluation/                             # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                                     # Checkpoints, image decoding, HF Hub I/O
│   └── utils/                                  # Logging, seeding, timing
├── configs/                                    # YAML hyperparameters (validated by Pydantic)
├── scripts/                                    # CLI entrypoints (train, eval, predict, upload)
├── models/                                     # Local checkpoint registry (gitignored content)
├── backend/                                    # FastAPI service (depends on src/captioning)
├── frontend/                                   # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                                      # ML-core tests (unit + integration)
├── docs/                                       # Architecture, ADRs, results, deployment
├── .github/workflows/                          # CI, CD, model-eval
├── docker-compose.yml                          # Local dev: backend + frontend + mlflow
├── pyproject.toml                              # Single source of truth for the package
└── Makefile                                    # Discoverable command index
```

**Key architectural rules**:

- `src/captioning/` is the ML core; `backend/app/` imports from it. Never
  reverse the dependency.
- The IEEE notebook is **frozen**: `make freeze-paper-notebook` is a CI
  check that fails on any byte change.
- Model weights are **never committed**; they live in HuggingFace Hub
  (`yourname/captioning-weights`) and are downloaded at backend startup.
- Configuration is **YAML files validated by Pydantic v2 BaseSettings**, not
  Hydra. Env vars override via `CAPTIONING__TRAIN__BATCH_SIZE=32` syntax.
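
The `CAPTIONING__TRAIN__BATCH_SIZE=32` convention can be illustrated without Pydantic. The sketch below is a standard-library stand-in, not the project's loader. It takes the `CAPTIONING__` prefix and the `__` nesting delimiter from the rule above; the helper name `apply_env_overrides` and the sample YAML values are hypothetical. In the real package, Pydantic would presumably also coerce and validate the overridden values.

```python
from copy import deepcopy
from typing import Any, Dict, Mapping

PREFIX = "CAPTIONING__"  # prefix taken from the rule above
DELIMITER = "__"         # nesting delimiter taken from the rule above


def apply_env_overrides(base: Mapping[str, Any], environ: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `base` with CAPTIONING__SECTION__KEY=value overrides applied.

    Values stay strings here; a schema layer (Pydantic in this plan) is
    assumed to coerce and validate them afterwards.
    """
    merged = deepcopy(dict(base))
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        path = name[len(PREFIX):].lower().split(DELIMITER)
        node = merged
        for key in path[:-1]:
            node = node.setdefault(key, {})  # walk or create nested sections
        node[path[-1]] = value
    return merged


yaml_values = {"train": {"batch_size": 64, "epochs": 10}}  # hypothetical YAML content
env = {"CAPTIONING__TRAIN__BATCH_SIZE": "32", "HOME": "/root"}
config = apply_env_overrides(yaml_values, env)
print(config["train"])  # {'batch_size': '32', 'epochs': 10}
```

The override wins over the YAML value, unrelated variables are ignored, and the input mapping is left untouched.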

---

## 2. Migration Strategy

**Approach: verbatim refactor first, improvements second.** Reproducibility
of the IEEE BLEU score is non-negotiable; behaviour parity must be proven
*before* any improvement is made.

### Phase 1a: "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)

| Step | Notebook cell | Target module |
|---|---|---|
| 1 | Hyperparams | `configs/base.yaml` + `src/captioning/config/schema.py` |
| 2 | Caption preprocess | `data/preprocess.py::preprocess_caption` |
| 3 | COCO loader | `data/coco.py::load_coco_annotations` |
| 4 | Tokenizer | `tokenizer/vectorizer.py::CaptionTokenizer` |
| 5 | Splits | `data/splits.py::make_splits(seed=...)` |
| 6 | Image preprocess | `data/preprocess.py::preprocess_image` |
| 7 | tf.data pipeline | `data/pipeline.py::build_{train,val}_pipeline` |
| 8 | Augmentation | `data/augmentation.py::default_augmentation` |
| 9 | InceptionV3 encoder | `models/encoder_cnn.py` |
| 10 | Transformer encoder | `models/transformer_encoder.py` |
| 11 | Embeddings | `models/embeddings.py` |
| 12 | Transformer decoder | `models/transformer_decoder.py` |
| 13 | Captioning model | `models/captioning_model.py` |
| 14 | Wiring | `models/factory.py::build_caption_model(config)` |
| 15 | Loss + compile | `training/losses.py` + `training/trainer.py` |
| 16 | Fit | `training/trainer.py::Trainer.fit` |
| 17 | Inference | `inference/greedy.py`, `inference/predictor.py` |
| 18 | Save weights | `io/checkpoints.py` + `scripts/train.py` |
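
Step 5 matters for parity because a split that changes between runs changes the evaluation set. The following is a minimal sketch of a seeded, order-independent split. The notebook's actual split logic is not shown in this plan, so a parity-preserving port would have to reproduce that logic exactly rather than use this one. The signature, the `val_fraction` parameter, and the sample IDs are all assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def make_splits(
    image_ids: Sequence[str], seed: int, val_fraction: float = 0.2
) -> Tuple[List[str], List[str]]:
    """Deterministically split image IDs into (train, val) lists.

    Sorting first makes the result independent of input order, and a
    private random.Random instance leaves the global RNG state alone.
    """
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]


ids = [f"img_{i:03d}" for i in range(10)]       # hypothetical image IDs
train_a, val_a = make_splits(ids, seed=0)
train_b, val_b = make_splits(ids[::-1], seed=0)  # same IDs, different input order
print(len(train_a), len(val_a), val_a == val_b)  # 8 2 True
```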

### Parity validation gate

`scripts/notebook_module_audit.py` runs both pipelines on a fixed 100-image
fixture and asserts:

- Tokenizer vocabulary identical (set equality).
- Image preprocessing tensor-equal (`np.allclose`, atol=1e-5).
- Model output logits equal at fixed weights (atol=1e-4).
- Captions on 20 fixed images byte-equal between notebook and module path.
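
The four checks above could be expressed roughly as follows. This is a dependency-free sketch, not the contents of `scripts/notebook_module_audit.py`: the `allclose` helper mirrors the `|a - b| <= atol + rtol * |b|` rule that `np.allclose` applies, tensors are flattened to plain lists, and the dictionary keys and sample values are hypothetical.

```python
from typing import List, Mapping, Sequence


def allclose(a: Sequence[float], b: Sequence[float], atol: float, rtol: float = 1e-5) -> bool:
    """Element-wise |a - b| <= atol + rtol * |b|, the rule np.allclose uses."""
    return len(a) == len(b) and all(
        abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b)
    )


def check_parity(notebook: Mapping, module: Mapping) -> List[str]:
    """Return the names of the failed checks; an empty list means parity."""
    failures = []
    if set(notebook["vocab"]) != set(module["vocab"]):
        failures.append("vocab")
    if not allclose(notebook["pixels"], module["pixels"], atol=1e-5):
        failures.append("pixels")
    if not allclose(notebook["logits"], module["logits"], atol=1e-4):
        failures.append("logits")
    # Byte equality: compare the encoded captions, not normalised text.
    if [c.encode("utf-8") for c in notebook["captions"]] != [
        c.encode("utf-8") for c in module["captions"]
    ]:
        failures.append("captions")
    return failures


notebook_out = {"vocab": ["a", "dog", "runs"], "pixels": [0.5, -1.0],
                "logits": [1.25, -0.75], "captions": ["a dog runs"]}
module_out = {"vocab": ["runs", "a", "dog"], "pixels": [0.500001, -1.0],
              "logits": [1.25005, -0.75], "captions": ["a dog runs"]}
print(check_parity(notebook_out, module_out))  # []
```

An audit script would then assert that the returned list is empty and report the failed check names otherwise.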

### Phase 1b: Quality improvements (only after parity is green)

1. Masked accuracy metric (notebook tracks loss only).
2. Beam search inference.
3. Warmup + cosine LR schedule (replaces bare Adam).
4. CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
5. `vocab.json` sidecar alongside `vocab.pkl`.
6. Label smoothing.
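
For item 3, one common formulation of a warmup + cosine schedule is linear warmup to a peak rate followed by half-cosine decay. The plan does not fix the schedule's parameters, so the function name, arguments, and numbers below are assumptions for illustration, written in plain Python rather than as a Keras schedule object.

```python
import math


def warmup_cosine_lr(
    step: int, peak_lr: float, warmup_steps: int, total_steps: int, min_lr: float = 0.0
) -> float:
    """Learning rate at a 0-indexed training step."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values: 100 warmup steps out of 1,000, peak rate 1e-3.
for s in (0, 99, 100, 550, 1000):
    print(s, warmup_cosine_lr(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000))
```

The rate rises from `peak_lr / warmup_steps` to `peak_lr`, sits at half of `peak_lr` midway through the decay phase, and ends at `min_lr`.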

---

## 3. Implementation Roadmap

| Phase | Deliverable | Effort | Recruiter signal |
|---|---|---|---|
| **0** | Repo bootstrap (this phase) | 3 hrs | Clean repo, lint passes from commit 1 |
| **1** | Modular ML core + backend MVP | ~15 hrs | Working FastAPI for the IEEE model, runnable via `docker compose up` |
| **2** | CI/CD + first deploy (HF Space + Vercel) | ~12 hrs | Live demo URL on LinkedIn |
| **3** | Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo | ~20 hrs | The screenshot recruiters share |
| **4** | Polish + observability (Sentry, Prometheus, ADRs) | ~8 hrs | Reads as production-grade, not a research one-off |

### Future work (out of scope for v1)

- **Tier 2**: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
- **Tier 3**: Anthropic Claude vision endpoint as a "Frontier" tab.
- **Tier 4**: VQA "Ask the image" extension reusing Tier 3 infra.
- Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.

---

## 4. Deployment Stack (free-tier)

| Layer | Service | Why |
|---|---|---|
| Backend hosting | HuggingFace Spaces (Docker SDK, free CPU) | 16 GB RAM, ML-native, recruiter-clickable |
| Frontend hosting | Vercel free | Next.js native; per-PR preview URLs |
| Model artefacts | HuggingFace Hub | Free, unlimited public, versioned, model cards |
| Experiment tracking | MLflow on DagsHub free | Public read-only tracking server |
| Errors | Sentry free (5k errors/mo) | |
| Uptime | UptimeRobot free | Doubles as HF Space wake-up keeper |
| Domain | None (use `*.hf.space` and `*.vercel.app`) | $0 budget |

---

## 5. Trade-offs Decided

| Decision | Alternative rejected | Reason |
|---|---|---|
| FastAPI | Flask | Async, OpenAPI, Pydantic, lifespan |
| Next.js 14 App Router | Streamlit | Streamlit screams "research demo" |
| TanStack Query | Redux | Server state belongs in a server-state lib |
| YAML + Pydantic | Hydra | Hydra is overkill for 1–3 active configs |
| MLflow on DagsHub | W&B | DagsHub public free; no recruiter login |
| Keep TextVectorization | HF tokenizer in v1 | Changes vocab → breaks paper parity |
| Verbatim refactor first | Clean rewrite | IEEE BLEU reproducibility non-negotiable |
| `tensorflow-cpu==2.15.0` pinned | Floating TF | TF 2.16 broke Keras 2 compat with notebook |
| HF Spaces backend | Fly.io paid | Free-tier-only constraint |
| Multipart uploads | Base64 in JSON | 33% overhead, no streaming |
| `--workers 1` uvicorn | Multi-worker | TF graph + InceptionV3 ×N OOMs |
| Tier 1 only (HF baselines) | Tier 2/3/4 in v1 | User selected Tier 1; others as future work |
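
The 33% figure in the Base64 row follows from the encoding itself: every 3 input bytes become 4 output characters, so padded output is `4 * ceil(n / 3)` bytes. A quick check, using a zero-filled buffer as a stand-in for an uploaded image (the 300 kB size is arbitrary):

```python
import base64


def base64_size(n_bytes: int) -> int:
    """Length of padded Base64 output for n_bytes of input: 4 * ceil(n / 3)."""
    return 4 * ((n_bytes + 2) // 3)


payload = bytes(300_000)  # stand-in for a 300 kB image upload
encoded = base64.b64encode(payload)
overhead = len(encoded) / len(payload) - 1

print(f"{len(encoded)} bytes, {overhead:.1%} overhead")  # 400000 bytes, 33.3% overhead
```

A multipart upload sends the raw bytes plus a small fixed amount of boundary and header text, so it avoids this proportional cost.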

---

## 6. Verification Plan

**Phase 1**:

- `pytest tests/ -v` → all green; coverage ≥ 70% on `src/captioning/`.
- `python scripts/notebook_module_audit.py` → parity assertions all pass.
- `docker compose up` → `curl -F "file=@sample.jpg" http://localhost:8000/v1/captions`
  returns valid caption JSON.

**Phase 2**:

- GitHub Actions `ci.yml` green on a PR.
- HF Space URL serves `/v1/model/info`.
- Vercel preview URL renders frontend; uploading a sample image returns a caption.

**Phase 3**:

- `GET /v1/models` returns 4 entries.
- `POST /v1/compare` returns 4 captions; total latency < 15s on HF Space CPU.
- `model-eval.yml` posts a BLEU comparison comment on a test PR.

**Phase 4**:

- `/metrics` exposes `caption_inference_seconds` histogram.
- DagsHub MLflow link shows ≥ 1 logged run with metrics.
- `make freeze-paper-notebook` fails when notebook bytes change; passes when restored.
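
A byte-level freeze check like the last item can be built on a recorded SHA-256 digest. The sketch below shows one possible shape and is not the project's actual `make` target: the function names are hypothetical, and a throwaway file stands in for the notebook so that the fail-then-pass behaviour can be demonstrated.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of the file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def notebook_is_frozen(notebook: Path, expected_digest: str) -> bool:
    """True only if the file is byte-identical to the recorded version."""
    return sha256_of(notebook) == expected_digest


# Demonstration on a temporary file standing in for the notebook.
with tempfile.TemporaryDirectory() as tmp:
    nb = Path(tmp) / "paper.ipynb"
    original = b'{"cells": []}\n'
    nb.write_bytes(original)
    recorded = sha256_of(nb)            # would be committed next to the notebook

    nb.write_bytes(original + b" ")     # a single added byte
    changed_ok = notebook_is_frozen(nb, recorded)

    nb.write_bytes(original)            # restore the original bytes
    restored_ok = notebook_is_frozen(nb, recorded)

print(changed_ok, restored_ok)  # False True
```

In CI, the check would exit non-zero whenever `notebook_is_frozen` returns `False`.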