Spaces:

apoorvrajdev
/

image-captioning-api

Configuration error

App Files Files Community

image-captioning-api / docs /restructure-plan.md

apoorvrajdev

feat: bootstrap production-grade ML repository tooling

b2594db 27 days ago

preview code

raw

history blame contribute delete

9.68 kB

Production Restructuring Plan

Public, in-repo copy of the engineering plan that drives the transition from a single-notebook research project into a deployable multimodal AI platform. The original (with internal exploration notes) lives in the developer's ~/.claude/plans/ directory; this version is the canonical public artefact.

Context

This repository is the engineering home of an IEEE-published image-captioning research project. The published artefact is a single Jupyter notebook (notebooks/01_ieee_inceptionv3_transformer.ipynb) implementing InceptionV3 (frozen) + custom Keras Transformer decoder trained on COCO 2017, reporting BLEU ~24.

Goal: convert the repo into a recruiter-grade, production-style multimodal AI platform with a live free-tier demo, while preserving the IEEE notebook byte-for-byte as the canonical research artefact.

Constraints:

Hosting budget: $0/month → HuggingFace Spaces (backend) + Vercel free (frontend) + HuggingFace Hub (model artefacts) + DagsHub free MLflow.
Multimodal scope (v1): Tier 1 only — add three pretrained HuggingFace models (BLIP-base, ViT-GPT2, GIT-base-coco) for a side-by-side comparison demo. Tier 2/3/4 are listed under Future work only.

1. Folder Structure (target)

image-captioning-system/
├── notebooks/
│   └── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN
├── src/captioning/                             # Installable Python package
│   ├── config/                                 # Pydantic settings + YAML loader
│   ├── data/                                   # COCO loaders, preprocess, splits
│   ├── tokenizer/                              # CaptionTokenizer (Keras TextVectorization wrapper)
│   ├── models/                                 # CNN encoder, Transformer decoder, factory
│   ├── training/                               # Trainer, losses, metrics, callbacks
│   ├── inference/                              # Greedy + beam search predictors
│   ├── evaluation/                             # BLEU, CIDEr, METEOR, ROUGE
│   ├── io/                                     # Checkpoints, image decoding, HF Hub I/O
│   └── utils/                                  # Logging, seeding, timing
├── configs/                                    # YAML hyperparameters (validated by Pydantic)
├── scripts/                                    # CLI entrypoints (train, eval, predict, upload)
├── models/                                     # Local checkpoint registry (gitignored content)
├── backend/                                    # FastAPI service (depends on src/captioning)
├── frontend/                                   # Next.js 14 + TypeScript + Tailwind + shadcn/ui
├── tests/                                      # ML-core tests (unit + integration)
├── docs/                                       # Architecture, ADRs, results, deployment
├── .github/workflows/                          # CI, CD, model-eval
├── docker-compose.yml                          # Local dev: backend + frontend + mlflow
├── pyproject.toml                              # Single source of truth for the package
└── Makefile                                    # Discoverable command index

Key architectural rules:

src/captioning/ is the ML core; backend/app/ imports from it. Never reverse the dependency.
The IEEE notebook is frozen — make freeze-paper-notebook is a CI check that fails on any byte change.
Model weights are never committed; they live in HuggingFace Hub (yourname/captioning-weights) and are downloaded at backend startup.
Configuration is YAML files validated by Pydantic v2 BaseSettings, not Hydra. Env vars override via CAPTIONING__TRAIN__BATCH_SIZE=32 syntax.

2. Migration Strategy

Approach: verbatim refactor first, improvements second. Reproducibility of the IEEE BLEU score is non-negotiable; behaviour parity must be proven before any improvement is made.

Phase 1a — "Lift and shift" (parity goal: BLEU within ±0.3 of notebook)

Step	Notebook cell	Target module
1	Hyperparams	`configs/base.yaml` + `src/captioning/config/schema.py`
2	Caption preprocess	`data/preprocess.py::preprocess_caption`
3	COCO loader	`data/coco.py::load_coco_annotations`
4	Tokenizer	`tokenizer/vectorizer.py::CaptionTokenizer`
5	Splits	`data/splits.py::make_splits(seed=...)`
6	Image preprocess	`data/preprocess.py::preprocess_image`
7	tf.data pipeline	`data/pipeline.py::build_{train,val}_pipeline`
8	Augmentation	`data/augmentation.py::default_augmentation`
9	InceptionV3 encoder	`models/encoder_cnn.py`
10	Transformer encoder	`models/transformer_encoder.py`
11	Embeddings	`models/embeddings.py`
12	Transformer decoder	`models/transformer_decoder.py`
13	Captioning model	`models/captioning_model.py`
14	Wiring	`models/factory.py::build_caption_model(config)`
15	Loss + compile	`training/losses.py` + `training/trainer.py`
16	Fit	`training/trainer.py::Trainer.fit`
17	Inference	`inference/greedy.py`, `inference/predictor.py`
18	Save weights	`io/checkpoints.py` + `scripts/train.py`

Parity validation gate

scripts/notebook_module_audit.py runs both pipelines on a fixed 100-image fixture and asserts:

Tokenizer vocabulary identical (set equality).
Image preprocessing tensor-equal (np.allclose, atol=1e-5).
Model output logits equal at fixed weights (atol=1e-4).
Captions on 20 fixed images byte-equal between notebook and module path.

Phase 1b — Quality improvements (only after parity is green)

Masked accuracy metric (notebook tracks loss only).
Beam search inference.
Warmup + cosine LR schedule (replaces bare Adam).
CIDEr / METEOR / ROUGE-L (paper reports BLEU only).
vocab.json sidecar alongside vocab.pkl.
Label smoothing.

3. Implementation Roadmap

Phase	Deliverable	Effort	Recruiter signal
0	Repo bootstrap (this phase)	3 hrs	Clean repo, lint passes from commit 1
1	Modular ML core + backend MVP	~15 hrs	Working FastAPI for the IEEE model, runnable via `docker compose up`
2	CI/CD + first deploy (HF Space + Vercel)	~12 hrs	Live demo URL on LinkedIn
3	Tier 1 multimodal: BLIP/ViT-GPT2/GIT comparison demo	~20 hrs	The screenshot recruiters share
4	Polish + observability (Sentry, Prometheus, ADRs)	~8 hrs	Reads as production-grade, not a research one-off

Future work (out of scope for v1)

Tier 2: ViT + Transformer fine-tune on COCO via Kaggle GPU (BLEU 24 → 32+).
Tier 3: Anthropic Claude vision endpoint as a "Frontier" tab.
Tier 4: VQA "Ask the image" extension reusing Tier 3 infra.
Self-hosted compose on a VPS with Caddy TLS and DVC dataset versioning.

4. Deployment Stack (free-tier)

Layer	Service	Why
Backend hosting	HuggingFace Spaces (Docker SDK, free CPU)	16 GB RAM, ML-native, recruiter-clickable
Frontend hosting	Vercel free	Next.js native; per-PR preview URLs
Model artefacts	HuggingFace Hub	Free, unlimited public, versioned, model cards
Experiment tracking	MLflow on DagsHub free	Public read-only tracking server
Errors	Sentry free (5k errors/mo)
Uptime	UptimeRobot free	Doubles as HF Space wake-up keeper
Domain	None (use `.hf.space` and `.vercel.app`)	$0 budget

5. Trade-offs Decided

Decision	Alternative rejected	Reason
FastAPI	Flask	Async, OpenAPI, Pydantic, lifespan
Next.js 14 App Router	Streamlit	Streamlit screams "research demo"
TanStack Query	Redux	Server state belongs in a server-state lib
YAML + Pydantic	Hydra	Hydra is overkill for 1–3 active configs
MLflow on DagsHub	W&B	DagsHub public free; no recruiter login
Keep TextVectorization	HF tokenizer in v1	Changes vocab → breaks paper parity
Verbatim refactor first	Clean rewrite	IEEE BLEU reproducibility non-negotiable
`tensorflow-cpu==2.15.0` pinned	Floating TF	TF 2.16 broke Keras 2 compat with notebook
HF Spaces backend	Fly.io paid	Free-tier-only constraint
Multipart uploads	Base64 in JSON	33% overhead, no streaming
`--workers 1` uvicorn	Multi-worker	TF graph + InceptionV3 ×N OOMs
Tier 1 only (HF baselines)	Tier 2/3/4 in v1	User selected Tier 1; others as future work

6. Verification Plan

Phase 1:

pytest tests/ -v → all green; coverage ≥ 70% on src/captioning/.
python scripts/notebook_module_audit.py → parity assertions all pass.
docker compose up → curl -F "file=@sample.jpg" http://localhost:8000/v1/captions returns valid caption JSON.

Phase 2:

GitHub Actions ci.yml green on a PR.
HF Space URL serves /v1/model/info.
Vercel preview URL renders frontend; uploading a sample image returns a caption.

Phase 3:

GET /v1/models returns 4 entries.
POST /v1/compare returns 4 captions; total latency < 15s on HF Space CPU.
model-eval.yml posts a BLEU comparison comment on a test PR.

Phase 4:

/metrics exposes caption_inference_seconds histogram.
DagsHub MLflow link shows ≥ 1 logged run with metrics.
make freeze-paper-notebook fails when notebook bytes change; passes when restored.