# Phase 0 — Bootstrap (decision log)
> Phase 0 establishes the engineering scaffolding the rest of the project will
> stand on. Nothing here changes the model; everything here changes how the
> repo *looks and behaves* to the next person who clones it (including
> recruiters and CI runners).
## What this phase delivers
| Artefact | Purpose |
|---|---|
| [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb) | Renamed from `image-captionin-using-dl.ipynb` via `git mv` to preserve history. Now the canonical, frozen IEEE artefact. |
| [`notebooks/README.md`](../notebooks/README.md) | Documents the frozen-notebook policy and conventions for any new notebooks. |
| [`pyproject.toml`](../pyproject.toml) | Single source of truth for the `captioning` Python package, dependency groups, and tool config (ruff/mypy/pytest/coverage). |
| [`requirements.txt`](../requirements.txt) | Pinned runtime deps, used directly by Docker and CI (mirrors `[project.dependencies]`). |
| [`requirements-dev.txt`](../requirements-dev.txt) | Pinned dev deps (lint, type-check, test, hooks). |
| [`requirements-eval.txt`](../requirements-eval.txt) | Pinned metric deps, kept separate to avoid bloating the serving image. |
| [`.python-version`](../.python-version) | Pins Python 3.10 for `pyenv` users. |
| [`.env.example`](../.env.example) | Schema for `pydantic-settings`-loaded env vars. |
| [`.pre-commit-config.yaml`](../.pre-commit-config.yaml) | Hooks: ruff, mypy, nbstripout, prettier (frontend), gitleaks. |
| [`Makefile`](../Makefile) | Discoverable command index (`make help`). |
| [`LICENSE`](../LICENSE) | MIT license, attribution to original author. |
| [`.gitignore`](../.gitignore) | Production-grade exclusions, organised by purpose with explanatory comments. |
| [`docs/restructure-plan.md`](./restructure-plan.md) | Public-facing engineering plan for Phases 0–4. |
---
## Decisions and reasoning
### 1. Why `src/` layout over flat layout?
A flat layout (`captioning/` at repo root) lets test code accidentally import
from the working tree instead of the *installed* package. That hides bugs that
would only surface in production, where the tree layout is gone. The `src/`
layout forces every test, every script, and every import to go through the
installed package, which is exactly the path users will follow. The
[Python Packaging Authority documents the trade-offs between the two layouts](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/),
and widely used projects such as pytest, Flask, and attrs ship with a `src/` layout.
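
The mechanism can be seen in a small, self-contained sketch. It builds two throwaway repos in a temporary directory and asks Python's path finder whether the package is importable when only the repo root is searched. The helper names (`importable_from`, `make_layout`) are hypothetical and exist only for this illustration; `captioning` is the package name from `pyproject.toml`.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def importable_from(root: Path, package: str) -> bool:
    """Return True if `package` is found when only `root` is searched."""
    return PathFinder.find_spec(package, [str(root)]) is not None


def make_layout(root: Path, use_src: bool, package: str = "captioning") -> None:
    """Create an empty package in either a flat or a src/ layout."""
    base = root / "src" if use_src else root
    (base / package).mkdir(parents=True)
    (base / package / "__init__.py").write_text("")


with tempfile.TemporaryDirectory() as tmp:
    flat_repo = Path(tmp) / "flat_repo"
    src_repo = Path(tmp) / "src_repo"
    make_layout(flat_repo, use_src=False)
    make_layout(src_repo, use_src=True)

    # Flat layout: the repo root alone is enough to import the package,
    # so tests run from the root can bypass the installed copy.
    flat_visible = importable_from(flat_repo, "captioning")

    # src/ layout: the repo root exposes nothing; the package must be
    # installed (or src/ added to the path explicitly) to be importable.
    src_visible = importable_from(src_repo, "captioning")
```

With a `src/` layout, a missing or broken install fails loudly at import time instead of being masked by the working tree.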
### 2. Why `pyproject.toml` AND `requirements.txt`?
They serve different audiences:
- **`pyproject.toml`** is the *source of truth* for the package — its name,
version, abstract dependency ranges, optional extras, and tool configuration.
When you `pip install -e ".[dev]"`, this is what pip reads.
- **`requirements.txt`** is the *concretely pinned snapshot*, used by Docker
builds, CI runners, and anyone who wants `pip install -r requirements.txt`
without installing the package itself. It's regenerable from `pyproject.toml`
via `pip-compile` (from `pip-tools`), but committing it explicitly makes
installs deterministic and diffable.
Phase 5+ will switch to `pip-compile` for automated regeneration; for now,
manual mirroring is simpler and beginner-readable.
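
Manual mirroring can drift, so a small consistency check is one way to catch it. The sketch below is an assumption about how such a check might look, not something this repo ships: `mirror_problems` is a hypothetical helper, and the dependency names and versions in the example are illustrative only.

```python
from __future__ import annotations

import re

# Matches the distribution name at the start of a requirement string,
# e.g. "uvicorn[standard]==0.29.0" -> "uvicorn".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalise(name: str) -> str:
    """Normalise a distribution name (lowercase, runs of -_. become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str) -> str | None:
    """Return the normalised name on a requirement line, or None if there is none."""
    stripped = line.split("#", 1)[0].strip()
    if not stripped or stripped.startswith("-"):  # blank, comment, or pip option
        return None
    match = _NAME.match(stripped)
    return normalise(match.group(1)) if match else None


def mirror_problems(abstract: list[str], pinned_lines: list[str]) -> list[str]:
    """List ways the pinned file fails to mirror the abstract dependencies."""
    problems: list[str] = []
    pinned: set[str] = set()
    for line in pinned_lines:
        name = requirement_name(line)
        if name is None:
            continue
        pinned.add(name)
        if "==" not in line.split("#", 1)[0]:
            problems.append(f"{name}: not pinned with '=='")
    for dep in abstract:
        name = requirement_name(dep)
        if name is not None and name not in pinned:
            problems.append(f"{name}: in pyproject.toml but missing from requirements.txt")
    return problems


# Illustrative inputs only; the real lists would come from the two files.
example_abstract = ["tensorflow-cpu==2.15.0", "fastapi>=0.110", "pydantic_settings>=2"]
example_pinned = ["# runtime", "tensorflow-cpu==2.15.0", "fastapi>=0.110.0", "numpy==1.26.4"]
example_problems = mirror_problems(example_abstract, example_pinned)
```

Extra pins in `requirements.txt` (transitive dependencies such as `numpy` in the example) are allowed; only missing or unpinned entries are reported.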
### 3. Why pin `tensorflow-cpu==2.15.0` so hard?
Two independent reasons stack:
1. **`tensorflow-cpu` (not `tensorflow`)**: the GPU build pulls ~600 MB of
CUDA libraries that are useless on CPU-only HuggingFace Spaces. Installing
the CPU-only wheel keeps the serving image well under 1.5 GB.
2. **2.15 specifically**: TF 2.16 swapped to Keras 3 by default. The IEEE
notebook uses `tf.keras.layers.TextVectorization` with the Keras 2
save/load API. Upgrading silently changes vocab serialisation, which
silently changes BLEU. Pinning is the difference between
*reproducible-published-result* and *reproducibility theatre*.
When Phase 5+ migrates to a modern multimodal backbone, this pin will move
in a deliberate, tested step — not by accident.
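
A runtime guard is one way to make an accidental upgrade fail fast instead of quietly shifting metrics. This is a minimal sketch under assumptions: the helper names are hypothetical, and in a real environment the version string would come from something like `importlib.metadata.version("tensorflow-cpu")`.

```python
from __future__ import annotations


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.15.0' or '2.16.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def ships_keras2_by_default(tf_version: str) -> bool:
    """TensorFlow releases before 2.16 default to Keras 2; 2.16+ default to Keras 3."""
    return parse_major_minor(tf_version) < (2, 16)


def check_pin(tf_version: str, expected: str = "2.15.0") -> None:
    """Raise if the installed version differs from the pinned one."""
    if tf_version != expected:
        raise RuntimeError(
            f"TensorFlow {tf_version} found, expected {expected}; "
            "caption metrics may not match the published numbers."
        )


# Example with literal version strings (no TensorFlow import needed here).
pinned_ok = ships_keras2_by_default("2.15.0")
upgraded_ok = ships_keras2_by_default("2.16.1")
```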
### 4. Why Ruff over Black + isort + flake8?
Ruff replaces all three with one tool that its maintainers benchmark at
10-100x faster, reads config from a single `[tool.ruff]` table in
`pyproject.toml`, and ships its own formatter (`ruff format`) designed as a
drop-in replacement for Black; on Black-formatted code the output is nearly,
though not always, identical. One install, one config, one cache. Recruiters
reading the repo see a current, widely adopted tool; CI runs faster;
`make format` is one command, not three.
### 5. Why `nbstripout` is non-negotiable in pre-commit
Notebook outputs include base64-encoded images, full DataFrames, and
sometimes credentials printed by accident. Committed notebook diffs without
output stripping are unreadable (`+aaaaaaaaaa[base64]+aaaaa…`) and
occasionally leak data. `nbstripout` strips the outputs (and execution
counts) from every code cell on commit, keeping notebook history clean and
reviewable.
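
Conceptually, stripping amounts to clearing two fields on every code cell of the notebook JSON. The sketch below illustrates that idea on an in-memory nbformat v4 dictionary; it is not `nbstripout`'s actual implementation, and `strip_outputs` is a hypothetical helper.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat v4 notebook with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop images, tables, printed text
            cell["execution_count"] = None  # drop run-order noise from diffs
    return cleaned


# A tiny notebook with one executed code cell and one markdown cell.
example_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["plot_caption(image)"],
            "outputs": [
                {
                    "output_type": "display_data",
                    "data": {"image/png": "iVBORw0KGgoAAAANSUhEUg..."},
                    "metadata": {},
                }
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["## Results"]},
    ],
}

stripped_notebook = strip_outputs(example_notebook)
```

The source cells are untouched, so the notebook still runs; only the rendered results disappear from the committed file.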
### 6. Why include a `Makefile` on a Windows project?
Three reasons:
1. **CI runs on Linux**: every CI job uses the same Make targets, so the
commands you run locally match what CI runs.
2. **Discoverability**: `make help` is one command that prints every
high-level operation with a one-line description. A new contributor (or
recruiter cloning the repo) sees the entire workflow in one screen.
3. **Tooling availability**: Make is a 5-second install on Windows
(`winget install GnuWin32.Make`, Git Bash, or WSL). PowerShell users who
skip Make can still read the Makefile and run the underlying commands
directly.
### 7. Why a `freeze-paper-notebook` Make target?
The IEEE paper points reviewers at the notebook. If the notebook drifts from
what the paper describes, reviewers running it will see numbers that don't
match the paper — and that's a scientific integrity issue, not a software
issue. The target hashes the notebook and asserts it matches a locked
SHA-256. Phase 4 wires this into CI as a required check on `main`.
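
The hash-and-assert step can be sketched in a few lines of standard-library Python. The function names and the lock-file format (hex digest as the first whitespace-separated token, as `sha256sum` writes it) are assumptions for illustration; the actual Make target may be implemented differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_frozen(notebook: Path, lock_file: Path) -> None:
    """Exit with an error message if the notebook no longer matches the locked digest."""
    # Assumed lock format: "<hex digest>  <optional filename>".
    expected = lock_file.read_text().split()[0]
    actual = sha256_of(notebook)
    if actual != expected:
        raise SystemExit(
            f"{notebook} has drifted: expected {expected}, got {actual}"
        )
```

Raising `SystemExit` with a message yields a non-zero exit status, which is what lets Make and CI treat drift as a failed check.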
### 8. Why split optional deps into `[hf]`, `[eval]`, `[mlflow]`, `[dev]`?
The slim production image (`backend:latest`) does NOT need transformers,
torch, pycocoevalcap, or MLflow. Bundling them adds ~1.5 GB of dependencies
the production code never imports. Extras let `pip install -e ".[hf]"` add
the HuggingFace baselines for the Phase 3 comparison demo, while
`pip install -r requirements.txt` keeps the production install lean.
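
Code that depends on an extra can check for it up front and fail with an actionable message. The sketch below is hypothetical: the mapping from extra name to importable modules is an assumption based on the packages named above, and `require_extra` is not an existing function in this repo.

```python
from __future__ import annotations

from importlib.util import find_spec

# Assumed mapping from extra name to the top-level modules it provides.
EXTRA_MODULES: dict[str, list[str]] = {
    "hf": ["transformers", "torch"],
    "eval": ["pycocoevalcap"],
    "mlflow": ["mlflow"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for `extra` that cannot be found in this environment."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


def require_extra(extra: str) -> None:
    """Raise ImportError with an install hint if any module for `extra` is absent."""
    missing = missing_modules(extra)
    if missing:
        raise ImportError(
            f"Missing {', '.join(missing)}; install with: pip install -e \".[{extra}]\""
        )
```

Calling `require_extra("hf")` at the top of the baseline-comparison code keeps the heavy imports out of the serving path while still giving a clear error when the extra is absent.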
### 9. Why MIT license?
The IEEE paper is published under IEEE's standard terms; the *code* is
covered separately. MIT is one of the most permissive widely recognised licenses:
it lets recruiters, students, and other researchers freely fork, learn from,
and extend the code. For a recruiter-grade portfolio project, permissive
licensing signals "I want this work to be useful," which is the right tone.
### 10. Why folder name `configs/` (plural), not `config/` (singular)?
`config/` was the empty folder shipped with the template. The plural form
`configs/` is a common convention in Python ML projects (Detectron2,
MMDetection, the Lightning-Hydra template) because
it holds multiple files (one per environment, model variant, or run).
Phase 1 creates `configs/` with content; the empty `config/` folder will
be removed in the Phase 1 commit that introduces the YAML files.
---
## What this phase deliberately does NOT do
- **No code is moved out of the notebook yet.** That's Phase 1, behind a
parity validation gate.
- **No `src/captioning/` modules are created.** Empty `__init__.py` files
would just be churn; Phase 1 will create them with real code.
- **No Dockerfile or docker-compose.yml.** They depend on `backend/app/`
existing; both arrive in Phase 1.
- **No GitHub Actions workflows.** They live in Phase 2, after there is
Python code to lint and type-check.
- **No README rewrite.** The current README accurately describes the
research; the demo-link rewrite happens in Phase 2 once a live URL exists.
This restraint is deliberate. Each phase ships a coherent slice of value;
running ahead would create half-built features and vague commits.
---
## Local setup checklist for the developer
After pulling this commit, on a fresh dev box:
```bash
# 1. Create a Python 3.10 virtual environment (assumes `python` resolves to 3.10).
python -m venv .venv
.venv\Scripts\activate # Windows (PowerShell or cmd)
# source .venv/bin/activate # Linux/macOS
# 2. Install dev dependencies + the package (editable).
make install-dev
# Or, without Make:
# pip install -r requirements-dev.txt -r requirements-eval.txt
# pip install -e ".[hf,mlflow]"
# 3. Register pre-commit hooks.
make install-hooks
# Or: pre-commit install
# 4. (Optional) Lock the paper notebook's hash, so CI can enforce parity.
make lock-paper-notebook
# 5. Verify everything works.
make pre-commit # Run all hooks against all files
make test # No tests yet — exits cleanly with "no tests collected"
```
The first `make install-dev` will take a few minutes (TensorFlow is large).
Subsequent runs hit the wheel cache and complete in seconds.