image-captioning-api / docs /PHASE_0_NOTES.md

apoorvrajdev

feat: bootstrap production-grade ML repository tooling

b2594db 27 days ago

Phase 0 — Bootstrap (decision log)

Phase 0 establishes the engineering scaffolding the rest of the project will stand on. Nothing here changes the model; everything here changes how the repo looks and behaves to the next person who clones it (including recruiters and CI runners).

What this phase delivers

| Artefact | Purpose |
| --- | --- |
| `notebooks/01_ieee_inceptionv3_transformer.ipynb` | Renamed from `image-captionin-using-dl.ipynb` via `git mv` to preserve history. Now the canonical, frozen IEEE artefact. |
| `notebooks/README.md` | Documents the frozen-notebook policy and conventions for any new notebooks. |
| `pyproject.toml` | Single source of truth for the `captioning` Python package, dependency groups, and tool config (ruff/mypy/pytest/coverage). |
| `requirements.txt` | Pinned runtime deps, used directly by Docker and CI (mirrors `[project.dependencies]`). |
| `requirements-dev.txt` | Pinned dev deps (lint, type-check, test, hooks). |
| `requirements-eval.txt` | Pinned metric deps, kept separate to avoid bloating the serving image. |
| `.python-version` | Pins Python 3.10 for `pyenv` users. |
| `.env.example` | Schema for `pydantic-settings`-loaded env vars. |
| `.pre-commit-config.yaml` | Hooks: ruff, mypy, nbstripout, prettier (frontend), gitleaks. |
| `Makefile` | Discoverable command index (`make help`). |
| `LICENSE` | MIT license, attribution to original author. |
| `.gitignore` | Production-grade exclusions, organised by purpose with explanatory comments. |
| `docs/restructure-plan.md` | Public-facing engineering plan for Phases 0–4. |

Decisions and reasoning

1. Why `src/` layout over flat layout?

A flat layout (`captioning/` at the repo root) lets test code accidentally import from the working tree instead of the installed package. That hides bugs that only surface in production, where the source tree is gone (a module or data file missing from the built wheel, for example). The `src/` layout takes the package off the default import path, so every test, script, and import goes through the installed package, which is exactly the path users will follow. The Python Packaging User Guide's tutorial uses this layout, and so do projects such as pytest, Flask, and attrs.
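One way to make this property checkable is a small guard that reports where a package would be imported from. The sketch below is illustrative and not part of the repo: `import_origin` and `is_inside` are hypothetical helper names, and the only project-specific assumption is that the package is called `captioning`.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def import_origin(module_name: str) -> Optional[Path]:
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    # Missing modules give spec=None; namespace packages have no origin.
    if spec is None or spec.origin is None:
        return None
    # Built-in and frozen modules report a label rather than a file path.
    if spec.origin in ("built-in", "frozen"):
        return None
    return Path(spec.origin).resolve()


def is_inside(path: Path, root: Path) -> bool:
    """True if `path` lies under the directory `root`."""
    try:
        path.resolve().relative_to(root.resolve())
        return True
    except ValueError:
        return False
```

In a test, one could assert that `import_origin("captioning")` is not `None` and is not inside a stray top-level `captioning/` directory at the repo root.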

2. Why `pyproject.toml` AND `requirements.txt`?

They serve different audiences:

- `pyproject.toml` is the source of truth for the package: its name, version, abstract dependency ranges, optional extras, and tool configuration. When you run `pip install -e ".[dev]"`, this is what pip reads.
- `requirements.txt` is the concretely pinned snapshot, used by Docker builds, CI runners, and anyone who wants `pip install -r requirements.txt` without installing the package itself. It can be regenerated from `pyproject.toml` via `pip-compile`, but committing it explicitly makes installs deterministic and diffable.

Phase 5+ will switch to pip-compile for automated regeneration; for now, manual mirroring is simpler and beginner-readable.
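While the mirroring is manual, a small check can catch entries that slip into `requirements.txt` without an exact pin. This is a minimal sketch, not the project's tooling: `unpinned_lines` is a hypothetical helper, the regex covers only simple `name==version` lines (optionally with extras), and `example-package` is a made-up entry for illustration.

```python
import re
from typing import List

# Matches "name==version", allowing extras such as "name[extra]==1.2.3".
_PIN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s;]+")


def unpinned_lines(requirements_text: str) -> List[str]:
    """Return requirement lines that are not pinned with an exact '=='."""
    offenders = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        if not _PIN.match(line):
            offenders.append(line)
    return offenders


sample = "tensorflow-cpu==2.15.0\nexample-package>=1.0  # hypothetical unpinned entry\n"
print(unpinned_lines(sample))  # ['example-package>=1.0']
```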

3. Why pin `tensorflow-cpu==2.15.0` so hard?

Two independent reasons stack:

- `tensorflow-cpu` (not `tensorflow`): the GPU build pulls ~600 MB of CUDA libraries that are useless on CPU-only HuggingFace Spaces. Using the CPU-only wheel keeps the serving image well under 1.5 GB.
- 2.15 specifically: TF 2.16 switched to Keras 3 by default. The IEEE notebook uses `tf.keras.layers.TextVectorization` with the Keras 2 save/load API. Upgrading silently changes vocab serialisation, which silently changes BLEU. Pinning is the difference between a reproducible published result and reproducibility theatre.

When Phase 5+ migrates to a modern multimodal backbone, this pin will move in a deliberate, tested step — not by accident.
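A pin in `requirements.txt` only protects environments built from that file, so a start-up guard can make an accidental upgrade fail loudly. The following is a sketch under assumptions: the helper names are hypothetical, the distribution name `tensorflow-cpu` comes from the pin above, and the 2.16 cutoff reflects the Keras 3 default switch described in this section.

```python
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '2.15.0' (or '2.15.0rc1') into (2, 15)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def defaults_to_keras2(tf_version: str) -> bool:
    """True for TensorFlow releases before 2.16, where Keras 3 became the default."""
    return major_minor(tf_version) < (2, 16)


def installed_tf_version(dist_name: str = "tensorflow-cpu") -> Optional[str]:
    """Return the installed version of the distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A serving entry point could call `installed_tf_version()` and refuse to start when the result is present but `defaults_to_keras2(...)` is false.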

4. Why Ruff over Black + isort + flake8?

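Ruff replaces all three with one tool that runs 10 to 100x faster, reads its config from a single `[tool.ruff]` table in `pyproject.toml`, and ships its own formatter (`ruff format`) designed as a drop-in replacement for Black. Its output is near-identical rather than byte-identical: Ruff's documentation reports over 99.9% of lines formatted identically on large Black-formatted projects, with a short list of documented deviations. One install, one config, one cache. Recruiters reading the repo see a modern Python toolchain; CI runs faster; `make format` is one command, not three.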

5. Why `nbstripout` is non-negotiable in pre-commit

Notebook outputs include base64-encoded images, full DataFrames, and sometimes credentials printed by accident. Without output stripping, committed notebook diffs are unreadable (+aaaaaaaaaa[base64]+aaaaa…) and occasionally leak data. `nbstripout` strips cell outputs and execution counts on commit, keeping notebook history clean and reviewable.
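To make the effect concrete, the sketch below clears outputs from a notebook held as a plain nbformat-4 dictionary. It illustrates the idea only and is not how `nbstripout` is implemented; `strip_outputs` and the tiny in-memory notebook are hypothetical.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


nb = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print('hi')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
}
print(json.dumps(strip_outputs(nb)["cells"][1]["outputs"]))  # []
```

Only the source cells survive in the diff, which is what keeps reviews readable.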

6. Why include a `Makefile` on a Windows project?

Three reasons:

- CI runs on Linux: every CI job uses the same Make targets, so the commands you run locally match what CI runs.
- Discoverability: `make help` is one command that prints every high-level operation with a one-line description. A new contributor (or recruiter cloning the repo) sees the entire workflow on one screen.
- Tooling availability: Make is a quick install on Windows (`winget install GnuWin32.Make`, Git Bash, or WSL). PowerShell users who skip Make can still read the Makefile and run the underlying commands directly.

7. Why a `freeze-paper-notebook` Make target?

The IEEE paper points reviewers at the notebook. If the notebook drifts from what the paper describes, reviewers running it will see numbers that don't match the paper — and that's a scientific integrity issue, not a software issue. The target hashes the notebook and asserts it matches a locked SHA-256. Phase 4 wires this into CI as a required check on main.
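The hash-and-compare step can be sketched in a few lines of standard-library Python. This is not the Makefile's actual recipe: `sha256_of` and `assert_frozen` are hypothetical names, and the lock-file format (hex digest as the first whitespace-separated token, as `sha256sum` prints it) is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_frozen(notebook: Path, lock_file: Path) -> None:
    """Exit with an error if the notebook's hash differs from the locked one."""
    # Assumed lock format: the digest is the first token in the file.
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    actual = sha256_of(notebook)
    if actual != expected:
        raise SystemExit(
            f"{notebook} has drifted: expected {expected}, got {actual}"
        )
```

Because the comparison is on raw bytes, any change to the file, including re-saving with different outputs or metadata, counts as drift.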

8. Why split optional deps into `[hf]`, `[eval]`, `[mlflow]`, `[dev]`?

The slim production image (backend:latest) does NOT need transformers, torch, pycocoevalcap, or MLflow. Bundling them adds ~1.5 GB of dependencies the production code never imports. Extras let pip install -e ".[hf]" add the HuggingFace baselines for the Phase 3 comparison demo, while pip install -r requirements.txt keeps the production install lean.
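Code that touches an optional dependency can import it lazily and point at the right extra when it is missing. The helper below is a hypothetical sketch, not existing project code; the extra names it would be called with (`hf`, `eval`, `mlflow`) are the ones listed in this section.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Re-raise unrelated failures (a missing transitive import) untouched.
        if exc.name not in (module_name, module_name.split(".")[0]):
            raise
        raise ModuleNotFoundError(
            f"'{module_name}' is not installed. "
            f'Install it with: pip install -e ".[{extra}]"'
        ) from exc
```

For example, a baseline-comparison script might call `require_extra("transformers", "hf")` at the point of use, so the lean production install never attempts the import.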

9. Why MIT license?

The IEEE paper is published under IEEE's standard terms; the code is covered separately. MIT is one of the most permissive widely recognised licenses: it lets recruiters, students, and other researchers freely fork, learn from, and extend the code. For a recruiter-grade portfolio project, permissive licensing signals "I want this work to be useful," which is the right tone.

10. Why folder name `configs/` (plural), not `config/` (singular)?

`config/` was the empty folder shipped with the template. The plural form `configs/` is a common convention in Python ML projects (Detectron2, MMDetection, and the lightning-hydra-template all use it) because the folder holds multiple files (one per environment, model variant, or run). Phase 1 creates `configs/` with content; the empty `config/` folder will be removed in the Phase 1 commit that introduces the YAML files.

What this phase deliberately does NOT do

- No code is moved out of the notebook yet. That's Phase 1, behind a parity validation gate.
- No `src/captioning/` modules are created. Empty `__init__.py` files would just be churn; Phase 1 will create them with real code.
- No `Dockerfile` or `docker-compose.yml`. They depend on `backend/app/` existing; both arrive in Phase 1.
- No GitHub Actions workflows. They live in Phase 2, after there is Python code to lint and type-check.
- No README rewrite. The current README accurately describes the research; the demo-link rewrite happens in Phase 2 once a live URL exists.

This restraint is deliberate. Each phase ships a coherent slice of value; running ahead would create half-built features and vague commits.

Local setup checklist for the developer

After pulling this commit, on a fresh dev box:

# 1. Create a Python 3.10 virtual environment.
python -m venv .venv
.venv\Scripts\activate              # PowerShell
# source .venv/bin/activate         # Linux/macOS

# 2. Install dev dependencies + the package (editable).
make install-dev
# Or, without Make:
#   pip install -r requirements-dev.txt -r requirements-eval.txt
#   pip install -e ".[hf,mlflow]"

# 3. Register pre-commit hooks.
make install-hooks
# Or:  pre-commit install

# 4. (Optional) Lock the paper notebook's hash, so CI can enforce parity.
make lock-paper-notebook

# 5. Verify everything works.
make pre-commit                     # Run all hooks against all files
make test                           # No tests yet — exits cleanly with "no tests collected"

The first make install-dev will take a few minutes (TensorFlow is large). Subsequent runs hit the wheel cache and complete in seconds.
