# Phase 0: Bootstrap (decision log)
> Phase 0 establishes the engineering scaffolding the rest of the project will
> stand on. Nothing here changes the model; everything here changes how the
> repo *looks and behaves* to the next person who clones it (including
> recruiters and CI runners).
## What this phase delivers
| Artefact | Purpose |
|---|---|
| [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb) | Renamed from `image-captionin-using-dl.ipynb` via `git mv` to preserve history. Now the canonical, frozen IEEE artefact. |
| [`notebooks/README.md`](../notebooks/README.md) | Documents the frozen-notebook policy and conventions for any new notebooks. |
| [`pyproject.toml`](../pyproject.toml) | Single source of truth for the `captioning` Python package, dependency groups, and tool config (ruff/mypy/pytest/coverage). |
| [`requirements.txt`](../requirements.txt) | Pinned runtime deps, used directly by Docker and CI (mirrors `[project.dependencies]`). |
| [`requirements-dev.txt`](../requirements-dev.txt) | Pinned dev deps (lint, type-check, test, hooks). |
| [`requirements-eval.txt`](../requirements-eval.txt) | Pinned metric deps, kept separate to avoid bloating the serving image. |
| [`.python-version`](../.python-version) | Pins Python 3.10 for `pyenv` users. |
| [`.env.example`](../.env.example) | Schema for `pydantic-settings`-loaded env vars. |
| [`.pre-commit-config.yaml`](../.pre-commit-config.yaml) | Hooks: ruff, mypy, nbstripout, prettier (frontend), gitleaks. |
| [`Makefile`](../Makefile) | Discoverable command index (`make help`). |
| [`LICENSE`](../LICENSE) | MIT license, attribution to original author. |
| [`.gitignore`](../.gitignore) | Production-grade exclusions, organised by purpose with explanatory comments. |
| [`docs/restructure-plan.md`](./restructure-plan.md) | Public-facing engineering plan for Phases 0-4. |
---
## Decisions and reasoning
### 1. Why `src/` layout over flat layout?
A flat layout (`captioning/` at repo root) lets test code accidentally import
from the working tree instead of the *installed* package. That hides bugs that
would only surface in production, where the tree layout is gone. The `src/`
layout forces every test, every script, and every import to go through the
installed package, which is exactly the path users will follow. The
[Python Packaging Authority's guide](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)
documents these advantages, and established projects such as Flask, pytest,
and attrs use this layout.
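
The difference can be reproduced in a throwaway directory. The sketch below is illustrative only: `demo_captioning_pkg` is a hypothetical package name (not the project's real package), and nothing is installed, so the only way the import can succeed is by leaking in from the working tree.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PACKAGE = "demo_captioning_pkg"  # hypothetical name, chosen to avoid clashes


def make_repo(root: Path, layout: str) -> Path:
    """Create a throwaway repo with the package in a flat or src/ layout."""
    pkg_parent = root / "src" if layout == "src" else root
    pkg_dir = pkg_parent / PACKAGE
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text("ORIGIN = 'working tree'\n")
    return root


def importable_from_repo_root(repo_root: Path) -> bool:
    """Return True if the package imports with only the repo root on sys.path."""
    sys.path.insert(0, str(repo_root))
    sys.modules.pop(PACKAGE, None)
    importlib.invalidate_caches()
    try:
        importlib.import_module(PACKAGE)
        return True
    except ModuleNotFoundError:
        return False
    finally:
        sys.path.remove(str(repo_root))
        sys.modules.pop(PACKAGE, None)


with tempfile.TemporaryDirectory() as tmp:
    flat_repo = make_repo(Path(tmp) / "flat_repo", "flat")
    src_repo = make_repo(Path(tmp) / "src_repo", "src")
    flat_leaks = importable_from_repo_root(flat_repo)
    src_leaks = importable_from_repo_root(src_repo)

print(f"flat layout importable without install: {flat_leaks}")
print(f"src layout importable without install:  {src_leaks}")
```

With the flat layout the uninstalled package is importable as soon as the repo root is on `sys.path`, which is what happens when tests run from the repo root. With the `src/` layout the same import fails until the package is actually installed.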
### 2. Why `pyproject.toml` AND `requirements.txt`?
They serve different audiences:
- **`pyproject.toml`** is the *source of truth* for the package: its name,
version, abstract dependency ranges, optional extras, and tool configuration.
When you `pip install -e ".[dev]"`, this is what pip reads.
- **`requirements.txt`** is the *concretely pinned snapshot*, used by Docker
builds, CI runners, and anyone who wants `pip install -r requirements.txt`
without installing the package itself. It's regenerable from `pyproject.toml`
via `pip-compile`, but committing it explicitly makes installs deterministic
and diffable.
Phase 5+ will switch to `pip-compile` for automated regeneration; for now,
manual mirroring is simpler and beginner-readable.
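
While the mirroring is manual, a small check can catch a dependency that was added to `pyproject.toml` but never pinned. The following is a minimal sketch, not part of the repo: the helper names are hypothetical, the dependency list is passed in as plain strings rather than parsed from `pyproject.toml`, and the `pydantic-settings>=2.0` range is an illustrative value.

```python
import re


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Map each `name==version` line to its version; skip comments and blanks."""
    pins: dict[str, str] = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        name, sep, version = line.partition("==")
        if sep:
            pins[name.strip().lower()] = version.strip()
    return pins


def project_name(dependency: str) -> str:
    """Strip version specifiers, extras, and markers from an abstract dependency."""
    return re.split(r"[<>=!~\[; ]", dependency, maxsplit=1)[0].strip().lower()


def missing_pins(abstract_deps: list[str], pins: dict[str, str]) -> list[str]:
    """Return the abstract dependencies that have no pinned counterpart."""
    return [name for name in map(project_name, abstract_deps) if name not in pins]


# Illustrative inputs; only the tensorflow-cpu pin comes from this document.
abstract = ["tensorflow-cpu==2.15.0", "pydantic-settings>=2.0"]
requirements = """
# runtime pins
tensorflow-cpu==2.15.0
"""

missing = missing_pins(abstract, parse_pins(requirements))
print("missing pins:", missing)
```

A check like this could run in CI until `pip-compile` takes over regeneration.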
### 3. Why pin `tensorflow-cpu==2.15.0` so hard?
Two independent reasons stack:
1. **`tensorflow-cpu` (not `tensorflow`)**: the GPU build pulls ~600 MB of
CUDA libraries that are useless on CPU-only HuggingFace Spaces. Choosing
the CPU-only wheel keeps the serving image well under 1.5 GB.
2. **2.15 specifically**: TF 2.16 swapped to Keras 3 by default. The IEEE
notebook uses `tf.keras.layers.TextVectorization` with the Keras 2
save/load API. Upgrading silently changes vocab serialisation, which
silently changes BLEU. Pinning is the difference between
*reproducible-published-result* and *reproducibility theatre*.
When Phase 5+ migrates to a modern multimodal backbone, this pin will move
in a deliberate, tested step, not by accident.
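
One way to make the pin fail loudly is a startup check against installed package metadata. This is a sketch under assumptions: the helper names are hypothetical, and it reads distribution metadata only, so it runs even where TensorFlow is not installed.

```python
from importlib import metadata
from typing import Optional


def matches_pin(installed: str, pinned_series: str = "2.15") -> bool:
    """Return True if the installed version is in the pinned major.minor series."""
    return installed.split(".")[:2] == pinned_series.split(".")


def installed_tf_version() -> Optional[str]:
    """Look up the TensorFlow version from package metadata, if present."""
    for dist in ("tensorflow-cpu", "tensorflow"):
        try:
            return metadata.version(dist)
        except metadata.PackageNotFoundError:
            continue
    return None


version = installed_tf_version()
if version is None:
    print("TensorFlow is not installed in this environment")
elif not matches_pin(version):
    print(f"TensorFlow {version} is outside the pinned 2.15 series")
else:
    print(f"TensorFlow {version} matches the pin")
```

Comparing only the major and minor components tolerates patch releases while still rejecting 2.16 and its Keras 3 default.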
### 4. Why Ruff over Black + isort + flake8?
Ruff replaces all three with one tool that its maintainers benchmark at
10-100x faster, reads config from a single section in `pyproject.toml`, and
ships its own formatter (`ruff format`) designed as a drop-in replacement for
Black, with near-identical output on Black-formatted code. One install, one
config, one cache. Recruiters reading the repo see a modern Python toolchain;
CI runs faster; `make format` is one command, not three.
### 5. Why `nbstripout` is non-negotiable in pre-commit
Notebook outputs include base64-encoded images, full DataFrames, and
sometimes credentials printed by accident. Committed notebook diffs without
output stripping are unreadable (`+aaaaaaaaaa[base64]+aaaaa…`) and
occasionally leak data. `nbstripout` strips cell outputs on commit,
keeping notebook history clean and reviewable.
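
To make "stripping outputs" concrete, the sketch below applies the same idea to a notebook's JSON using only the standard library. It illustrates the concept and is not `nbstripout`'s actual implementation; the sample notebook is made up.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# A made-up two-cell notebook with one noisy output.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "source": ["print(token)"],
            "outputs": [{"output_type": "stream", "text": ["secret-value\n"]}],
        },
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(sample)
print(json.dumps(cleaned["cells"][1], indent=2))
```

The source of each cell is untouched; only the parts that make diffs noisy or leak data are removed.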
### 6. Why include a `Makefile` on a Windows project?
Three reasons:
1. **CI runs on Linux**: every CI job uses the same Make targets, so the
commands you run locally match what CI runs.
2. **Discoverability**: `make help` is one command that prints every
high-level operation with a one-line description. A new contributor (or
recruiter cloning the repo) sees the entire workflow in one screen.
3. **Tooling availability**: Make is a 5-second install on Windows
(`winget install GnuWin32.Make`, Git Bash, or WSL). PowerShell users who
skip Make can still read the Makefile and run the underlying commands
directly.
### 7. Why a `freeze-paper-notebook` Make target?
The IEEE paper points reviewers at the notebook. If the notebook drifts from
what the paper describes, reviewers running it will see numbers that don't
match the paper, and that's a scientific integrity issue, not a software
issue. The target hashes the notebook and asserts it matches a locked
SHA-256. Phase 4 wires this into CI as a required check on `main`.
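
The hash-and-assert step can be sketched as follows. This is a minimal illustration, not the Make target's actual recipe: the function names and the `paper.ipynb` / `paper.ipynb.sha256` file names are hypothetical, and the demo runs against a temporary directory.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large notebooks are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_is_frozen(notebook: Path, lock_file: Path) -> bool:
    """Return True if the notebook's hash matches the locked hash."""
    return sha256_of(notebook) == lock_file.read_text().strip()


with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "paper.ipynb"             # hypothetical file name
    lock_file = Path(tmp) / "paper.ipynb.sha256"     # hypothetical file name

    notebook.write_bytes(b'{"cells": []}')
    lock_file.write_text(sha256_of(notebook) + "\n")  # the "lock" step
    before_edit = notebook_is_frozen(notebook, lock_file)

    notebook.write_bytes(b'{"cells": [], "edited": true}')
    after_edit = notebook_is_frozen(notebook, lock_file)

print(f"frozen before edit: {before_edit}, after edit: {after_edit}")
```

Any byte-level change to the notebook, including re-saved metadata, changes the digest, which is the strictness a frozen artefact needs.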
### 8. Why split optional deps into `[hf]`, `[eval]`, `[mlflow]`, `[dev]`?
The slim production image (`backend:latest`) does NOT need transformers,
torch, pycocoevalcap, or MLflow. Bundling them adds ~1.5 GB of dependencies
the production code never imports. Extras let `pip install -e ".[hf]"` add
the HuggingFace baselines for the Phase 3 comparison demo, while
`pip install -r requirements.txt` keeps the production install lean.
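
Code that depends on an extra can guard its imports so a lean install fails with an actionable message instead of a bare `ModuleNotFoundError`. The helpers below are a hypothetical sketch, not existing project code; the missing module name is deliberately fake so the example behaves the same everywhere.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def require_extra(module: str, extra: str) -> None:
    """Raise a helpful ImportError if an optional dependency is missing."""
    if not has_module(module):
        raise ImportError(
            f"'{module}' is not installed; "
            f'install it with: pip install -e ".[{extra}]"'
        )


# 'json' ships with Python; the second name is intentionally nonexistent.
print(has_module("json"))
try:
    require_extra("surely_missing_module_for_demo", "hf")
except ImportError as exc:
    print(exc)
```

In the real package, a call such as `require_extra("transformers", "hf")` at the top of the baseline module would play this role.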
### 9. Why MIT license?
The IEEE paper is published under IEEE's standard terms; the *code* is
covered separately. MIT is one of the most permissive widely recognised licenses;
it lets recruiters, students, and other researchers freely fork, learn from,
and extend the code. For a recruiter-grade portfolio project, permissive
licensing signals "I want this work to be useful," which is the right tone.
### 10. Why folder name `configs/` (plural), not `config/` (singular)?
`config/` was the empty folder shipped with the template. The plural form
`configs/` is a common convention in Python ML projects (Detectron2 and
MMDetection, for example, keep their config files under `configs/`) because
it holds multiple files (one per environment, model variant, or run).
Phase 1 creates `configs/` with content; the empty `config/` folder will
be removed in the Phase 1 commit that introduces the YAML files.
---
## What this phase deliberately does NOT do
- **No code is moved out of the notebook yet.** That's Phase 1, behind a
parity validation gate.
- **No `src/captioning/` modules are created.** Empty `__init__.py` files
would just be churn; Phase 1 will create them with real code.
- **No Dockerfile or docker-compose.yml.** They depend on `backend/app/`
existing; both arrive in Phase 1.
- **No GitHub Actions workflows.** They live in Phase 2, after there is
Python code to lint and type-check.
- **No README rewrite.** The current README accurately describes the
research; the demo-link rewrite happens in Phase 2 once a live URL exists.
This restraint is deliberate. Each phase ships a coherent slice of value;
running ahead would create half-built features and vague commits.
---
## Local setup checklist for the developer
After pulling this commit, on a fresh dev box:
```bash
# 1. Create a Python 3.10 virtual environment.
python -m venv .venv
.venv\Scripts\activate # PowerShell
# source .venv/bin/activate # Linux/macOS
# 2. Install dev dependencies + the package (editable).
make install-dev
# Or, without Make:
# pip install -r requirements-dev.txt -r requirements-eval.txt
# pip install -e ".[hf,mlflow]"
# 3. Register pre-commit hooks.
make install-hooks
# Or: pre-commit install
# 4. (Optional) Lock the paper notebook's hash, so CI can enforce parity.
make lock-paper-notebook
# 5. Verify everything works.
make pre-commit # Run all hooks against all files
make test # No tests yet; exits cleanly with "no tests collected"
```
The first `make install-dev` will take a few minutes (TensorFlow is large).
Subsequent runs hit the wheel cache and complete in seconds.