# Phase 0: Bootstrap (decision log)

> Phase 0 establishes the engineering scaffolding the rest of the project will
> stand on. Nothing here changes the model; everything here changes how the
> repo *looks and behaves* to the next person who clones it (including
> recruiters and CI runners).

## What this phase delivers

| Artefact | Purpose |
|---|---|
| [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](../notebooks/01_ieee_inceptionv3_transformer.ipynb) | Renamed from `image-captionin-using-dl.ipynb` via `git mv` to preserve history. Now the canonical, frozen IEEE artefact. |
| [`notebooks/README.md`](../notebooks/README.md) | Documents the frozen-notebook policy and conventions for any new notebooks. |
| [`pyproject.toml`](../pyproject.toml) | Single source of truth for the `captioning` Python package, dependency groups, and tool config (ruff/mypy/pytest/coverage). |
| [`requirements.txt`](../requirements.txt) | Pinned runtime deps, used directly by Docker and CI (mirrors `[project.dependencies]`). |
| [`requirements-dev.txt`](../requirements-dev.txt) | Pinned dev deps (lint, type-check, test, hooks). |
| [`requirements-eval.txt`](../requirements-eval.txt) | Pinned metric deps, kept separate to avoid bloating the serving image. |
| [`.python-version`](../.python-version) | Pins Python 3.10 for `pyenv` users. |
| [`.env.example`](../.env.example) | Schema for `pydantic-settings`-loaded env vars. |
| [`.pre-commit-config.yaml`](../.pre-commit-config.yaml) | Hooks: ruff, mypy, nbstripout, prettier (frontend), gitleaks. |
| [`Makefile`](../Makefile) | Discoverable command index (`make help`). |
| [`LICENSE`](../LICENSE) | MIT license, attribution to original author. |
| [`.gitignore`](../.gitignore) | Production-grade exclusions, organised by purpose with explanatory comments. |
| [`docs/restructure-plan.md`](./restructure-plan.md) | Public-facing engineering plan for Phases 0–4. |

---

## Decisions and reasoning

### 1. Why `src/` layout over flat layout?

A flat layout (`captioning/` at repo root) lets test code accidentally import
from the working tree instead of the *installed* package. That hides bugs that
would only surface in production, where the tree layout is gone. The `src/`
layout forces every test, every script, and every import to go through the
installed package, which is exactly the path users will follow. The
[Python Packaging Authority documents these advantages](https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/),
and widely used Python projects such as pytest, Flask, and attrs use this layout.
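
One way to see the difference is to check where an imported module was actually loaded from. The sketch below is illustrative only: `loaded_from_flat_tree` is a hypothetical helper, not part of this repo, and it assumes POSIX-style paths and the package name `captioning`. It flags a module whose file sits directly under `<repo_root>/<package>/`, which is the shadow import a flat layout allows.

```python
from pathlib import PurePosixPath


def loaded_from_flat_tree(module_file: str, repo_root: str, package: str) -> bool:
    """True when module_file lives at <repo_root>/<package>/..., i.e. the working tree."""
    try:
        relative = PurePosixPath(module_file).relative_to(PurePosixPath(repo_root))
    except ValueError:
        # The module lives outside the repo, e.g. in site-packages.
        return False
    return relative.parts[:1] == (package,)


# Flat layout: the working-tree copy is importable straight from the repo root.
print(loaded_from_flat_tree("/repo/captioning/__init__.py", "/repo", "captioning"))
# src layout: the same files sit under src/, so the check does not fire.
print(loaded_from_flat_tree("/repo/src/captioning/__init__.py", "/repo", "captioning"))
```

In real use, `module_file` would come from something like `captioning.__file__` inside a test session.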

### 2. Why `pyproject.toml` AND `requirements.txt`?

They serve different audiences:

- **`pyproject.toml`** is the *source of truth* for the package: its name,
  version, abstract dependency ranges, optional extras, and tool configuration.
  When you `pip install -e ".[dev]"`, this is what pip reads.
- **`requirements.txt`** is the *concretely pinned snapshot* used by Docker
  builds, CI runners, and anyone who wants `pip install -r requirements.txt`
  without installing the package itself. It's regenerable from `pyproject.toml`
  via `pip-compile`, but committing it explicitly makes installs deterministic
  and diffable.

Phase 5+ will switch to `pip-compile` for automated regeneration; for now,
manual mirroring is simpler and beginner-readable.
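
While the mirroring is manual, a small check can catch entries in a requirements file that are not exact pins. This is a minimal sketch under stated assumptions: `unpinned_requirements` is a hypothetical helper, the sample text is invented for illustration and is not the project's real file, and the parser handles only simple one-line `name==version` entries (no line continuations or URLs).

```python
import re

# Matches "name==version", optionally with extras and an environment marker.
_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^\s;]+(\s*;.*)?$")


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with an exact '=='."""
    loose = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        if not _PIN.match(line):
            loose.append(line)
    return loose


# Illustrative content only; not the project's real requirements.txt.
sample = """
# runtime deps
tensorflow-cpu==2.15.0
numpy>=1.24
pillow
"""
print(unpinned_requirements(sample))  # ['numpy>=1.24', 'pillow']
```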

### 3. Why pin `tensorflow-cpu==2.15.0` so hard?

Two independent reasons stack:

1. **`tensorflow-cpu` (not `tensorflow`)**: the GPU build pulls ~600 MB of
   CUDA libraries that are useless on CPU-only HuggingFace Spaces. Choosing
   the CPU-only wheel keeps the serving image well under 1.5 GB.
2. **2.15 specifically**: TF 2.16 swapped to Keras 3 by default. The IEEE
   notebook uses `tf.keras.layers.TextVectorization` with the Keras 2
   save/load API. Upgrading silently changes vocab serialisation, which
   silently changes BLEU. Pinning is the difference between
   *reproducible-published-result* and *reproducibility theatre*.

When Phase 5+ migrates to a modern multimodal backbone, this pin will move
in a deliberate, tested step, not by accident.
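
A pin in a requirements file can be backed by a runtime guard so that a drifted environment fails loudly. The sketch below is an assumption about how such a guard could look, not code from this repo; `is_pinned_tf_series` and `assert_tf_pin` are hypothetical names, and the version string is passed in so the example runs without TensorFlow installed.

```python
def is_pinned_tf_series(version: str, series: tuple[int, int] = (2, 15)) -> bool:
    """True when `version` belongs to the pinned major.minor series."""
    parts = version.split(".")
    try:
        major, minor = int(parts[0]), int(parts[1])
    except (IndexError, ValueError):
        return False
    return (major, minor) == series


def assert_tf_pin(version: str) -> None:
    """Raise if the running TensorFlow is outside the pinned 2.15 series."""
    if not is_pinned_tf_series(version):
        raise RuntimeError(
            f"Expected TensorFlow 2.15.x, found {version}; "
            "vocabulary serialisation may differ from the published run."
        )


# In real code the version would come from `tensorflow.__version__`.
assert_tf_pin("2.15.0")
print(is_pinned_tf_series("2.16.1"))  # False
```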

### 4. Why Ruff over Black + isort + flake8?

Ruff replaces all three with one tool that its maintainers benchmark at
10-100x faster, reads its config from `pyproject.toml`, and ships its own
formatter (`ruff format`) designed as a near drop-in replacement for Black
(output is almost identical, with a small set of documented deviations). One
install, one config, one cache. Recruiters reading the repo see a current
Python toolchain; CI runs faster; `make format` is one command, not three.

### 5. Why `nbstripout` is non-negotiable in pre-commit

Notebook outputs include base64-encoded images, full DataFrames, and
sometimes credentials printed by accident. Committed notebook diffs without
output stripping are unreadable (`+aaaaaaaaaa[base64]+aaaaa...`) and
occasionally leak data. `nbstripout` strips cell outputs and execution counts
on commit, keeping notebook history clean and reviewable.
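
To make the effect concrete, the sketch below shows what stripping means on the notebook's JSON structure. It is an illustration of the idea, not `nbstripout`'s actual implementation; `strip_outputs` is a hypothetical function, and the notebook dict is a hand-written nbformat-4 fragment.

```python
import copy


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


nb = {
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {
            "cell_type": "code",
            "source": ["print('hi')"],
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
}
print(strip_outputs(nb)["cells"][1]["outputs"])  # []
```

The source of each cell is untouched, so the diff a reviewer sees contains only code and markdown changes.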

### 6. Why include a `Makefile` on a Windows project?

Three reasons:

1. **CI runs on Linux**: every CI job uses the same Make targets, so the
   commands you run locally match what CI runs.
2. **Discoverability**: `make help` is one command that prints every
   high-level operation with a one-line description. A new contributor (or
   recruiter cloning the repo) sees the entire workflow in one screen.
3. **Tooling availability**: Make is a quick install on Windows
   (`winget install GnuWin32.Make`, Git Bash, or WSL). PowerShell users who
   skip Make can still read the Makefile and run the underlying commands
   directly.

### 7. Why a `freeze-paper-notebook` Make target?

The IEEE paper points reviewers at the notebook. If the notebook drifts from
what the paper describes, reviewers running it will see numbers that don't
match the paper. That is a scientific integrity issue, not a software
issue. The target hashes the notebook and asserts it matches a locked
SHA-256. Phase 4 wires this into CI as a required check on `main`.
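
The hash-and-compare step can be sketched in a few lines of standard-library Python. This is a minimal sketch, not the Makefile's actual recipe: the function names, the `.sha256` lock-file name, and the lock-file format (hex digest as the first whitespace-separated field) are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def notebook_matches_lock(notebook: Path, lock_file: Path) -> bool:
    """Compare the notebook's hash with the digest stored in the lock file."""
    fields = lock_file.read_text(encoding="utf-8").split()
    if not fields:  # an empty lock file never matches
        return False
    return sha256_of(notebook) == fields[0].lower()


# Demo on throwaway files (hypothetical names, not the real notebook).
with tempfile.TemporaryDirectory() as tmp:
    notebook = Path(tmp) / "paper.ipynb"
    lock = Path(tmp) / "paper.ipynb.sha256"
    notebook.write_bytes(b'{"cells": []}')
    lock.write_text(sha256_of(notebook) + "\n", encoding="utf-8")
    print(notebook_matches_lock(notebook, lock))  # True
    notebook.write_bytes(b'{"cells": [1]}')       # simulate drift
    print(notebook_matches_lock(notebook, lock))  # False
```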

### 8. Why split optional deps into `[hf]`, `[eval]`, `[mlflow]`, `[dev]`?

The slim production image (`backend:latest`) does NOT need transformers,
torch, pycocoevalcap, or MLflow. Bundling them adds ~1.5 GB of dependencies
the production code never imports. Extras let `pip install -e ".[hf]"` add
the HuggingFace baselines for the Phase 3 comparison demo, while
`pip install -r requirements.txt` keeps the production install lean.
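
Code that depends on an extra can check for it before importing, so the lean install fails with a clear message instead of a bare `ModuleNotFoundError`. The helpers below are hypothetical and only sketch the pattern; which module belongs to which extra (for example `transformers` under `[hf]`) is an assumption based on the list above.

```python
import importlib.util


def extra_available(module_name: str) -> bool:
    """True when a top-level module is importable, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_extra(module_name: str, extra: str) -> None:
    """Raise a helpful ImportError when an optional dependency is missing."""
    if not extra_available(module_name):
        raise ImportError(
            f"'{module_name}' is not installed; try: pip install -e \".[{extra}]\""
        )


# 'json' is in the standard library, so this passes on any install.
require_extra("json", "dev")
# Assumed mapping: the [hf] extra provides 'transformers'.
print(extra_available("transformers"))
```

Checking with `find_spec` avoids paying the import cost of a heavy library just to learn whether it is present.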

### 9. Why MIT license?

The IEEE paper is published under IEEE's standard terms; the *code* is
covered separately. MIT is one of the most permissive widely recognised
licenses: it lets recruiters, students, and other researchers freely fork,
learn from, and extend the code. For a recruiter-grade portfolio project,
permissive licensing signals "I want this work to be useful," which is the
right tone.

### 10. Why folder name `configs/` (plural), not `config/` (singular)?

`config/` was the empty folder shipped with the template. The plural form
`configs/` is a common convention in Python ML projects (Detectron2,
MMDetection, and many Hydra-based templates) because the folder holds
multiple files (one per environment, model variant, or run).
Phase 1 creates `configs/` with content; the empty `config/` folder will
be removed in the Phase 1 commit that introduces the YAML files.

---

## What this phase deliberately does NOT do

- **No code is moved out of the notebook yet.** That's Phase 1, behind a
  parity validation gate.
- **No `src/captioning/` modules are created.** Empty `__init__.py` files
  would just be churn; Phase 1 will create them with real code.
- **No Dockerfile or docker-compose.yml.** They depend on `backend/app/`
  existing; both arrive in Phase 1.
- **No GitHub Actions workflows.** They live in Phase 2, after there is
  Python code to lint and type-check.
- **No README rewrite.** The current README accurately describes the
  research; the demo-link rewrite happens in Phase 2 once a live URL exists.

This restraint is deliberate. Each phase ships a coherent slice of value;
running ahead would create half-built features and vague commits.

---

## Local setup checklist for the developer

After pulling this commit, on a fresh dev box:

```bash
# 1. Create a Python 3.10 virtual environment.
python -m venv .venv
.venv\Scripts\activate              # PowerShell
# source .venv/bin/activate         # Linux/macOS

# 2. Install dev dependencies + the package (editable).
make install-dev
# Or, without Make:
#   pip install -r requirements-dev.txt -r requirements-eval.txt
#   pip install -e ".[hf,mlflow]"

# 3. Register pre-commit hooks.
make install-hooks
# Or:  pre-commit install

# 4. (Optional) Lock the paper notebook's hash, so CI can enforce parity.
make lock-paper-notebook

# 5. Verify everything works.
make pre-commit                     # Run all hooks against all files
make test                           # No tests yet: pytest collects nothing (raw exit code 5, so the target must tolerate it)
```

The first `make install-dev` will take a few minutes (TensorFlow is large).
Subsequent runs hit the wheel cache and complete in seconds.