apoorvrajdev committed
Commit 2ab9a5b · 1 Parent(s): 3a2e5f0

docs: professionalize repository presentation and architecture documentation

Files changed (1)
  1. README.md +302 -123
README.md CHANGED
@@ -1,209 +1,388 @@
1
- # 🖼️ Image Captioning System (CNN + Transformer)
2
- 📄 Backed by an IEEE publication (see below)
3
 
4
- ![Python](https://img.shields.io/badge/Python-3.10-blue)
5
- ![Deep Learning](https://img.shields.io/badge/Deep%20Learning-TensorFlow-orange)
6
- ![Computer Vision](https://img.shields.io/badge/Computer%20Vision-CNN-red)
7
- ![NLP](https://img.shields.io/badge/NLP-Transformer-green)
8
- ![Dataset](https://img.shields.io/badge/Dataset-COCO-yellow)
9
- ![License](https://img.shields.io/badge/License-MIT-lightgrey)
10
 
11
- This project builds an **AI-powered image captioning system** that generates **natural language descriptions from images** using a hybrid **CNN + Transformer architecture**.
12
 
13
- The system understands visual content and produces **context-aware captions**, bridging the gap between **computer vision and natural language processing**.
14
 
15
- ---
16
 
17
- # 🚀 Live Demo
18
 
19
- [![Open Notebook](https://img.shields.io/badge/Open%20Kaggle%20Notebook-GPU-blue)](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
20
 
21
- OR explore the full pipeline here:
22
 
23
- 👉 Run the full pipeline on Kaggle: https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl
24
 
25
- The notebook includes:
26
 
27
- - End-to-end training pipeline
28
- - COCO dataset integration
29
- - Transformer-based caption generation
30
- - GPU-enabled execution
31
 
32
  ---
33
 
34
- # 📄 IEEE Research Publication
35
 
36
- This project is backed by an **IEEE published research paper**:
37
 
38
- [![IEEE Paper](https://img.shields.io/badge/View%20Research%20Paper-IEEE-blue)](https://ieeexplore.ieee.org/document/10675203)
39
 
40
- 📄 **Title:** AI Narratives: Bridging Visual Content and Linguistic Expression
41
 
42
  ---
43
 
44
- ### 🧠 Key Contributions
45
 
46
- - Designed a hybrid **CNN + Transformer architecture** for image captioning
47
- - Leveraged **InceptionV3** for visual feature extraction
48
- - Implemented **attention-based sequence generation**
49
- - Achieved improved caption quality using **BLEU evaluation**
50
- - Compared multiple CNN backbones (VGG, ResNet, Inception)
51
 
52
  ---
53
 
54
- ### 🚀 Practical Impact
55
 
56
- - Combines **computer vision and NLP** for real-world multimodal applications
57
- - Demonstrates ability to build **end-to-end deep learning pipelines**
58
- - Trained and evaluated on **COCO benchmark dataset** used in industry research
59
 
60
- # 🧠 Model Overview
61
 
62
- The system uses a **two-stage architecture**:
63
 
64
- ### 🔹 Encoder (Vision)
65
- - **InceptionV3 (CNN)**
66
- - Extracts high-level spatial features from images
67
- - Converts image → feature vector
68
 
69
- ### 🔹 Decoder (Language)
70
- - **Transformer Decoder**
71
- - Generates captions word-by-word using attention
72
- - Captures long-range dependencies in text
73
 
74
  ---
75
 
76
- # 🔄 Caption Generation Pipeline
77
 
78
- Image → CNN Encoder → Feature Embeddings → Transformer Decoder → Caption
79
 
80
- ---
81
 
82
- # 📸 Sample Outputs
83
 
84
- ### 🟒 Example 1
85
- **Generated Caption:**
86
- `a man is standing on a beach with a surfboard`
87
 
88
- *<img width="923" height="906" alt="image" src="https://github.com/user-attachments/assets/64e8412b-1d49-404c-a5b2-1da121b224e2" />
89
- *
90
 
91
  ---
92
 
93
- ### 🟒 Example 2
94
- **Generated Caption:**
95
- `a man riding a motorcycle on a street`
96
- *<img width="832" height="857" alt="image" src="https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd" />
97
- *
98
 
99
- ---
100
 
101
- # 📊 Model Performance
102
 
103
- The model was evaluated using **BLEU Score**, a standard NLP metric for text generation.
104
 
105
- | Metric | Value |
106
- |--------|------|
107
- | BLEU Score | ~24 |
108
 
109
- ### Key Observations:
110
- - Generates **semantically meaningful captions**
111
- - Performs well on **common objects and scenes**
112
- - Slight limitations on **complex multi-object scenes**
113
 
114
- ---
115
 
116
- # 📂 Dataset
117
 
118
- The model is trained on the **COCO 2017 Dataset**, a large-scale benchmark dataset for image captioning.
119
 
120
- Dataset characteristics:
121
 
122
- - 200,000+ images
123
- - 80 object categories
124
- - Multiple captions per image
125
- - Rich annotations for training
126
 
127
  ---
128
 
129
- # ⚙️ Deep Learning Pipeline
130
 
131
- The project follows a complete deep learning workflow:
132
 
133
- 1. Image preprocessing (resize, normalization)
134
- 2. Feature extraction using InceptionV3
135
- 3. Caption preprocessing (tokenization, padding)
136
- 4. Vocabulary creation
137
- 5. Transformer model training
138
- 6. Loss optimization (Cross-Entropy)
139
- 7. Model evaluation using BLEU score
140
- 8. Inference on unseen images
141
 
142
  ---
143
 
144
- # 🧰 Technologies Used
145
 
146
- - Python
147
- - TensorFlow / Keras
148
- - CNN (InceptionV3)
149
- - Transformer Architecture
150
- - NumPy, Pandas
151
- - Matplotlib
152
- - Jupyter Notebook
153
 
154
  ---
155
 
156
- # 📁 Project Structure
157
 
158
  ```
159
 
160
- image-captioning-system
161
- │
162
- ├── image_captioning.ipynb
163
- ├── assets/
164
- ├── requirements.txt
165
- └── README.md
166
 
167
  ---
168
 
169
- # 🧪 Research Contribution
170
 
171
- This project is based on an **IEEE research publication**:
172
 
173
- 📄 AI Narratives: Bridging Visual Content and Linguistic Expression
174
 
175
- Key contributions:
176
 
177
- - Integration of **CNN + Transformer architecture**
178
- - Improved caption generation using **attention mechanisms**
179
- - Comparative analysis of CNN encoders (VGG, ResNet, Inception)
180
- - Enhanced tokenization strategies for better language modeling
181
 
182
  ---
183
 
184
- # ⚠️ Limitations
185
 
186
- - Struggles with highly complex or cluttered scenes
187
- - May generate generic captions for rare objects
188
- - Requires large datasets and compute for training
189
 
190
  ---
191
 
192
- # πŸš€ Future Improvements
193
 
194
- - Replace CNN with **Vision Transformer (ViT)**
195
- - Use pretrained models like **BLIP / CLIP**
196
- - Optimize inference using **TensorRT / ONNX**
197
- - Deploy as **FastAPI-based real-time API**
198
- - Multi-GPU distributed training
199
 
200
  ---
201
 
202
- # 👨‍💻 Author
203
 
204
- **Apoorv Raj**
205
- AI Systems Engineer | Deep Learning | ML Infrastructure
206
 
207
  ---
208
 
209
- ⭐ If you found this project useful, consider giving it a **star** on GitHub.
1
+ # Image Captioning System
2
+
3
+ > CNN + Transformer architecture for visual-to-language generation, restructured from an IEEE-published research notebook into a production-style multimodal AI codebase.
4
+
5
+ <p align="left">
6
+ <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-3776AB?logo=python&logoColor=white">
7
+ <img alt="TensorFlow 2.15" src="https://img.shields.io/badge/TensorFlow-2.15-FF6F00?logo=tensorflow&logoColor=white">
8
+ <img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white">
9
+ <img alt="FastAPI ready" src="https://img.shields.io/badge/FastAPI-ready-009688?logo=fastapi&logoColor=white">
10
+ </p>
11
+
12
+ <p align="left">
13
+ <img alt="Ruff" src="https://img.shields.io/badge/lint-ruff-261230?logo=ruff&logoColor=white">
14
+ <img alt="mypy" src="https://img.shields.io/badge/typed-mypy-1F5082">
15
+ <img alt="Tests" src="https://img.shields.io/badge/tests-37%20passing-brightgreen">
16
+ <img alt="Pre-commit" src="https://img.shields.io/badge/pre--commit-enabled-FAB040?logo=pre-commit&logoColor=white">
17
+ </p>
18
+
19
+ <p align="left">
20
+ <img alt="IEEE Published" src="https://img.shields.io/badge/IEEE-published-00629B?logo=ieee&logoColor=white">
21
+ <img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-lightgrey">
22
+ <img alt="Status" src="https://img.shields.io/badge/status-Phase%201%20complete-blue">
23
+ </p>
24
 
25
+ ---
26
 
27
+ ## Overview
28
 
29
+ This repository implements an **end-to-end image-captioning pipeline** built around an InceptionV3 visual encoder and a custom multi-head Transformer decoder. The architecture is the basis of the IEEE-published paper *"AI Narratives: Bridging Visual Content and Linguistic Expression"*; this codebase lifts the original Kaggle research notebook into a typed, tested, configuration-driven Python package that can be reused from CLI, scripts, or a future serving layer.
30
 
31
+ The repository is structured in deliberate phases:
32
 
33
+ | Phase | Focus | Status |
34
+ |---|---|---|
35
+ | 0: Bootstrap | Tooling, packaging, freeze policy | ✅ complete |
36
+ | 1: Modularisation | Notebook → typed Python package, parity audit, unit tests | ✅ complete |
37
+ | 2: Serving | FastAPI inference API + HuggingFace Spaces deploy | ⏳ planned |
38
+ | 3: Multimodal baselines | BLIP / ViT-GPT2 / GIT side-by-side comparison | ⏳ planned |
39
+ | 4: Observability | Sentry, Prometheus metrics, ADRs | ⏳ planned |
40
 
41
+ Phase notes live under [`docs/`](docs/): [restructure plan](docs/restructure-plan.md) · [Phase 0 notes](docs/PHASE_0_NOTES.md) · [Phase 1 notes](docs/PHASE_1_NOTES.md).
42
+
43
+ ---
44
 
45
+ ## Research backing
46
 
47
+ The model architecture and the BLEU-4 ~24 baseline below come from the IEEE paper and its accompanying notebook:
48
 
49
+ - **Paper:** [AI Narratives: Bridging Visual Content and Linguistic Expression](https://ieeexplore.ieee.org/document/10675203) (IEEE)
50
+ - **Original notebook:** [Kaggle: image-captioning-using-dl](https://www.kaggle.com/code/apoorvujjwal/image-captionin-using-dl)
51
+ - **Frozen artefact in this repo:** [`notebooks/01_ieee_inceptionv3_transformer.ipynb`](notebooks/01_ieee_inceptionv3_transformer.ipynb), kept byte-stable; CI enforces its SHA-256.
52
 
53
+ The notebook is preserved verbatim as the canonical research artefact. Improvements happen in the modular package; the notebook itself is never modified.
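A freeze check of this kind can be sketched with the standard library alone. The snippet below is an illustration under stated assumptions, not the repository's `make freeze-paper-notebook` implementation: `sha256_of_file` and `verify_frozen` are hypothetical helper names, and the lock file is assumed to start with a single hex digest.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_frozen(artifact: Path, lock_file: Path) -> bool:
    """Compare the artifact's current digest with the digest stored in lock_file."""
    # Assumption: the first whitespace-separated token of the lock file is the hex digest.
    expected = lock_file.read_text(encoding="utf-8").split()[0].lower()
    return sha256_of_file(artifact) == expected
```

A CI job could call `verify_frozen(...)` and fail the build when it returns `False`, which is one way to make any byte-level change to the notebook visible.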
54
 
55
  ---
56
 
57
+ ## Architecture
58
 
59
+ ```
60
+ ┌──────────────┐   ┌─────────────────┐   ┌──────────────────┐   ┌──────────────────┐   ┌────────────┐
61
+ │ Input image  │──▶│ InceptionV3     │──▶│ Transformer      │──▶│ Transformer      │──▶│ Caption    │
62
+ │ 299x299x3    │   │ encoder         │   │ encoder          │   │ decoder          │   │ string     │
63
+ └──────────────┘   │ (ImageNet,      │   │ (1 layer,        │   │ (2 layers,       │   └────────────┘
64
+                    │ frozen)         │   │ 1 head)          │   │ 8 heads)         │
65
+                    └─────────────────┘   └──────────────────┘   └──────────────────┘
66
+                             ▼                      ▼                      ▼
67
+                       [B, 64, 2048]          [B, 64, 512]           [B, T, vocab]
68
+                      patch features       projected features   softmax over 15k tokens
69
+ ```
70
 
71
+ ### Components
72
 
73
+ - **CNN encoder:** [`models/encoder_cnn.py`](src/captioning/models/encoder_cnn.py). Pretrained InceptionV3 with the classification head removed; output reshaped to a sequence of 64 spatial positions × 2048 channels. Weights are frozen during training.
74
+ - **Transformer encoder:** [`models/transformer_encoder.py`](src/captioning/models/transformer_encoder.py). Single layer with one attention head. Projects InceptionV3 features into the decoder's embedding dimension and lets the decoder attend across spatial positions.
75
+ - **Embeddings:** [`models/embeddings.py`](src/captioning/models/embeddings.py). Sum of token and *learned* positional embeddings (not sinusoidal; preserved from the published architecture).
76
+ - **Transformer decoder:** [`models/transformer_decoder.py`](src/captioning/models/transformer_decoder.py). Causal self-attention over partial captions, cross-attention over image features, and a feed-forward sub-block. 8 heads, `embedding_dim=512`, dropouts (0.1 / 0.3 / 0.5) preserved from the IEEE configuration.
77
+ - **Captioning model:** [`models/captioning_model.py`](src/captioning/models/captioning_model.py). Custom `train_step` / `test_step` with masked sparse-categorical cross-entropy and masked accuracy.
78
+ - **Tokenizer:** [`preprocessing/tokenizer.py`](src/captioning/preprocessing/tokenizer.py). `CaptionTokenizer` wraps `tf.keras.layers.TextVectorization`; persists the vocabulary as both a pickle (notebook-compatible) and a JSON sidecar.
79
+ - **Inference:** [`inference/predictor.py`](src/captioning/inference/predictor.py). `CaptionPredictor.from_artifacts(weights_path, tokenizer_dir, config)` loads everything once at boot, exposes `predict_path(...)` and `predict_tensor(...)` for stateless calls, and `warmup()` to reduce first-request latency.
80
+ - **Configuration:** [`config/schema.py`](src/captioning/config/schema.py). Pydantic v2 schemas (`AppConfig` / `ModelConfig` / `TrainConfig` / `DataConfig` / `ServeConfig`); strict (`extra="forbid"`) so typos in YAML or env vars become load-time errors instead of silent drift.
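To make "masked" concrete for the captioning model's loss and accuracy, here is a minimal pure-Python sketch. It is not the code in `captioning_model.py`: the function name is hypothetical, it works on plain lists rather than tensors, and it assumes token id 0 is padding (the index Keras `TextVectorization` reserves for padding in integer output mode).

```python
import math
from typing import Sequence

PAD_ID = 0  # assumption: index 0 is the padding token


def masked_loss_and_accuracy(
    targets: Sequence[int],
    probs: Sequence[Sequence[float]],
) -> tuple[float, float]:
    """Mean cross-entropy and accuracy over non-padding positions only.

    targets: token ids for one caption, padded with PAD_ID.
    probs:   one probability distribution over the vocabulary per position.
    """
    total_loss = 0.0
    correct = 0
    counted = 0
    for target, dist in zip(targets, probs):
        if target == PAD_ID:
            continue  # padded positions contribute to neither metric
        total_loss += -math.log(max(dist[target], 1e-12))
        predicted = max(range(len(dist)), key=dist.__getitem__)
        correct += int(predicted == target)
        counted += 1
    if counted == 0:
        return 0.0, 0.0
    return total_loss / counted, correct / counted


targets = [2, 3, 0, 0]  # two real tokens followed by two padding positions
probs = [
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.6, 0.1, 0.2],
    [0.25, 0.25, 0.25, 0.25],
    [0.25, 0.25, 0.25, 0.25],
]
loss, accuracy = masked_loss_and_accuracy(targets, probs)
```

Only the first two positions are scored here, so the accuracy is 1 of 2 regardless of what the model predicts at the padded positions.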
81
 
82
  ---
83
 
84
+ ## Sample outputs
85
+
86
+ | Image | Generated caption |
87
+ |---|---|
88
+ | ![](https://github.com/user-attachments/assets/64e8412b-1d49-404c-a5b2-1da121b224e2) | *a man is standing on a beach with a surfboard* |
89
+ | ![](https://github.com/user-attachments/assets/c802d420-a1c1-48be-8e79-599f193c72cd) | *a man riding a motorcycle on a street* |
90
 
91
+ Outputs above are from the IEEE notebook; the modular pipeline reproduces these via the parity audit ([`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py)).
92
 
93
  ---
94
 
95
+ ## Performance
96
 
97
+ | Metric | Value | Source |
98
+ |---|---|---|
99
+ | BLEU-4 | ~24 | Reported in the IEEE paper / Kaggle notebook |
100
+ | Vocabulary size | 15,000 tokens | TextVectorization adapt over preprocessed COCO captions |
101
+ | Training set | ~120k captions sampled from COCO 2017 | `data.sample_size` in [`configs/base.yaml`](configs/base.yaml) |
102
+ | Image resolution | 299 × 299 (InceptionV3) | [`preprocessing/image.py`](src/captioning/preprocessing/image.py) |
103
+ | Max caption length | 40 tokens | `model.max_length` in [`configs/base.yaml`](configs/base.yaml) |
104
 
105
+ > Re-training on the modular pipeline is a Phase 2 deliverable; once a fresh checkpoint exists, this table will be expanded with corpus BLEU-1..4, CIDEr, METEOR, and ROUGE-L. Only BLEU ships in [`evaluation/`](src/captioning/evaluation/) today; the other metrics are planned for Phase 1b.
106
 
107
+ ---
108
 
109
+ ## Project structure
110
 
111
+ ```
112
+ image-captioning-system/
113
+ ├── notebooks/
114
+ │   ├── 01_ieee_inceptionv3_transformer.ipynb   # FROZEN: IEEE research artefact
115
+ │   └── README.md                               # Frozen-notebook policy
116
+ │
117
+ ├── src/captioning/                             # Installable package
118
+ │   ├── config/          schema.py · loader.py
119
+ │   ├── preprocessing/   caption.py · image.py · tokenizer.py · augmentation.py
120
+ │   ├── data/            coco.py · splits.py · pipeline.py
121
+ │   ├── models/          encoder_cnn.py · transformer_encoder.py · embeddings.py
122
+ │   │                    transformer_decoder.py · captioning_model.py · factory.py
123
+ │   ├── training/        losses.py · callbacks.py · trainer.py
124
+ │   ├── inference/       image_loader.py · greedy.py · predictor.py
125
+ │   ├── evaluation/      bleu.py
126
+ │   └── utils/           logging.py · seed.py · hashing.py
127
+ │
128
+ ├── configs/
129
+ │   ├── base.yaml                               # IEEE hyperparameters (cell 6 mirror)
130
+ │   └── train/debug.yaml                        # CI smoke override
131
+ │
132
+ ├── scripts/
133
+ │   ├── train.py · evaluate.py · predict.py
134
+ │   └── notebook_module_audit.py                # Parity gate vs. notebook
135
+ │
136
+ ├── tests/unit/
137
+ │   ├── test_caption_preprocessing.py · test_config.py · test_splits.py
138
+ │   ├── test_tokenizer.py · test_image_preprocessing.py
139
+ │   ├── test_evaluation.py · test_hashing.py
140
+ │   └── conftest.py
141
+ │
142
+ ├── docs/
143
+ │   ├── restructure-plan.md · PHASE_0_NOTES.md · PHASE_1_NOTES.md
144
+ │
145
+ ├── pyproject.toml · requirements*.txt · Makefile
146
+ ├── .pre-commit-config.yaml · .python-version · .env.example
147
+ ├── .paper-notebook.sha256                      # Locked notebook hash for CI freeze check
148
+ └── README.md
149
+ ```
150
 
151
  ---
152
 
153
+ ## Setup
154
 
155
+ Requires **Python 3.10 or 3.11** with the pinned TensorFlow 2.15, which publishes wheels only up to Python 3.11 (Python 3.12 support starts with TensorFlow 2.16).
156
 
157
+ ### PowerShell (Windows)
158
 
159
+ ```powershell
160
+ py -3.10 -m venv .venv
161
+ .venv\Scripts\activate
162
+ pip install -r requirements-dev.txt -r requirements-eval.txt
163
+ pip install -e ".[hf,mlflow]"
164
+ pre-commit install
165
+ ```
166
 
167
+ ### bash (Linux / macOS)
168
 
169
+ ```bash
170
+ python3.10 -m venv .venv
171
+ source .venv/bin/activate
172
+ pip install -r requirements-dev.txt -r requirements-eval.txt
173
+ pip install -e ".[hf,mlflow]"
174
+ pre-commit install
175
+ ```
176
+
177
+ `make help` lists every available command (lint, format, type-check, test, train, serve, evaluate, predict, Docker, freeze-paper-notebook, ...).
178
 
179
  ---
180
 
181
+ ## Training
182
 
183
+ The training script consumes a YAML config validated by Pydantic:
184
+
185
+ ```bash
186
+ python -m scripts.train --config configs/base.yaml
187
+ ```
188
 
189
+ Override fields without editing YAML:
190
 
191
+ ```bash
192
+ # CI smoke run on a 64-caption subset (1 epoch, batch 8)
193
+ python -m scripts.train --config configs/base.yaml --override configs/train/debug.yaml
194
 
195
+ # Env-var override (double-underscore = nesting delimiter)
196
+ CAPTIONING__TRAIN__BATCH_SIZE=32 python -m scripts.train --config configs/base.yaml
197
+ ```
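The double-underscore convention can be illustrated with a small standard-library sketch. This is not the repository's loader (which delegates to Pydantic settings); `apply_env_overrides` is a hypothetical helper, and the `CAPTIONING__` prefix is taken from the example above.

```python
import copy
from typing import Any, Mapping

PREFIX = "CAPTIONING__"  # assumption: prefix as used in the example command
DELIMITER = "__"         # double underscore separates nesting levels


def apply_env_overrides(config: dict[str, Any], environ: Mapping[str, str]) -> dict[str, Any]:
    """Return a copy of config with PREFIX-ed environment variables applied."""
    result = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(PREFIX):
            continue  # unrelated environment variable
        path = name[len(PREFIX):].lower().split(DELIMITER)
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        # Values stay strings here; a schema layer would coerce and validate them.
        node[path[-1]] = raw
    return result


base = {"train": {"batch_size": 64, "epochs": 10}}
overridden = apply_env_overrides(base, {"CAPTIONING__TRAIN__BATCH_SIZE": "32", "HOME": "/tmp"})
```

In this sketch `overridden["train"]["batch_size"]` becomes the string `"32"` while `base` is left untouched; turning that string into a validated integer is the job of the schema layer.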
198
 
199
+ Outputs (`weights.h5`, `vocab.pkl` + `vocab.json` sidecar, `history.json`, `training_log.csv`) land under `outputs/runs/latest/` by default.
200
 
201
+ The `Trainer` ([`training/trainer.py`](src/captioning/training/trainer.py)) wraps `model.compile + model.fit` with structured logging and history serialisation; everything else (loss, callbacks, optimizer choice) sits in dedicated modules so each piece can be unit-tested in isolation.
202
 
203
+ ---
204
 
205
+ ## Evaluation
206
 
207
+ ```bash
208
+ python -m scripts.evaluate \
209
+ --config configs/base.yaml \
210
+ --weights models/v1.0.0/model.h5 \
211
+ --tokenizer-dir models/v1.0.0 \
212
+ --report docs/results/v1.0.0.md \
213
+ --max-samples 500
214
+ ```
215
 
216
+ Phase 1 ships **corpus BLEU-4 via sacrebleu** (deterministic, reproducible). CIDEr / METEOR / ROUGE-L slot into [`src/captioning/evaluation/`](src/captioning/evaluation/) in Phase 1b under the same runner interface.
217
 
218
  ---
219
 
220
+ ## Inference
221
 
222
+ ### Python API
223
 
224
+ ```python
225
+ from captioning.config import load_config
226
+ from captioning.inference import CaptionPredictor
227
+
228
+ config = load_config("configs/base.yaml")
229
+ predictor = CaptionPredictor.from_artifacts(
230
+     weights_path="models/v1.0.0/model.h5",
231
+     tokenizer_dir="models/v1.0.0",
232
+     config=config,
233
+ )
234
+ predictor.warmup()  # one dummy forward pass to cut first-request latency
235
+ caption = predictor.predict_path("photo.jpg")
236
+ print(caption)
237
+ ```
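The project ships greedy decoding (`inference/greedy.py`). The sketch below shows the general technique rather than that module's code: `greedy_decode`, the `[start]` / `[end]` token strings, and the toy model are all hypothetical, and the default `max_length` of 40 mirrors `model.max_length` from the configuration.

```python
from typing import Callable, Sequence

START, END = "[start]", "[end]"  # hypothetical special tokens


def greedy_decode(
    next_token_probs: Callable[[Sequence[str]], dict[str, float]],
    max_length: int = 40,
) -> str:
    """Append the single most probable token at each step until END or max_length."""
    tokens = [START]
    for _ in range(max_length):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        if best == END:
            break
        tokens.append(best)
    return " ".join(tokens[1:])


# Toy stand-in for the trained decoder: it always favours the next scripted token.
SCRIPT = ["a", "man", "riding", "a", "motorcycle", END]


def toy_model(prefix: Sequence[str]) -> dict[str, float]:
    step = len(prefix) - 1  # number of tokens generated so far
    target = SCRIPT[min(step, len(SCRIPT) - 1)]
    return {token: (0.9 if token == target else 0.02) for token in set(SCRIPT)}


caption = greedy_decode(toy_model)
print(caption)  # a man riding a motorcycle
```

Greedy decoding commits to one token per step and never revisits it; beam search, listed for Phase 1b, would instead keep several candidate prefixes alive at each step.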
238
+
239
+ ### CLI
240
+
241
+ ```bash
242
+ python -m scripts.predict \
243
+ --config configs/base.yaml \
244
+ --weights models/v1.0.0/model.h5 \
245
+ --tokenizer-dir models/v1.0.0 \
246
+ --image samples/photo.jpg
247
+ ```
248
+
249
+ ### REST API (planned, Phase 2)
250
+
251
+ A FastAPI service in `backend/app/` will expose `POST /v1/captions` (multipart upload), `GET /v1/model/info`, and `GET /healthz`, deployed to HuggingFace Spaces with the trained checkpoint pulled from the HuggingFace Hub at boot. The `CaptionPredictor` interface is the seam: the FastAPI lifespan instantiates one and reuses it across every request.
252
 
253
  ---
254
 
255
+ ## Configuration system
256
+
257
+ Hyperparameters are not globals. They live in YAML files validated by Pydantic v2 `BaseSettings` (provided by the separate `pydantic-settings` package in v2):
258
+
259
+ ```yaml
260
+ # configs/base.yaml: mirrors the IEEE notebook cell 6 verbatim
261
+ model:
262
+   embedding_dim: 512
263
+   units: 512
264
+   max_length: 40
265
+   vocabulary_size: 15000
266
+   decoder_num_heads: 8
267
+   decoder_dropout_inner: 0.3
268
+   decoder_dropout_outer: 0.5
269
+   decoder_attention_dropout: 0.1
270
+ train:
271
+   epochs: 10
272
+   batch_size: 64
273
+   early_stopping_patience: 3
274
+   seed: 42
275
+ data:
276
+   sample_size: 120000
277
+   train_val_split: 0.8
278
+ ```
279
+
280
+ Three load-time guarantees:
281
+
282
+ 1. **Type validation.** A wrongly typed value such as `batch_size: "sixty-four"` raises a `ValidationError` pointing at the field, not a downstream tensor-shape error. Note that Pydantic v2 coerces a numeric string like `"64"` to `int` unless the field or model is in strict mode.
283
+ 2. **No silent typos.** `extra="forbid"` rejects unknown keys (e.g. `vocabularsy_size`). A mistyped hyperparameter that silently falls back to its default is one of the hardest failures to notice in ML work, and `extra="forbid"` turns it into a load-time error.
284
+ 3. **Env overrides.** `CAPTIONING__TRAIN__BATCH_SIZE=32` overrides at any nesting depth, which is useful for CI smoke tests, ablations, and serve-time tuning without rebuilding images.
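The first two guarantees can be illustrated without Pydantic. The sketch below is a standard-library stand-in, not the repository's schema: `load_strict` is a hypothetical helper, and this `TrainConfig` dataclass only borrows the field names and defaults from `configs/base.yaml`. Unlike Pydantic's default lax mode, it refuses to coerce strings to integers.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 10
    batch_size: int = 64
    early_stopping_patience: int = 3
    seed: int = 42


def load_strict(cls, raw: dict):
    """Build cls from raw, rejecting unknown keys and wrongly typed values."""
    allowed = {f.name: f.type for f in fields(cls)}
    unknown = set(raw) - set(allowed)
    if unknown:
        # Mirrors the intent of extra="forbid": a typo is an error, not a silent default.
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    for key, value in raw.items():
        expected = allowed[key]
        # bool is a subclass of int, so reject it explicitly for integer fields.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{key}: expected {expected.__name__}, got {type(value).__name__}")
    return cls(**raw)


config = load_strict(TrainConfig, {"batch_size": 32})
```

Calling `load_strict(TrainConfig, {"batch_sise": 32})` raises `ValueError` in this sketch, which is the behaviour the second guarantee describes.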
285
 
286
+ Schema lives in [`src/captioning/config/schema.py`](src/captioning/config/schema.py); loader in [`config/loader.py`](src/captioning/config/loader.py).
287
 
288
  ---
289
 
290
+ ## Testing & code quality
291
 
292
+ ```bash
293
+ make test # pytest 37/37 (unit + integration)
294
+ make lint # Ruff lint + format check
295
+ make typecheck # mypy strict on src/captioning + scripts
296
+ make pre-commit # All hooks across all files
297
+ make freeze-paper-notebook # Asserts notebook SHA-256 unchanged
298
  ```
299
 
300
+ | Layer | Tool | Status |
301
+ |---|---|---|
302
+ | Lint + format | [Ruff](https://docs.astral.sh/ruff/) (replaces black + isort + flake8) | ✅ clean |
303
+ | Type-check | [mypy](https://mypy.readthedocs.io/) with `pandas-stubs`, `types-PyYAML`, `types-requests` | ✅ 0 errors / 34 files |
304
+ | Tests | pytest + pytest-cov + pytest-asyncio | ✅ 37 passing |
305
+ | Notebook hygiene | [`nbstripout`](https://github.com/kynan/nbstripout) (pre-commit) | ✅ outputs stripped on commit |
306
+ | Secret scanning | [`gitleaks`](https://github.com/gitleaks/gitleaks) (pre-commit) | ✅ enabled |
307
+ | Notebook integrity | SHA-256 freeze check via [`make freeze-paper-notebook`](Makefile) | ✅ locked |
308
+ | Parity audit | [`scripts/notebook_module_audit.py`](scripts/notebook_module_audit.py), 4 stages | ✅ all passing |
309
+
310
+ The parity audit re-implements four notebook stages inline (caption preprocessing, tokenizer vocabulary + encoding, image preprocessing, decoder forward pass) and asserts the modular path produces byte-identical (or `tf.allclose`-identical) output. It is the contract that gates any behavioural improvement.
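The shape of such a parity gate can be sketched as follows. This is not `scripts/notebook_module_audit.py`: `outputs_match` and `assert_parity` are hypothetical names, the tolerances are illustrative, and the example compares two trivial string functions instead of real pipeline stages.

```python
import math
from typing import Any, Callable, Iterable


def outputs_match(expected: Any, actual: Any, rel_tol: float = 1e-6, abs_tol: float = 1e-8) -> bool:
    """Exact equality for non-floats, tolerance-based equality for floats, recursive for sequences."""
    if isinstance(expected, float) or isinstance(actual, float):
        both_numeric = isinstance(expected, (int, float)) and isinstance(actual, (int, float))
        return both_numeric and math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            outputs_match(e, a, rel_tol, abs_tol) for e, a in zip(expected, actual)
        )
    return expected == actual


def assert_parity(reference: Callable, candidate: Callable, inputs: Iterable) -> None:
    """Raise AssertionError on the first input where candidate diverges from reference."""
    for item in inputs:
        expected, actual = reference(item), candidate(item)
        if not outputs_match(expected, actual):
            raise AssertionError(f"parity broken for {item!r}: {expected!r} != {actual!r}")


# Stand-in stages: an inline "notebook" version and a "module" version of the same step.
assert_parity(str.lower, lambda s: s.casefold(), ["A Man", "BEACH"])
```

Strings and token ids are compared exactly, while floating-point outputs are compared within a tolerance, mirroring the byte-identical versus `tf.allclose`-identical distinction described above.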
311
 
312
  ---
313
 
314
+ ## Key engineering improvements
315
 
316
+ This is what separates this repository from a notebook conversion:
317
 
318
+ - **Modular package** with the `src/` layout: every test exercises the *installed* package the same way users will.
319
+ - **Strict Pydantic v2 configuration:** typed, validated, env-overridable, refuses unknown keys.
320
+ - **`CaptionTokenizer` wrapper:** stable interface for the model and inference; a later phase can swap it for HuggingFace `tokenizers` without touching the encoder, decoder, or generation loop.
321
+ - **Singleton-friendly inference:** `CaptionPredictor.from_artifacts(...)` + `warmup()` are designed for FastAPI lifespans, not just CLI calls.
322
+ - **Shared train/serve preprocessing:** the same `preprocess_image_tensor` runs in `tf.data` pipelines and at inference time, so image preprocessing cannot drift between training and serving.
323
+ - **Reproducibility:** seeded sampling, seeded splits, seeded RNGs (`utils.seed.set_global_seed`), pinned `tensorflow-cpu==2.15.0` (TF 2.16+ ships Keras 3 by default, which silently broke `TextVectorization` save/load for this project).
324
+ - **Notebook freeze:** IEEE artefact protected by a SHA-256 check; published BLEU stays reproducible across the project's lifetime.
325
+ - **Optional dependency groups** (`[hf]`, `[eval]`, `[mlflow]`, `[dev]`): the production image stays lean; HF baselines and metric tooling are opt-in extras.
326
+ - **Decoupled experiment artefacts:** model weights live in HuggingFace Hub (planned), MLflow tracking on DagsHub free tier (planned). Git stays small.
327
+ - **Structured logging:** `structlog` emits JSON in production, pretty colourised logs in dev, switched by `APP_ENV`.
328
+ - **No silent rewrites:** every notebook → module move is documented with a cell mapping in [`docs/PHASE_1_NOTES.md`](docs/PHASE_1_NOTES.md); behavioural quirks (e.g. `compute_loss_and_acc` ignoring its `training` argument) are preserved verbatim with code comments referencing the doc.
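As an illustration of the seeded-split idea from the reproducibility bullet, the sketch below produces the same train/validation partition on every run. It is not the repository's `data/splits.py`; `seeded_split` is a hypothetical helper, and the defaults of 0.8 and 42 mirror `data.train_val_split` and `train.seed` in `configs/base.yaml`.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def seeded_split(
    items: Sequence[T],
    train_fraction: float = 0.8,
    seed: int = 42,
) -> tuple[list[T], list[T]]:
    """Shuffle with a private RNG, then cut into train and validation parts."""
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


captions = [f"caption-{i}" for i in range(100)]
train, val = seeded_split(captions)
```

Because the generator is seeded and local to the call, repeated invocations with the same inputs return identical partitions, which keeps validation metrics comparable across runs.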
329
+
330
+ ---
331
 
332
+ ## Limitations
333
 
334
+ - The model produces generic captions on cluttered or rare-object scenes, a known limitation of the IEEE-era architecture. Phase 3 addresses this by adding modern foundation-model baselines (BLIP, ViT-GPT2, GIT) for side-by-side comparison.
335
+ - Greedy decoding only; beam search is a Phase 1b addition.
336
+ - Validation pipeline includes a leftover `shuffle()` from the notebook (functionally harmless, removed in Phase 1b).
337
+ - BLEU is the only metric in v1; CIDEr / METEOR / ROUGE-L slot into the same runner interface in Phase 1b.
338
+
339
+ These are explicitly tracked rather than hidden; full list in [`docs/PHASE_1_NOTES.md` § Technical debt](docs/PHASE_1_NOTES.md#technical-debt-remaining).
340
 
341
  ---
342
 
343
+ ## Roadmap
344
+
345
+ - **Phase 1b:** beam search, CIDEr / METEOR / ROUGE-L, masked accuracy parity-fix, label smoothing, warmup + cosine LR schedule.
346
+ - **Phase 2:** FastAPI backend, HuggingFace Hub model upload, HuggingFace Spaces deploy, Vercel-hosted frontend (Next.js 14), GitHub Actions CI/CD.
347
+ - **Phase 3:** Tier-1 multimodal upgrades: BLIP-base / ViT-GPT2 / GIT-base-coco side-by-side comparison demo with per-model BLEU + latency.
348
+ - **Phase 4:** Sentry, Prometheus, DagsHub-hosted MLflow link, Architecture Decision Records (`docs/adr/`).
349
+ - **Future work:** ViT + Transformer fine-tune on COCO; VLM API integration (Anthropic Claude vision) behind a feature flag; VQA endpoint.
350
 
351
+ Detailed plan: [`docs/restructure-plan.md`](docs/restructure-plan.md).
352
 
353
  ---
354
 
355
+ ## Citation
356
 
357
+ If you reference this work in academic writing, please cite the IEEE paper:
358
+
359
+ ```bibtex
360
+ @inproceedings{ainarratives,
361
+ title = {AI Narratives: Bridging Visual Content and Linguistic Expression},
362
+ booktitle = {Proceedings of the IEEE Conference},
363
+ publisher = {IEEE},
364
+ year = {2024},
365
+ url = {https://ieeexplore.ieee.org/document/10675203},
366
+ }
367
+ ```
368
+
369
+ ---
370
+
371
+ ## Acknowledgements
372
+
373
+ - The model architecture, hyperparameters, and BLEU baseline are from the IEEE-published paper *AI Narratives: Bridging Visual Content and Linguistic Expression*.
374
+ - COCO 2017 captions provided by the [Microsoft COCO project](https://cocodataset.org/).
375
+ - TensorFlow / Keras for the model layers; Pydantic for the configuration system; sacrebleu for evaluation; Ruff, mypy, and pytest for tooling.
376
 
377
  ---
378
 
379
+ ## License
380
 
381
+ Released under the [MIT License](LICENSE). The IEEE paper itself is published under separate terms.
 
382
 
383
  ---
384
 
385
+ ## Author
386
+
387
+ **Apoorv Raj**, AI / ML systems engineer.
388
+ Repository structured by phase; contributions and issues welcome.