Spaces:

adisaljusi
/

forkcast

Sleeping

App Files Files Community

forkcast / docs /documentation.md

adisaljusi

docs: add dataset bias analysis (gender x class) to section 5.1

5ee9f8b 9 days ago

preview code

raw

history blame contribute delete

19.2 kB

	# ForkCast — Project Documentation

	## Project Metadata

	\| Field \| Value \|
	\|---\|---\|
	\| Project title \| ForkCast — AI Nutrition Coach \|
	\| Student name \| Adis Aljusi \|
	\| GitHub repository URL \| _added on submission_ \|
	\| Deployment URL \| _added once the Space is live_ \|
	\| Submission date \| 07 June 2026 \|

	Mandatory setup checks
	- [x] At least 2 blocks selected (Computer Vision + ML Numeric + NLP — all three)
	- [x] Multiple and different data sources used (UCI Obesity, Food-101 / `nateraw/food`, USDA-derived nutrition lookup, hand-curated WHO/EFSA/USDA/NHS/Harvard guideline corpus)
	- [x] Deployment URL provided (Hugging Face Space, Gradio SDK)
	- [x] Required GitHub users added (`jasminh`, `bkuehnis`)

	---

	## 1. Project Foundation

	### 1.1 Problem Definition

	- Problem statement. Casual meal-tracking apps stop at logging calories. They rarely place that single meal in the context of the user's broader profile or explain the resulting risk in terms a non-expert can act on.
	- Goal. Take a meal photo plus a short profile and produce three things: (a) an estimated nutrition vector for the meal, (b) a predicted obesity-risk band and BMI for the user, and (c) a plain-language coaching explanation grounded in dietary guidelines.
	- Success criteria.
	- The CV block reaches sensible top-1 / top-5 coverage on Food-101 dishes and provides a usable nutrition vector even on uncertain predictions.
	- The numeric block achieves macro-F1 ≥ 0.85 on the held-out test fold and BMI MAE ≤ 3.0.
	- The RAG coach retrieves a relevant guideline chunk on ≥ 75 % of evaluation queries and produces faithful, citation-anchored prose.
	- End-to-end pipeline runs in < 5 s on CPU per request after warm-up.

	### 1.2 Integration Logic

	```
	[Meal photo] -- CV --> nutrition vector ----+
	\|
	v
	HighCaloricMeal? --> overrides FAVC
	\|
	v
	[Profile + habits form] ----------- Numeric block (BMI regressor + obesity classifier)
	\|
	v
	obesity_class, probability, BMI, kcal target
	\|
	v
	NLP RAG coach (TF-IDF / Dense + Template / OpenAI)
	\|
	v
	grounded plain-language explanation
	```

	Three concrete integration points, each more than co-execution:

	1. CV → Numeric (derived feature). When the uploaded meal exceeds the high-caloric threshold (≥ 700 kcal and ≥ 30 g fat), the numeric model's `FAVC_yes` / `FAVC_no` columns are flipped before scoring. The CV signal overrides the user's self-reported "frequent high-caloric food" answer. Implemented in `src/numeric/obesity.py::apply_favc_override` and consumed by `src/numeric/model.py::predict`.
	2. Numeric → NLP (model outputs as grounding). The obesity class, per-class probabilities, predicted BMI, and FAVC-override flag become structured context in the RAG prompt. The coach's explanation is anchored in the user's actual prediction, not a generic statement.
	3. CV → NLP (shared data). Meal nutrition is forwarded to the RAG query and the prompt body so the coach can refer to the specific dish (e.g. "this 850-kcal lasagna").

	---

	## 2A. ML Numeric Data

	### 2A.1 Data Source(s)

	\| Source \| Type \| Size \| Role \|
	\|---\|---\|---\|---\|
	\| `aiml2021/obesity` (Hugging Face) \| Tabular CSV → parquet \| 2,111 rows, 17 columns \| Training and evaluation of the obesity classifier and BMI regressor. Same as the UCI ML Repository "Estimation of Obesity Levels Based on Eating Habits and Physical Condition" dataset, redistributed under the original CC-BY licence. \|
	\| Derived CV nutrition (online) \| Float vector \| 1 per request \| Drives the `HighCaloricMeal` override at inference. Not in the training set. \|

	### 2A.2 Preprocessing and Features

	- Categorical columns (`Gender`, `family_history_with_overweight`, `FAVC`, `CAEC`, `SMOKE`, `SCC`, `CALC`, `MTRANS`) are one-hot encoded with `pd.get_dummies(drop_first=False)`. Keeping all dummies costs a few columns but makes the FAVC override unambiguous at inference.
	- Numeric columns (`Age`, `Height`, `Weight`, `FCVC`, `NCP`, `CH2O`, `FAF`, `TUE`) are kept as floats and scaled with `StandardScaler` inside the sklearn pipelines.
	- Derived target: `BMI = Weight / Height²` is computed once and stored for the regression head.
	- The 7-class `NObeyesdad` label is encoded with `LabelEncoder` so it can be consumed by XGBoost.
	- 80 / 20 stratified train / test split on the `NObeyesdad` class to preserve the (slightly imbalanced) seven-class distribution.

	### 2A.3 Model Selection

	- Regression head — BMI. Ridge baseline (`StandardScaler` + L2 with α = 1.0) chosen for interpretability and robust behaviour on small tabular data. Challenger: `XGBRegressor` (400 trees, depth 5, lr 0.05) — captures non-linear interactions between activity, water intake, and family history.
	- Classification head — Obesity level. Multinomial `LogisticRegression` (max_iter 2000) baseline. Challenger: `XGBClassifier` (same hyperparameters as the regressor, `mlogloss` objective). Multinomial logit is the standard reference for multi-class tabular benchmarks; XGB tends to win when interactions matter.

	### 2A.4 Model Comparison

	\| Iteration \| Objective \| Change \| Test metric \| Notes \|
	\|---\|---\|---\|---\|---\|
	\| v0 \| Sanity check \| Empty fallback when artefacts absent \| Returns BMI from H+W and `Normal_Weight` \| Drives integration tests \|
	\| v1 \| Regression — BMI \| Ridge baseline \| MAE / R² recorded \| `train.py` head-to-head \|
	\| v2 \| Regression — BMI \| XGBRegressor \| MAE / R² recorded \| Compared against v1 \|
	\| v3 \| Classification — Level \| LogisticRegression \| Accuracy / Macro-F1 recorded \| `train.py` head-to-head \|
	\| v4 \| Classification — Level \| XGBClassifier \| Accuracy / Macro-F1 recorded \| Compared against v3 \|
	\| v5 \| Integration \| FAVC override from CV nutrition \| Inference-only — see 2A.6 \| Manual sanity check in `tests/test_numeric.py` \|

	Per-run metrics are written to `models/numeric_metadata.json` (regenerated by `python -m src.numeric.train`). The full per-class breakdown lives in `docs/numeric_evaluation.md`.

	### 2A.5 Evaluation and Error Analysis

	- Regression. MAE on held-out BMI, R², residual plot, residual-vs-prediction plot, and per-class residual MAE.
	- Classification. Accuracy, macro-F1, per-class precision/recall/F1, confusion matrix, ROC curves (one-vs-rest), calibration on the dominant `Normal_Weight` class.
	- Error patterns. The hardest classes are the two adjacent overweight bands — they share most habit features and are differentiated mainly by Weight. The classifier respects this: most errors are between neighbouring classes rather than across the spectrum. Feature importance (XGB) puts Weight, Height, and family history at the top; FAVC enters the top-10 once the CV override is active.

	### 2A.6 Integration

	- Inputs from CV. Meal nutrition vector → `derive_high_caloric_meal()` → boolean → overrides `FAVC_yes` / `FAVC_no` columns on the inference row.
	- Outputs to NLP. Obesity class, full per-class probability vector, predicted BMI, daily-kcal target (Mifflin-St Jeor, computed analytically — not learned), and a flag indicating whether the FAVC override was applied. All five fields land in the RAG prompt.

	---

	## 2B. NLP

	### 2B.1 Data Source(s)

	\| Source \| Type \| Size \| Role \|
	\|---\|---\|---\|---\|
	\| Hand-curated guideline corpus \| Text chunks paraphrased from WHO, EFSA, USDA, NHS, Harvard Nutrition Source \| 12 chunks \| RAG retrieval target (`src/nlp/corpus.py`) \|
	\| Held-out evaluation query suite \| Question + ground-truth relevant chunk IDs \| 12 queries (keyword / synonym / multi-topic) \| Retriever evaluation (`src/nlp/eval_queries.py`) \|
	\| Live OpenAI Chat Completions \| LLM API \| Per request, on deployment \| Generator option (gpt-4o-mini) \|

	### 2B.2 Preprocessing and Prompt Design

	- TF-IDF: bigrams + English stopword removal (`sklearn.feature_extraction.text.TfidfVectorizer`).
	- Dense: `sentence-transformers/all-MiniLM-L6-v2` with L2-normalised embeddings + cosine similarity.
	- The retrieval query is assembled from the meal + the numeric prediction: dish name, calories, predicted obesity class, predicted BMI, plus a "high-caloric meal flagged from photo" tag when the FAVC override fired. This is what threads the three blocks together at retrieval time.
	- The OpenAI prompt is split into a system message (role + grounding rules + citation format) and a user message containing nutrition, prediction, and the retrieved chunks. `temperature=0.4` keeps the prose coherent without making it deterministic.

	### 2B.3 Approach Selection

	Retrieval-augmented generation. The corpus is small enough (12 chunks) that brute-force retrieval is fine; the interesting comparison is sparse vs dense and template vs LLM generator.

	### 2B.4 Comparison and Iterations

	\| Iteration \| Objective \| Change \| Qualitative check \|
	\|---\|---\|---\|---\|
	\| v0 \| Plumb the pipeline \| Single chunk, hand-written explanation \| Confirms RAG plumbing \|
	\| v1 \| Retriever — sparse \| `TfidfRetriever` over 12-chunk corpus \| Good keyword recall \|
	\| v2 \| Retriever — dense \| `DenseRetriever` (`all-MiniLM-L6-v2`) \| Better on synonym queries \|
	\| v3 \| Generator — template \| Deterministic, rule-based prose \| Faithful by construction \|
	\| v4 \| Generator — OpenAI \| `gpt-4o-mini` with grounded prompt \| More natural, harder to verify — citation rule enforced in system prompt \|

	### 2B.5 Evaluation and Error Analysis

	- Retrieval (quantitative). `src/nlp/evaluate.py` runs each retriever over the 12-query suite and reports Recall@1, Recall@3, MRR, and a per-query-type breakdown.
	- Generator (qualitative). `docs/nlp_evaluation.md` contains a side-by-side of the template and OpenAI outputs on a sample of evaluation queries.
	- Failure cases identified. TF-IDF fails on "How much salt is too much?" (no token overlap with "sodium") and on short multi-topic queries; the dense retriever recovers both. The OpenAI generator can occasionally restate the rules verbosely — the system prompt's "max 4 short paragraphs" instruction keeps it bounded.

	### 2B.6 Integration

	- Inputs from Numeric. Obesity class + probabilities + predicted BMI + FAVC-override flag.
	- Inputs from CV. Nutrition vector (dish name, kcal, macros, sodium).
	- Outputs. Plain-language explanation citing the retrieved guideline sources by name (e.g. "[WHO Sodium Guideline 2012]").

	---

	## 2C. Computer Vision

	### 2C.1 Data Source(s)

	\| Source \| Type \| Size \| Role \|
	\|---\|---\|---\|---\|
	\| `nateraw/food` (Hugging Face) \| ViT-B/16 checkpoint, pretrained on Food-101 \| 86 M params \| Branch A — closed-vocabulary dish classifier \|
	\| OpenAI Vision (`gpt-4o`) \| Multimodal LLM, hosted \| n/a \| Branch B — open-vocabulary nutrition estimator \|
	\| OpenAI Chat (`gpt-4o-mini`) \| LLM, hosted \| n/a \| LLM-as-judge — picks between or merges the two branches \|
	\| Internal nutrition lookup \| Hand-curated dict \| 101 dishes \| Maps Food-101 class → kcal / protein / fat / carbs / sodium (`src/cv/nutrition_lookup.py`) \|
	\| Food-101 validation split \| Image labels \| 25,250 images \| Top-1 / Top-5 evaluation on deployment \|

	### 2C.2 Preprocessing and Augmentation

	- `AutoImageProcessor` for the ViT handles resize and ImageNet normalisation.
	- RGB conversion at input (guards against grayscale and RGBA uploads).
	- For the OpenAI vision branch: thumbnail to ≤ 1024 px on the longest edge, PNG-encode, base64-embed in the chat-completions request.
	- No augmentation at inference; the pretrained model already saw Food-101's training augmentations.

	### 2C.3 Model Selection

	The CV block is an ensemble of two complementary vision approaches with an LLM tie-breaker:

	- Branch A — Food-101 ViT (`nateraw/food`). Closed vocabulary (101 dishes). Fast (~1 s), free, deterministic. Strong on in-distribution Western restaurant dishes. Weak on plates that don't fit any of the 101 classes — by construction.
	- Branch B — OpenAI Vision (`gpt-4o`). Open vocabulary. Returns a structured nutrition vector directly from the image. Strong on arbitrary plates (home-cooked meals, side dishes, international cuisines). Costs ~$0.005 per image; requires `OPENAI_API_KEY`.
	- Judge — `gpt-4o-mini` (`src/cv/judge.py`). When both branches run, the judge sees the image plus both candidate nutrition vectors and either picks the more plausible one or proposes a merged vector. Returns the verdict with a short rationale.

	If only one branch is available (no API key, network failure), the block returns that branch's output and skips the judge. If neither is available, the lookup-table fallback keeps the downstream pipeline running.

	### 2C.4 Model Comparison

	\| Iteration \| Approach \| Strength \| Weakness \|
	\|---\|---\|---\|---\|
	\| v0 \| Lookup-table sanity check \| 101 / 101 dish coverage, macro-kcal mean discrepancy 2.6 % \| Approximation, not measurement \|
	\| v1 \| Food-101 ViT classifier + class→nutrition lookup \| Fast, free, deterministic; well-known checkpoint \| Closed 101-class vocabulary mis-routes OOD meals \|
	\| v2 \| OpenAI Vision (`gpt-4o`) direct regression \| Open vocabulary, returns structured nutrition for any image; explains what it sees \| API-dependent (~$0.005/request); no offline mode \|
	\| v3 \| LLM-as-judge ensemble (`gpt-4o-mini`) \| Routes each request to the right branch; provides a human-readable justification \| Adds ~$0.001 + ~1 s per request \|

	### 2C.5 Evaluation and Error Analysis

	- Quantitative (Branch A). Top-1 / Top-5 accuracy on the Food-101 validation split — recorded in `04_cv_evaluation.ipynb`, runs once on the Space where the validation dataset is reachable.
	- Quantitative (Branch B). No fixed benchmark; the open-vocab branch is evaluated qualitatively across an in-distribution set (pizza, lasagna, sushi) and an out-of-distribution set (mixed home-cooked plates) — examples in `docs/cv_evaluation.md`.
	- Ensemble behaviour. For each evaluation image we record: which branch won, the judge's rationale, the disagreement in kcal between the two branches. The expected pattern: ViT wins on in-distribution dishes (high confidence + low kcal disagreement); OpenAI wins on OOD plates (ViT confidence < 0.4 + large kcal disagreement).
	- Failure modes. Branch A: visually-similar dishes cluster in the confusion matrix (e.g. lasagna vs cannelloni). Branch B: occasionally over-estimates portion size; mitigated by the conservative-portion instruction in the prompt. Judge: very rarely picks the wrong branch; usually a higher-temperature symptom, mitigated by `temperature=0.1`.
	- Lookup-table consistency check. Macro-derived kcal vs stated kcal within ±10 % for 98 % of dishes (`docs/cv_evaluation.md`).

	### 2C.6 Integration

	- Outputs to Numeric: winning candidate's `calories_kcal` + `fat_g` → `derive_high_caloric_meal` → overrides `FAVC` feature before classification (`src/numeric/obesity.py::apply_favc_override`).
	- Outputs to NLP: winning candidate's dish name + nutrition vector → RAG retrieval query + grounding context in the generator prompt. The judge's rationale is also passed through the UI so users can see why a particular candidate was chosen.

	---

	## 3. Deployment

	- URL. _Added once the Space is live (Hugging Face Spaces, Gradio SDK)._
	- Main user flow.
	1. User uploads a meal photo and fills the profile / habits form.
	2. CV identifies the dish and estimates nutrition.
	3. Numeric block predicts obesity class + BMI (with FAVC overridden when the meal is high-caloric).
	4. RAG coach explains the prediction with citations.
	- Screenshots. Placed under `docs/screenshots/` once the deployment is live.

	---

	## 4. Execution Instructions

	```bash
	# 1. Clone & create env
	git clone <repo>
	cd forkcast
	python -m venv .venv && source .venv/bin/activate
	pip install -r requirements.txt

	# 2. Train the numeric models (downloads the UCI dataset; <1 min)
	PYTHONPATH=. python -m src.numeric.train

	# 3. (Optional) Re-execute notebooks to regenerate figures + reports
	jupyter nbconvert --to notebook --execute --inplace notebooks/*.ipynb

	# 4. Run tests
	pytest tests/

	# 5. Launch the Gradio app
	python app.py
	```

	Reproducibility notes:
	- Train/test split seeded (`SEED = 42` in `src/numeric/train.py`).
	- All trained artefacts land in `models/`; metadata is the only one committed.
	- For the OpenAI generator, set `OPENAI_API_KEY` (e.g. via `.env` + `python-dotenv`). Without it, the RAG coach falls back to the template generator.

	---

	## 5. Optional Bonus Evidence

	- [x] All three blocks used (numeric + NLP + CV) and meaningfully integrated.
	- [x] Extended evaluation: per-class metrics, ROC curves, calibration plot, retriever Recall@k + MRR per query type.
	- [x] Ethical considerations: the app explicitly disclaims medical advice; the OpenAI prompt enforces source citation; the CV → FAVC override is documented as a design choice rather than hidden.
	- [x] Dataset bias analysis — see § 5.1 below.
	- [ ] Real-time logging / monitoring (out of scope for this submission).

	### 5.1 Dataset bias — gender × class distribution

	The UCI Obesity training distribution has a stark gender split in the two highest-severity classes:

	\| Class \| Female \| Male \|
	\|---\|---\|---\|
	\| Insufficient_Weight \| 173 \| 99 \|
	\| Normal_Weight \| 141 \| 146 \|
	\| Overweight_Level_I \| 145 \| 145 \|
	\| Overweight_Level_II \| 103 \| 187 \|
	\| Obesity_Type_I \| 156 \| 195 \|
	\| Obesity_Type_II \| 2 \| 295 \|
	\| Obesity_Type_III \| 323 \| 1 \|

	`Obesity_Type_II` is 99.3 % male and `Obesity_Type_III` is 99.7 % female in the training set — the classes are effectively gender-segregated at the top. The model has correctly learned this correlation, which means an otherwise-identical profile predicts a different obesity band purely from the `Gender` field. Side-by-side at BMI 40.8 (125 kg / 175 cm, same habits):

	- `Gender = Male` → `Obesity_Type_II` (p ≈ 0.99)
	- `Gender = Female` → `Obesity_Type_III` (p ≈ 0.51)

	This is honest model behaviour given the data, but it is a fairness concern that a clinical deployment would need to mitigate. Three concrete options:
	1. Rebalance the training set — SMOTE/over-sample the under-represented gender within each high-severity class.
	2. Drop `Gender` from the classifier features — keep it only in the regression / TDEE path so it doesn't shift the categorical decision.
	3. Post-process with WHO BMI bands — override the classifier's verdict with the standard WHO mapping when the prediction disagrees with `predicted_bmi`.

	The regression head (`predicted_bmi`) is unaffected by this issue — it learns a continuous function of Weight and Height that is not conditioned on the label correlation.

	---

	## Appendix — Required collaborators

	- Jasmin Heierli — `@jasminh`
	- Benjamin Kühnis — `@bkuehnis`