Spaces:

Gianone
/

smartplate

Sleeping

Primary block 1: Computer Vision (Vision Transformer on Food-101)
Primary block 2: NLP / RAG (LLM with retrieval over WHO/DGE/Harvard guidelines)

Third block (extra work, graded separately): ML Numeric Data (health classification from nutrient features) — see Section 5 (Bonus Evidence) and the dedicated block section 2A.

1. Project Foundation (Short)

1.1 Problem Definition

Problem statement: Health-conscious users want quick nutritional analysis of meals, but existing diet-tracking apps require manual logging of every meal — a friction point that explains their high abandonment rate. Nutritional information is rarely available at the point of consumption (e.g. when eating out), and even when labels exist, interpreting them in the context of personal health goals requires expert knowledge.
Goal: Build a multi-modal AI nutrition assistant where the user photographs a meal and receives: (1) a food-class detection, (2) per-100g nutritional values + a health classification, and (3) a personalised, evidence-based explanation grounded in authoritative nutrition guidelines (WHO, DGE, Harvard).
Success criteria:
- CV model: ≥90% accuracy on a 20-class held-out validation set
- ML model: classifier successfully distinguishes the three health tiers (healthy/medium/unhealthy) with macro-F1 ≥0.85
- RAG: answers cite at least one retrieved source per response, with no factual hallucinations on test questions
- End-to-end: deployed app responds to a real food photo with all three outputs in under 30 seconds (steady-state)

1.2 Integration Logic

How the selected blocks interact: The blocks form a strict pipeline. The Computer Vision block classifies the input image into one of 20 food classes. This class is the key that drives the ML Numeric block's nutrient lookup and health-tier classification. The food class + nutrient values + health tier are then injected as context into the NLP/RAG block's prompt, which retrieves the most relevant chunks from the WHO/DGE/Harvard knowledge base and generates a personalised response. No block can be removed without breaking the others.
Data and output flow between blocks:

[User uploads photo]
       ↓
[CV: ViT predicts food_class + confidence]
       ↓ food_class="pizza"
[ML: lookup nutrition + predict health_label]
       ↓ {kcal=266, fat=10g, ..., health_label="unhealthy"}
[NLP: retrieve top-3 chunks → build prompt → LLM generates answer]
       ↓ {answer, sources}
[Gradio UI: display all three outputs side by side]

See src/pipeline.py for the orchestration logic.

2. Block Documentation

2A. ML Numeric Data

2A.1 Data Source(s)

Entry	Source name or link	Type	Size	Role in this block
1	USDA FoodData Central (curated reference values, see `notebooks/03_ml_health_classifier.ipynb`)	Structured nutrient values	20 classes × 8 nutrients	Source of ground-truth nutrition per food class
2	Augmented dataset (USDA values + ±15% Gaussian noise)	Tabular numeric	1000 rows × 16 features	Training data for the health classifier
3	Open Food Facts API (initial attempt — see honest discussion in 2A.2)	Crowdsourced structured	Variable	Originally planned source; abandoned after 503 errors and data-quality issues

2A.2 Preprocessing and Features

Cleaning steps:
- Original plan: Open Food Facts API for all 20 classes. Result: 8 of 20 classes returned 503 Server Errors during scraping, and the 12 successful classes had nonsensical medians (every class ~51 kcal/100g) because OFF's full-text search returned irrelevant products (e.g. "caesar salad" matched "mineral water" and "milk").
- Pivot: USDA reference values per 100g for all 20 classes, sourced manually from USDA FoodData Central and the Swiss nutrition database. Eight nutrients per class: kcal, fat, saturated_fat, carbs, sugar, fiber, protein, salt.
Preprocessing steps:
- StandardScaler normalisation (mean 0, std 1) on training data, persisted in the model bundle
- LabelEncoder for the 3-class target (healthy/medium/unhealthy), with explicit ordering
- Stratified 80/20 train/test split (seed 42)
Feature engineering and selection: 16 features total — see notebooks/03_ml_health_classifier.ipynb
- 8 raw nutrients: kcal, fat, sat_fat, carbs, sugar, fiber, protein, salt
- 5 derived ratios: sugar_to_carb_ratio, sat_fat_pct_of_fat, calorie_density, protein_to_kcal, fiber_to_carb_ratio
- 3 WHO-threshold binary flags: high_sugar (>15g), high_salt (>1.5g), high_sat_fat (>5g)
- Augmentation: 50 samples per class generated with ±15% Gaussian noise around USDA medians → 1000 training rows

2A.3 Model Selection

Models tested:
- Iteration 1: Logistic Regression (multinomial, L2 regularisation, C=1.0)
- Iteration 2: XGBoost Classifier (200 trees, max_depth=5, learning_rate=0.1)
Why these models were chosen:
- Logistic Regression as the linear baseline — fast, interpretable (coefficients per class), proven on tabular problems
- XGBoost as the non-linear comparison — typical default for structured/tabular data, captures feature interactions

2A.4 Model Comparison and Iterations

Iteration	Objective	Key changes	Models used	Main metric	Change vs previous
1	Establish linear baseline	StandardScaler + LR (C=1.0, multinomial)	LogisticRegression	Test Acc: 100.0%, F1-macro: 100.0%, CV: 99.88% ± 0.25%	Baseline
2	Test if non-linear model improves	XGBoost (200 trees, depth=5)	XGBClassifier	Test Acc: 99.5%, F1-macro: 99.55%, CV: 99.88% ± 0.25%	-0.5 pp vs Iter 1
3	N/A — winner already at 100% in Iter 1, further iteration would overfit	—	—	—	—

Final choice: Logistic Regression. This is counter-intuitive (XGBoost is the usual tabular default), but justified because: (a) higher test accuracy on this data, (b) faster inference, (c) interpretable coefficients, (d) simpler deployment with one less dependency. See artefacts at assets/screenshots/ml/model_comparison.csv.

2A.5 Evaluation and Error Analysis

Metrics used: Test accuracy, macro-F1, 5-fold stratified cross-validation, per-class precision/recall/F1, confusion matrix
Final results:
- Logistic Regression test accuracy: 100.0%
- Per-class F1: healthy 100%, medium 100%, unhealthy 100%
- Cross-validation: 99.88% ± 0.25%
- See assets/screenshots/ml/ml_confusion_matrix.png and assets/screenshots/ml/feature_importance.png
Error patterns and likely causes: The 100% accuracy reflects that our augmented data is linearly separable by class — a consequence of the augmentation methodology (±15% noise around well-defined USDA medians with no class overlap). This is honest evidence that:
- The classifier is reliable within its training distribution
- A real-world deployment would face fuzzier boundaries (deep-dish pizza with extra cheese vs. thin margherita, portion-size variation, hand-prepared dishes diverging from USDA medians)
- The model would not necessarily achieve 100% on a different data distribution

2A.6 Integration with Other Block(s)

Inputs received from other block(s): A single string food_class from the CV block (e.g. "pizza")
Outputs provided to other block(s): A dictionary with nutrition (8 raw values), health_label (string: healthy/medium/unhealthy), and probabilities (per-class confidence). See src/ml_model.py.

2B. NLP

2B.1 Data Source(s)

Entry	Source name or link	Type	Size	Role in this block
1	WHO Healthy Diet Fact Sheet	PDF (13 pages, ~18,500 chars)	119 KB	Knowledge base — global policy guidelines
2	DGE — 10 Rules of Healthy Eating	PDF (8 pages, ~5,000 chars)	778 KB	Knowledge base — German nutrition society, layperson-oriented
3	Harvard Healthy Eating Plate	PDF (6 pages, ~5,300 chars)	358 KB	Knowledge base — research-oriented Anglo-American perspective

Total: 27 pages, ~29,000 characters. Three deliberately different perspectives produce richer retrieval results than a single source.

2B.2 Preprocessing and Prompt Design

Text preprocessing:
- PDF text extraction via pypdf
- Chunking with langchain_text_splitters.RecursiveCharacterTextSplitter
- chunk_size = 500 characters, chunk_overlap = 100 characters
- Result: ~60–90 chunks total → cached in models/rag_chunks.json
Prompt design or retrieval setup:
- Embeddings: sentence-transformers/all-MiniLM-L6-v2 (384-dim)
- Vector store: ChromaDB (in-memory, rebuilt from rag_chunks.json on startup)
- Top-k retrieval: 3 chunks per query
- LLM: OpenAI gpt-4o-mini
- Two prompt strategies compared (see 2B.4)

2B.3 Approach Selection

Approach used: Retrieval-Augmented Generation (RAG) with sentence-transformer embeddings + LLM. See src/nlp_rag.py.
Alternatives considered:
- Fine-tuning a smaller LLM on the guidelines → rejected: training cost vs. benefit unfavourable for 29k chars
- Few-shot prompting without retrieval → rejected: would not scale to add new guidelines later
- Classical NLP (tf-idf retrieval + extraction) → rejected: cannot generate natural responses
- RAG chosen for best balance of factual grounding, low cost, and conversational quality

2B.4 Comparison and Iterations

Iteration	Objective	Key changes	Model or prompt setup	Main metric or qualitative check	Change vs previous
1	Establish baseline	Plain "Context: ... Question: ... Answer briefly."	gpt-4o-mini, top-k=3, temp=0.3	Keyword coverage: 59.4%, source match: 87.5%, avg tokens: 356	Baseline
2	Test if structure improves quality	XML-tagged sources + strict grounding instruction ("only use sources, else say 'no info'")	gpt-4o-mini, top-k=3, temp=0.3	Keyword coverage: 36.5%, source match: 87.5%, avg tokens: 494	-22.9 pp coverage, +38% tokens
3	N/A — Strategy 1 dominates on all measurable axes	—	—	—	—

Final choice: Strategy 1 (Basic). The more sophisticated XML prompt's strict "only use sources" constraint caused the LLM to refuse to answer when source phrasings did not exactly match the question (e.g. it refused to discuss eggs because the WHO/DGE documents do not contain the literal word "eggs", although protein sources are clearly addressed). See assets/screenshots/rag/rag_strategy_comparison.csv and assets/screenshots/rag/rag_evaluation.csv.

Lesson worth keeping: Prompt structure ≠ prompt quality. Default-mode LLMs already use context effectively; over-restrictive grounding instructions can backfire.

2B.5 Evaluation and Error Analysis

Evaluation strategy: 8 hand-crafted test questions covering different nutrient topics (sugar, salt, fats, balanced meal, etc.), 2 questions in German, 6 in English. For each, measure: keyword coverage (fraction of expected keywords present in the answer), source match (whether the cited source matches the expected one), token count (efficiency).
Results: Strategy 1 wins on every axis. See full per-question results in assets/screenshots/rag/rag_evaluation.csv.
Error patterns and likely causes:
- Q6 "Should I drink milk?" → 0% source match for both strategies because WHO/DGE do not explicitly discuss dairy in the indexed PDFs. Harvard does but in a different chunk than what was retrieved. Mitigation idea: add re-ranking or query rewriting.
- Q8 "Are eggs healthy?" → Strategy 1 succeeded with high coverage; Strategy 2 refused because "eggs" is not a literal term in the source documents. Demonstrates the over-grounding failure mode.

2B.6 Integration with Other Block(s)

Inputs received from other block(s):
- From CV: food_class (string)
- From ML: kcal (float), health_label (string) → injected into the prompt as conditioning context
Outputs provided to other block(s): Final user-facing response — markdown answer + list of cited sources. This is the final output of the pipeline; no downstream block consumes it. See src/nlp_rag.py.

2C. Computer Vision

2C.1 Data Source(s)

Entry	Source name or link	Type	Size	Role in this block
1	Food-101 dataset (20-class subset)	RGB images (224×224)	15,000 train + 5,000 val	Training + validation data for ViT fine-tuning
2	google/vit-base-patch16-224	Pre-trained Vision Transformer	86M parameters	Base model (transfer learning starting point)

2C.2 Preprocessing and Augmentation

Image preprocessing:
- Resize to 224×224 (ViT input requirement)
- ImageNet normalisation: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
Augmentation strategy: Training-set only — random horizontal flip + colour jitter (brightness, contrast, saturation). Validation set: no augmentation. See notebooks/02_train_vit_cv.ipynb.

2C.3 Model Selection

Vision model(s) used: google/vit-base-patch16-224 Vision Transformer (pre-trained on ImageNet-21k)
Why these model(s) were chosen: ViT-Base is the standard transformer-based image classifier — comparable in accuracy to ResNet-50 but with attention-based interpretability and a clean Hugging Face integration. The 20-class subset is small enough that we can fully fine-tune in <20 minutes on a single T4 GPU, which fits the time budget of a semester project.

2C.4 Model Comparison and Iterations

Iteration	Objective	Key changes	Model(s) used	Main metric	Change vs previous
1	Head-only fine-tuning baseline	Freeze backbone, train only classification head (0.02% trainable params), 3 epochs, lr=2e-5	ViT-Base + linear head	Val Acc: 95.12%, F1-macro: 95.12%, training time: 9:16 min	Baseline
2	Full fine-tuning for max accuracy	All 86M params trainable, 3 epochs, lr=2e-5	ViT-Base full FT	Val Acc: 96.46%, F1-macro: 96.46%, training time: 15:17 min	+1.34 pp, +65% time
3	N/A — Iter 2 deployed; further epochs risk overfitting on 15k images	—	—	—	—

Final choice: Iteration 2 (full fine-tuning). The +1.34 pp gain in accuracy is worth the +6 min one-time training cost; inference time is identical (single forward pass). The trained model is published on Hugging Face Hub at Gianone/smartplate-vit-food. See assets/screenshots/cv/iteration_comparison.csv.

2C.5 Evaluation and Error Analysis

Metrics and/or visual checks: Validation accuracy, macro-F1, per-class confusion matrix, top-K error analysis. See assets/screenshots/cv/confusion_matrix.png.
Final results:
- Validation accuracy: 96.46%
- F1-macro: 96.46%
- Best-performing classes: edamame (~~99% F1), miso_soup (~~98%), french_fries (~98%) — visually distinctive
- Hardest pairs: sushi↔sashimi (most-confused pair), donuts↔pancakes (similar shape and texture). See assets/screenshots/cv/top_errors.csv.
Error patterns and limitations:
- Sushi↔sashimi confusion is acceptable downstream — both are "healthy" in the ML lookup, so the user experience is unaffected
- Donuts↔pancakes confusion is more consequential — donuts are "unhealthy", pancakes "medium". Mitigation: confidence score is exposed in the UI; low-confidence predictions (<70%) can be flagged
- Food-101 over-represents Western cuisine; classes from under-represented cuisines would likely have lower accuracy in a deployment

2C.6 Integration with Other Block(s)

Inputs received from other block(s): None — CV is the entry point of the pipeline
Outputs provided to other block(s): A dictionary {class, confidence, top_5} consumed by the ML block (for nutrient lookup) and the NLP block (as context in the prompt). See src/cv_model.py.

3. Deployment

Deployment URL: https://huggingface.co/spaces/Gianone/smartplate
Main user flow:
1. User uploads a food photo via the Gradio UI
2. (Optional) Types a question (e.g. "Can I eat this on a diet?")
3. Clicks Analyze
4. CV → ML → NLP run sequentially (~30 s on first run, ~5 s steady-state)
5. UI displays: detected food class + confidence, top-5 predictions, nutrition table per 100g, health category with probabilities, and the NLP-generated explanation with cited sources
Screenshot or short demo:

Local app screenshots (`assets/screenshots/app/`)	Live HF Space screenshots (`assets/screenshots/deployment/`)
, ,	, ,

All screenshots show the complete pipeline: image preview, CV prediction with confidence, nutrition table, health label with probabilities, and the LLM's evidence-grounded explanation with source citations.

Training/inference separation: Training code lives exclusively in the four Jupyter notebooks (notebooks/01_eda_food101.ipynb through 04_rag_setup.ipynb). Inference code lives in src/ and app.py. The trained ViT model is hosted on Hugging Face Hub (Gianone/smartplate-vit-food); the trained ML model is committed at models/health_classifier.pkl; RAG chunks at models/rag_chunks.json. No training code is ever executed in production.

Deployment friction documented honestly: The HF Spaces build failed four times before succeeding (Gradio version conflict, Python 3.13 audioop bug, huggingface_hub API change, Gradio API-schema parser bug fixed with a monkey-patch in app.py). All four are documented in the HF Space's git history.

4. Execution Instructions

Environment setup

git clone https://github.com/Gianone-byte/smartplate.git
cd smartplate
python3 -m venv .venv
source .venv/bin/activate      # macOS / Linux
# .venv\Scripts\activate       # Windows
pip install --upgrade pip
pip install -r requirements.txt

Data setup

The repository ships with all required artefacts:

ML model: models/health_classifier.pkl (4.6 KB)
RAG chunks: models/rag_chunks.json (45 KB)
Knowledge base PDFs: data/knowledge_base/

The CV model is downloaded automatically from Hugging Face Hub on first run (Gianone/smartplate-vit-food, ~340 MB).

Training command(s)

Optional — only needed to reproduce models from raw data. All four notebooks run end-to-end:

jupyter notebook notebooks/01_eda_food101.ipynb          # ~10 min, EDA
jupyter notebook notebooks/02_train_vit_cv.ipynb         # ~25 min, T4 GPU recommended
jupyter notebook notebooks/03_ml_health_classifier.ipynb # ~2 min, CPU
jupyter notebook notebooks/04_rag_setup.ipynb            # ~5 min, CPU + OpenAI API

Inference/run command(s)

cp .env.example .env
# → open .env and add OPENAI_API_KEY
python app.py
# → opens http://127.0.0.1:7860

Reproducibility notes

Random seeds: 42 everywhere (numpy, sklearn, transformers Trainer)
Pinned versions: scikit-learn==1.5.1 (matches training environment), huggingface_hub==0.24.7 (Gradio 4.32 compatibility)
Python: 3.9 supported locally; Python 3.10 required on HF Spaces (declared in README.md SDK metadata)
All 18 smoke tests in tests/test_pipeline.py should pass: python -m pytest tests/ -v

5. Optional Bonus Evidence

Third selected block implemented with strong quality — ML Numeric Data implemented as full block (USDA-derived dataset, feature engineering, 2-model comparison, evaluation) and integrated into the pipeline. See Section 2A.
More than two data sources used with clear added value — Six distinct sources used: Food-101 dataset (CV training), pre-trained ViT (CV transfer learning), USDA nutrition values (ML labels), augmented synthetic dataset (ML training), WHO + DGE + Harvard PDFs (3 separate RAG sources). Each has a justified role.
A core section is done exceptionally well — Section 2B (NLP/RAG) includes a quantitative prompt-strategy comparison with 8 test questions, three measured metrics, and a non-obvious finding (the simpler prompt outperforms the more structured one). The result is reported honestly with the failure-mode explanation.
Extended evaluation — Confusion matrices for both CV and ML blocks, feature importance plot for ML, per-question RAG evaluation CSV, per-class CV error analysis CSV. All committed under assets/screenshots/.
Ethics, bias, or fairness analysis — Dedicated discussion of health-advice liability, training-data bias (Food-101 over-represents Western cuisine), eating-disorder safety considerations, and data-privacy commitments. See in-line discussion at end of Section 5 below.
Creative or exceptional use case — see honest self-assessment: the use case is realistic and well-motivated but not unprecedented (similar apps exist commercially).

Ethics, bias and fairness — additional discussion

This is a project for academic use and is documented as such. A production deployment would require the following safeguards, which are out of scope for the prototype but are worth naming:

Health-advice liability. SmartPlate generates dietary suggestions. It is not a substitute for advice from a registered dietitian or doctor. A production version would need clear disclaimers and probably geographic gating (FDA, EU and Swissmedic each regulate health-claim advice differently).
Training-data bias. Food-101 over-represents Western cuisine (burgers, pizza, French pastries). The system will be less accurate on dishes from cuisines under-represented in Food-101. Per-class confidence scores expose this when accuracy drops, but only to users who know what to look for.
Eating-disorder safety. A nutrition app can be triggering for users with eating disorders. The production prompt was deliberately designed to avoid moralising language ("bad food", "forbidden", "you should not eat this") and to acknowledge the emotional side of food. A production version would need explicit content guardrails for known triggering phrases and a clear path to support resources.
Privacy. Photos are processed in-memory and not stored. Conversation history is not retained between sessions. The OpenAI API does log requests per their terms of service — this is documented in documentation.md but would need more prominence in a real deployment.
Reproducibility. Every artefact is either committed to git or referenced from Hugging Face Hub. The four notebooks reproduce all results from raw data with fixed seeds. No undocumented hyperparameters.