Commit · fd1ef54
Parent(s): 11b1420
Add app.py and setup files for Hugging Face Space
- README.md +194 -8
- app.py +251 -0
- config.yaml +73 -0
- requirements.txt +14 -0
README.md
CHANGED
@@ -1,12 +1,198 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo: green
-sdk: gradio
-sdk_version:
-app_file: app.py
+title: "Coffee Cup Points Estimator"
+emoji: "☕️"
+colorFrom: "brown"
+colorTo: "green"
+sdk: "gradio"
+sdk_version: "5.49.1"
+app_file: "app.py"
 pinned: false
 ---
 
-
+
+
+# Module3Project
+
+# Overview
+Predict coffee quality scores based on sensory attributes using a RandomForest model and an MLOps pipeline.
+This project demonstrates an end-to-end MLOps pipeline: data ingestion, preprocessing, model training, containerization, cloud deployment, and front-end integration.
+
+# Data
+For this project, we are using data on coffee quality found here:
+https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi
+
+The cleaned coffee dataset is publicly hosted on Google Cloud Storage for reproducibility.
+The preprocessing pipeline automatically downloads it via the data.url field in config.yaml.
+
+Cleaned data is hosted in Google Cloud Storage:
+https://storage.googleapis.com/coffee-quality-data/preprocessed_data.csv
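As a rough illustration of the download step described above, the following sketch picks the data source from the config and reads the CSV with the standard library only. The helper names (`resolve_data_source`, `fetch_csv_rows`) and the fallback order are assumptions made for illustration; scripts/preprocess.py is not part of this commit and may work differently.

```python
import csv
import io
import urllib.request
from typing import Any, Dict, List


def resolve_data_source(cfg: Dict[str, Any]) -> str:
    """Return data.url when it is set, otherwise fall back to data.local_path."""
    data_cfg = cfg.get("data") or {}
    url = (data_cfg.get("url") or "").strip()
    if url:
        return url
    # Hypothetical default, mirroring the local_path value in config.yaml.
    return data_cfg.get("local_path", "data/raw/raw_data.csv")


def fetch_csv_rows(source: str, timeout: float = 30.0) -> List[Dict[str, str]]:
    """Read a CSV from an http(s) URL or a local path into a list of dicts."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source, timeout=timeout) as resp:
            text = resp.read().decode("utf-8")
    else:
        with open(source, "r", encoding="utf-8", newline="") as f:
            text = f.read()
    return list(csv.DictReader(io.StringIO(text)))
```

With the config from this commit, `resolve_data_source` would return the GCS URL; clearing `data.url` would switch the sketch to the local file.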
+
+# Architecture
+Data → Cloud (GCS) → Preprocess (ColumnTransformer) → Train (RandomForest) → FastAPI → Gradio frontend
+
+## Frontend Architecture
+┌───────────────┐   ┌─────────────┐   ┌───────────────┐   ┌──────────────┐
+│  Kaggle Data  │ → │ GCS Bucket  │ → │ FastAPI (API) │ → │  Gradio UI   │
+└───────────────┘   └─────────────┘   └───────────────┘   └──────────────┘
+
+# Frontend
+The Gradio-based frontend is deployed at:
+
+# Cloud Deployment:
+The FastAPI container is deployed on Google Cloud Run at:
+Base URL:
+https://coffee-api-354131048216.us-central1.run.app
+
+Endpoints:
+
+- /health – Health check
+- /predict_named – POST endpoint for predictions
+- /docs – API documentation (Swagger)
+
+Example cURL:
+```
+curl -X POST "https://coffee-api-354131048216.us-central1.run.app/predict_named" \
+  -H "Content-Type: application/json" \
+  -d '{"rows":[{"Aroma":7.5,"Flavor":6.0,"Body":5.5,"Acidity":8.0,"Sweetness":9.0,"Balance":7.0,"Aftertaste":6.5,"Clean.Cup":9.0}]}'
+```
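For readers who prefer Python over cURL, the same request can be sketched with the standard library. This is a minimal illustration, not the project's client code: `build_predict_request` and `predict` are hypothetical helper names, and the only facts taken from the README are the base URL, the `/predict_named` path, and the `{"rows": [...]}` body shape.

```python
import json
import urllib.request
from typing import Any, Dict, List

BASE_URL = "https://coffee-api-354131048216.us-central1.run.app"


def build_predict_request(rows: List[Dict[str, Any]],
                          base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for /predict_named."""
    body = json.dumps({"rows": rows}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict_named",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(rows: List[Dict[str, Any]], timeout: float = 10.0) -> Dict[str, Any]:
    """Send the request and return the decoded JSON response."""
    req = build_predict_request(rows)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Same payload as the cURL example; nothing is sent until predict() is called.
sample_rows = [{
    "Aroma": 7.5, "Flavor": 6.0, "Body": 5.5, "Acidity": 8.0,
    "Sweetness": 9.0, "Balance": 7.0, "Aftertaste": 6.5, "Clean.Cup": 9.0,
}]
request = build_predict_request(sample_rows)
```

Calling `predict(sample_rows)` would perform the network call. Judging by how app.py reads the response, the returned JSON is expected to carry a `predictions` list.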
+
+# Setup:
+```
+python -m venv venv
+source venv/bin/activate # Windows: venv\Scripts\activate
+pip install --upgrade pip
+pip install -r requirements.txt
+```
+# Testing/running scripts
+To test preprocess.py:
+```
+python scripts/preprocess.py
+```
+Confirm all output files exist by running:
+```
+ls -l data/cleaned/X_train.csv data/cleaned/X_test.csv data/cleaned/y_train.csv data/cleaned/y_test.csv artifacts/preprocessor.joblib
+```
+We wrote a unit test script, tests/test_preprocessor.py. To run it:
+```
+pip install pytest
+pytest -q
+```
+
+To run the server, do a health check, and send a sample predict payload:
+```
+uvicorn app.server:app --reload --port 8000
+curl http://127.0.0.1:8000/health
+curl -X POST "http://127.0.0.1:8000/predict_named" \
+  -H "Content-Type: application/json" \
+  -d '{"rows":[ {"Aroma":7.5,"Flavor":6.0,"Number.of.Bags":1,"Category.One.Defects":0} ] }'
+```
+
+To train the model:
+```
+python scripts/train.py
+```
+Confirm that artifacts/model.joblib was created.
+
+To run the UI app, start the server and then run in the CLI:
+```
+python app/frontend.py
+# Enter 3 when prompted:
+# wandb: (1) Create a W&B account
+# wandb: (2) Use an existing W&B account
+# wandb: (3) Don't visualize my results
+# (the author's personal W&B login would be needed to upload results to the wandb website)
+
+```
+Open the printed link in a browser.
+
+# Model
+We used a RandomForestRegressor for the model. The test set is 20% of the dataset. With 100 estimators, the model reaches an R2 of 0.942 (94.2%).
+
+Weights & Biases (W&B) tracks model performance. The metrics can be found in wandb/run.../files/wandb-summary.json, in this form:
+```
+{
+  "_timestamp":1.763876781125257e+09,
+  "_wandb":{"runtime":2},
+  "_runtime":2,
+  "_step":0,
+  "R2":0.9424069488737763,
+  "RMSE":0.5528660703704987,
+  "MAE":0.31615526315789416,
+  "MAPE":0.39006294567905464
+}
+```
+These performance metrics are also stored in artifacts/metrics.json like this:
+```
+{
+  "R2": 0.9424069488737761,
+  "RMSE": 0.5528660703704994,
+  "MAE": 0.31615526315789455,
+  "MAPE": 0.39006294567905514
+}
+```
+An R2 of 0.942 indicates a very good fit: the model explains about 94% of the variance in cup points, so the score correlates strongly with the other columns. The RMSE of 0.55 is a small prediction error and reinforces the model's high performance. The MAE of 0.316 also shows a small error relative to the actual cup points. The logged MAPE of 0.39 needs care: read as a fraction it would mean a 39% average error and only medium accuracy, which could be due to the small dataset the model was trained on, but that reading is hard to reconcile with the MAE and RMSE. If cup points are scored on a roughly 0 to 100 scale, an MAE of 0.316 corresponds to an error of about 0.39%, so the value was most likely logged in percent.
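To make the relationship between these four metrics concrete, here is a small pure-Python sketch that computes them on made-up numbers. The sample values and the function name `regression_metrics` are illustrative assumptions, not project data or project code; the point is only to show how an MAE of well under one point turns into a sub-1% MAPE when targets are near 80.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Compute R2, RMSE, MAE and MAPE (as fraction and as percent)."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total sum of squares
    mape_fraction = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE_fraction": mape_fraction,
        "MAPE_percent": 100.0 * mape_fraction,
    }


# Made-up scores on a 0-100 style scale, each prediction off by half a point.
y_true = [80.0, 82.0, 84.0, 86.0]
y_pred = [80.5, 81.5, 84.5, 85.5]
metrics = regression_metrics(y_true, y_pred)
```

In this toy case the MAE is half a point while the percentage error stays well below 1%, which is the same pattern as the logged values.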
+
+# 🐳 Docker and Testing
+## Build the image
+```
+# from the project root
+docker build -t coffee-api:dev .
+docker run --rm -e WANDB_MODE=offline -p 8000:8000 coffee-api:dev
+```
+Note: Use WANDB_MODE=offline (as shown above) when running inside Docker or CI to prevent login prompts from Weights & Biases. If you have a W&B API key, set it via WANDB_API_KEY=your_key to enable cloud logging.
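The rule in the note above could be applied from Python before W&B is initialized. The sketch below is one possible way to do it, assuming a hypothetical helper named `wandb_env`; only the `WANDB_MODE` and `WANDB_API_KEY` environment variable names come from the README.

```python
import os
from typing import Dict, Mapping


def wandb_env(environ: Mapping[str, str]) -> Dict[str, str]:
    """Return the W&B variables to export: offline unless an API key is present."""
    api_key = environ.get("WANDB_API_KEY", "").strip()
    if api_key:
        # Respect an explicit mode if one was given, otherwise log online.
        return {"WANDB_API_KEY": api_key,
                "WANDB_MODE": environ.get("WANDB_MODE", "online")}
    return {"WANDB_MODE": "offline"}


# Example with plain dicts; a real entrypoint might call
# os.environ.update(wandb_env(os.environ)) before training starts.
without_key = wandb_env({})
with_key = wandb_env({"WANDB_API_KEY": "your_key"})
```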
+
+## Run the container
+```
+docker run --rm -p 8000:8000 \
+  -v "$(pwd)/artifacts":/app/artifacts \
+  -v "$(pwd)/config.yaml":/app/config.yaml \
+  -v "$(pwd)/data":/app/data \
+  coffee-api:dev
+```
+Then open:
+• Health check: http://127.0.0.1:8000/health
+• Interactive docs: http://127.0.0.1:8000/docs
+
+If artifacts are missing, the container automatically runs scripts/preprocess.py to generate them.
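A startup check of this kind might look roughly like the sketch below. The container's real entrypoint is not included in this commit, so the list of required files (taken from the `ls -l` command earlier in the README) and the helper names are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import List, Sequence

# Output files named in this README; the exact list checked by the real
# container entrypoint is an assumption.
REQUIRED_ARTIFACTS = (
    "artifacts/preprocessor.joblib",
    "data/cleaned/X_train.csv",
    "data/cleaned/X_test.csv",
    "data/cleaned/y_train.csv",
    "data/cleaned/y_test.csv",
)


def missing_artifacts(root: str, required: Sequence[str] = REQUIRED_ARTIFACTS) -> List[str]:
    """Return the required relative paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in required if not (base / rel).is_file()]


def ensure_artifacts(root: str) -> None:
    """Run the preprocessing script only when something is missing."""
    if missing_artifacts(root):
        subprocess.run([sys.executable, "scripts/preprocess.py"], cwd=root, check=True)


# Small self-check in a throwaway directory (does not touch the project).
with tempfile.TemporaryDirectory() as tmp:
    missing_before = missing_artifacts(tmp)
    for rel in REQUIRED_ARTIFACTS:
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    missing_after = missing_artifacts(tmp)
```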
+
+## Run tests inside the container
+
+To verify reproducibility of preprocessing and data pipeline:
+```
+docker run --rm -v "$(pwd)":/app -w /app coffee-api:dev python -m pytest -q
+```
+Expected output:
+```
+...
+3 passed in ~0.9s
+```
+
+## Docker-related notes:
+- Ports: container exposes 8000 (mapped to host port 8000)
+- Artifacts (preprocessor.joblib, model.joblib) are mounted from the host for faster iteration
+
+# Limitations & Ethics
+Predictions depend on sensory ratings, which are subjective.
+The model is not suitable for real-world evaluation of coffee quality without expert calibration.
+The dataset may contain sampling bias by country or producer, and model predictions should not be used for commercial grading without calibration against expert cuppers.
+
+# Notes / Gotchas
+- config.yaml may include data.input_columns. If present, the server expects those columns and reindexes incoming payloads automatically.
+- The server will try to load artifacts/preprocessor.joblib and artifacts/model.joblib. If those are missing, the server returns deterministic dummy predictions (development mode).
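The two behaviours in these notes can be pictured with a short sketch. The server code (app/server.py) is not part of this commit, so both functions below are hypothetical: `reindex_row` shows one way to align a payload to `data.input_columns`, and `dummy_predictions` shows what a deterministic placeholder could look like. The real server's dummy values may be computed differently.

```python
from typing import Any, Dict, List, Optional, Sequence


def reindex_row(row: Dict[str, Any], input_columns: Sequence[str]) -> Dict[str, Optional[Any]]:
    """Keep only the configured columns, in order; missing ones become None."""
    return {col: row.get(col) for col in input_columns}


def dummy_predictions(rows: Sequence[Dict[str, Any]]) -> List[float]:
    """Deterministic placeholder: the mean of the numeric inputs, or 0.0 if none."""
    preds: List[float] = []
    for row in rows:
        numeric = [float(v) for v in row.values()
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        preds.append(round(sum(numeric) / len(numeric), 2) if numeric else 0.0)
    return preds


columns = ["Aroma", "Flavor", "Body"]            # stand-in for data.input_columns
payload = {"Flavor": 6.0, "Aroma": 7.0, "Unexpected": "x"}
aligned = reindex_row(payload, columns)
placeholder = dummy_predictions([aligned])
```

Because the placeholder depends only on its input, repeated calls with the same payload return the same value, which is what makes it usable in development mode and in tests.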
+
+# ☁️ Cloud Services Used
+- **Google Cloud Storage (GCS):** Stores the cleaned dataset (`preprocessed_data.csv`) publicly.
+- **Google Cloud Run:** Hosts and serves the FastAPI model API container.
+- **Weights & Biases (W&B):** Tracks model training metrics and performance.
+
+# Hugging Face
+
+# 🧠 Authors
+- Eugenia Tate
+- Avery Estopinal
+
+# References:
+- OpenAI. (2025). ChatGPT (Version 5.1) [Large language model]. https://chat.openai.com We used ChatGPT (OpenAI GPT-5.1) to assist with code snippets.
+Portions of the preprocessing, frontend, train and most of server code were assisted by ChatGPT (OpenAI GPT-5.1). Authors verified and adapted the generated code.
+Authors fully understand what the code does and how to apply the knowledge in the future.
+- Kaggle Coffee Quality Data (Volpatto, 2020) https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi
app.py
ADDED
@@ -0,0 +1,251 @@
+# app/frontend.py
+# Author: Eugenia Tate
+# Date: 11/23/2025
+
+# CITATION:
+# ChatGPT was used to prototype robust JSON-sanitization and input-coercion logic when encountering serialization errors
+# and mixed user inputs (strings, noisy numeric text, pandas / numpy scalars). A recursive approach and conversion patterns
+# were suggested; we reviewed and thoroughly tested the code locally. See coerce_and_clamp_dict() and make_json_safe() below.
+
+# import necessary helpers
+import os
+import yaml
+import json
+import math
+import pandas as pd, numpy as np # table handling
+import gradio as gr # UI
+import requests # to call API server
+from typing import Dict, Any, List
+
+# point to config.yaml file to retrieve API URL
+CONFIG_PATH = os.path.join(os.getcwd(), "config.yaml")
+# The above line was modified by ChatGPT 5.1 at 10:41a on 11/24/25 to work with Hugging Face
+# if config exists - load it
+if os.path.exists(CONFIG_PATH):
+    with open(CONFIG_PATH, "r") as f:
+        cfg = yaml.safe_load(f) or {}  # an empty YAML file loads as None
+# if config does not exist - it falls back to being an empty dict
+else:
+    cfg = {}
+
+# server endpoint UI will use for POST; if config is missing fallback to predict_named
+API_URL = (cfg.get("api_url") or {}).get("FastAPI", "http://127.0.0.1:8000/predict_named")
+
+# reduced set of sensible columns exposed in UI to the end user
+INPUT_COLS = [
+    "Aroma", "Flavor", "Aftertaste", "Acidity",
+    "Body", "Balance", "Sweetness", "Clean.Cup"
+]
+# help text for the end user explaining Clean Cup feature
+CLEAN_CUP_HELP = (
+    "Clean.Cup indicates the absence of off-flavors or defects (higher is better). "
+    "Typically scored on the same sensory scale as other cup attributes."
+)
+# enforcing 0 to 10 possible values for input
+RANGES = {c: (0.0, 10.0) for c in INPUT_COLS}
+
+# ------------------------------------ CITED BLOCK --------------------------------------------------------------------
+# implemented using ChatGPT (conversation 2025-11-23) to help normalize free-form user input into numeric values within range
+# convert user values to allowed 0 - 10 range to avoid errors/crashes: handles blanks, strings, noisy input by stripping chars
+# and sets None for missing / invalid entries (JSON's null)
+def coerce_and_clamp_dict(row: Dict[str, Any]) -> Dict[str, Any]:
+    # out = {}
+    out: Dict[str, Any] = {}
+    # iterates over 8 input columns
+    for k in INPUT_COLS:
+        v = row.get(k, "")
+        # if the value the user typed is None or a blank string - store None (JSON null);
+        # noisy input like "7.5pts" is handled further below by stripping the letters and keeping the number
+        if v is None or (isinstance(v, str) and v.strip() == ""):
+            # out[k] = np.nan
+            out[k] = None
+            continue
+        # tries to convert to float
+        fv = None
+        try:
+            fv = float(v)
+        except Exception:
+            # try to strip out non-digit characters (e.g. "7.5pts" -> "7.5")
+            try:
+                cleaned = "".join(ch for ch in str(v) if (ch.isdigit() or ch in ".-"))
+                fv = float(cleaned) if cleaned not in ("", ".", "-") else None
+            except Exception:
+                fv = None
+        # if conversion failed -> None
+        if fv is None or (isinstance(fv, float) and (math.isnan(fv) or math.isinf(fv))):
+            out[k] = None
+            continue
+        # once we have a clean numeric - it is clamped to be within [0,10] range of valid inputs
+        # if user typed 13 it will be clamped to 10
+        # if user typed -2 it will become 0
+        lo, hi = RANGES.get(k, (None, None))
+        if lo is not None and hi is not None:
+            fv = max(lo, min(hi, fv))
+        out[k] = float(fv)
+    # returns a clean dict to be sent to server
+    return out
+
+# ChatGPT 5.1 used to prototype this recursive JSON-sanitizer
+# This function recursively walks nested containers (dicts, lists, tuples) and ensures any nested
+# structure (e.g. {"payload": [{"Aroma": np.nan}]}) becomes JSON-safe everywhere, not just the top level
+def make_json_safe(obj):
+    # dict
+    if isinstance(obj, dict):
+        return {k: make_json_safe(v) for k, v in obj.items()}
+    # list/tuple
+    if isinstance(obj, (list, tuple)):
+        return [make_json_safe(v) for v in obj]
+    # numpy scalar -> python scalar
+    try:
+        import numpy as _np
+        if isinstance(obj, _np.generic):
+            return make_json_safe(obj.item())
+    except Exception:
+        pass
+    # floats: map NaN/Inf -> None
+    if isinstance(obj, float):
+        if math.isnan(obj) or math.isinf(obj):
+            return None
+        return float(obj)
+    # ints, bool, str, None: ok
+    if isinstance(obj, (int, bool, str)) or obj is None:
+        return obj
+    # fallback
+    try:
+        return str(obj)
+    except Exception:
+        return None
+
+
+# ------------------------------------------ END CITED BLOCK ------------------------------------------------
+
+# helper function that returns True if every value in a row is null or numeric 0, otherwise - False
+def _row_is_all_null_or_zero(row: Dict[str, Any]) -> bool:
+    for v in row.values():
+        # missing/null -> keep scanning (counts as "no numeric input")
+        if v is None:
+            continue
+        # numeric non-zero -> row is VALID
+        if isinstance(v, (int, float)) and v != 0:
+            return False
+        # anything else (string, etc) is considered missing/invalid; continue
+        # but coerce_and_clamp_dict should have converted those to None or numeric
+    return True
+
+# sends JSON to server endpoint, returns a tuple (predictions list, raw response/error)
+def call_api_named(payload_rows: List[Dict[str, Any]]):
+    # sanitize payload so it's JSON-serializable and uses `null` for missing
+    safe_body = {"rows": make_json_safe(payload_rows)}
+    try:
+        payload_str = json.dumps(safe_body)
+    except Exception as e:
+        return None, f"Serialization error: {e}"
+    # tries calling POST to get predictions using requests lib
+    headers = {"Content-Type": "application/json"}
+    try:
+        response = requests.post(API_URL, data=payload_str, headers=headers, timeout=10) # timeout at 10 sec to avoid hanging
+        response.raise_for_status()
+        # returns prediction list and full raw text response to be used within debug box on SUCCESS (200 OK)
+        return response.json().get("predictions", []), response.text
+    except Exception as e:
+        return None, f"API error: {e}" # on error return None
+
+# coerces the rows, calls the API, and prettifies the prediction and debug JSON
+def predict_from_rows_of_dicts(rows_of_dicts: List[Dict[str, Any]]):
+    payload_rows = [coerce_and_clamp_dict(row) for row in rows_of_dicts]
+    # decide whether submission is allowed:
+    # - if every submitted row is all-null-or-zero, refuse
+    all_rows_invalid = all(_row_is_all_null_or_zero(r) for r in payload_rows)
+    if all_rows_invalid:
+        debug = {"payload": payload_rows, "response_raw": "skipped - all values missing or zero"}
+        return "Please enter at least one numeric attribute (non-zero) before submitting.", json.dumps(debug, indent=2)
+    # Otherwise proceed and call API (allowed if at least one row has a non-zero numeric)
+    preds, raw = call_api_named(payload_rows)
+    # building a debug dictionary containing both payload and raw server response
+    debug = {"payload": payload_rows, "response_raw": raw}
+    # if API fails - return empty prediction and debug JSON for debugging
+    if preds is None:
+        return "", json.dumps(debug, indent=2)
+    # prettifying predictions upon successful call to be user-friendly
+    prettified_pred = [f"Predicted Coffee Quality Points = {round(float(p), 1)}" for p in preds] # rounding predictions to 1 decimal place (user friendly)
+    # returns prettified prediction and debug JSON for debug box
+    return "\n".join(prettified_pred), json.dumps(debug, indent=2)
+
+
+def predict_from_table(table):
+    rows_of_dicts = table_to_list_of_dicts(table)
+    return predict_from_rows_of_dicts(rows_of_dicts)
+
+
+# ------------------------------------ CITED BLOCK -------------------------------------
+# ChatGPT was used on 11/23/2025 to fix this function after errors were encountered, so that it can deal
+# with 2 possible incoming formats: Dataframe and list of lists.
+
+# helper function puts input into the format expected by the server, a list-of-dicts keyed by INPUT_COLS:
+# [{"Aroma": 7.5, "Flavor": 8.0, ...}];
+# fills missing columns with empty strings so coerce_and_clamp_dict() can convert them to None
+def table_to_list_of_dicts(table):
+    # if table passed in is an instance of Dataframe obj - turn each row into a dict
+    if isinstance(table, pd.DataFrame):
+        df = table
+        return [df.iloc[i].to_dict() for i in range(len(df))]
+    # else - assume table is a list of lists and manually pair each element to corresponding column
+    rows = []
+    for row in table:
+        # ensure row has right length
+        vals = list(row) + [""] * max(0, len(INPUT_COLS) - len(row))
+        rows.append({col: vals[i] for i, col in enumerate(INPUT_COLS)})
+    return rows
+# ------------------------------- END CITED BLOCK -------------------------------------------
+
+
+# -------------------------------- Gradio UI ------------------------------------------------------
+with gr.Blocks(title="Coffee Quality Points Estimator") as demo:
+    # inline HTML/CSS to style user instructions
+    gr.Markdown("<h1 style='text-align:center;color:#08306B'>Coffee Quality Points Estimator</h1>")
+    gr.Markdown(
+        "<div style='font-size:17px;font-weight:700;color:#2b6cb0'>"
+        "Instructions: Fill the known sensory attributes (0–10). Leave unknowns blank and the model will "
+        "attempt to infer missing values. Then click <b style='color:#ff6600'>Submit</b> to estimate the "
+        "<b>Coffee Quality Points</b> (Total.Cup.Points). Higher scores mean better coffee quality.</div>"
+    )
+
+    with gr.Row():
+        # presents 1 row by default with INPUT_COLS
+        df_input = gr.Dataframe(
+            headers=INPUT_COLS,
+            value=[["" for _ in INPUT_COLS]], # list of lists to avoid validation errors encountered on testing
+            # ------------------------- ChatGPT 5.1 was used to fix the issues on 11/23/2025 ---------------------
+            row_count=1,
+            col_count=len(INPUT_COLS),
+            interactive=True,
+            label="Enter Known Columns (0–10 range; numeric values preferred)"
+        )
+
+    with gr.Row():
+        submit_btn = gr.Button("Submit", variant="primary")
+
+    with gr.Row():
+        # short prediction for the user
+        pred_out = gr.Textbox(label="Predicted Coffee Quality Points", lines=1, interactive=False)
+
+    with gr.Row():
+        # full debug info for developer
+        debug_out = gr.Textbox(label="Debug (payload + raw response)", lines=10, interactive=False)
+
+    with gr.Row():
+        gr.Markdown(f"<b>Note:</b> <i>{CLEAN_CUP_HELP}</i>")
+
+    # When user clicks Submit, Gradio sends the contents of the table to table_to_list_of_dicts().
+    # the content can either be a Dataframe or list of lists and the helper function can handle both
+    # making the format consistent with FastAPI expectations
+    def submit_table(table):
+        rows_of_dicts = table_to_list_of_dicts(table)
+        return predict_from_rows_of_dicts(rows_of_dicts)
+
+    # fires up the actual prediction
+    submit_btn.click(predict_from_table, inputs=[df_input], outputs=[pred_out, debug_out])
+
+if __name__ == "__main__":
+    # starts the Gradio server and prints the URL to open in a browser
+    demo.launch()
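The input coercion in app.py can be tried in isolation with a standalone sketch like the one below. It is a simplified variant, not the function from the commit: it handles a single value rather than a row, and it uses a regular expression to pull out the first number instead of app.py's character filtering, so a few edge cases behave differently (for example `"7-8"` yields 7.0 here, whereas the character filter would fail to parse it and return None). The name `coerce_score` is hypothetical.

```python
import math
import re
from typing import Any, Optional

# First signed decimal number found in a string, e.g. "7.5" in "7.5pts".
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def coerce_score(value: Any, lo: float = 0.0, hi: float = 10.0) -> Optional[float]:
    """Turn free-form input into a float clamped to [lo, hi], or None if unusable."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        number = float(value)
    else:
        match = _NUMBER.search(str(value))
        if match is None:
            return None          # blank or non-numeric text
        number = float(match.group())
    if math.isnan(number) or math.isinf(number):
        return None              # NaN/Inf cannot be sent as JSON numbers
    return max(lo, min(hi, number))


examples = {raw: coerce_score(raw) for raw in ["7.5pts", "13", "-2", "", "abc"]}
```

Returning None rather than NaN keeps the value serializable as JSON `null`, which matches the design choice described in the comments of app.py.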
config.yaml
ADDED
@@ -0,0 +1,73 @@
+data:
+  # public URL of the cleaned dataset; if left empty the script defaults to the local file
+  url: "https://storage.googleapis.com/coffee-quality-data/preprocessed_data.csv"
+  local_path: "data/raw/raw_data.csv"
+  preprocessed_path: "data/preprocessed/preprocessed_data.csv"
+  target: "Total.Cup.Points"
+  input_columns:
+    - Number.of.Bags
+    - Category.One.Defects
+    - Category.Two.Defects
+    - Aroma
+    - Flavor
+    - Aftertaste
+    - Acidity
+    - Body
+    - Balance
+    - Uniformity
+    - Clean.Cup
+    - Sweetness
+    - Cupper.Points
+    - Moisture
+    - Quakers
+    - altitude_low_meters
+    - altitude_high_meters
+    - altitude_mean_meters
+    - Species
+    - Owner
+    - Country.of.Origin
+    - Mill
+    - ICO.Number
+    - Company
+    - Altitude
+    - Region
+    - Producer
+    - Bag.Weight
+    - In.Country.Partner
+    - Harvest.Year
+    - Grading.Date
+    - Owner.1
+    - Variety
+    - Processing.Method
+    - Color
+    - Expiration
+    - Certification.Body
+    - Certification.Address
+    - Certification.Contact
+    - unit_of_measurement
+
+# model details to be added later during train.py work
+train:
+  test_size: 0.2
+  random_state: 42
+  model_params:
+    n_estimators: 100
+    random_state: 42
+    n_jobs: -1
+
+paths:
+  X_train: "data/cleaned/X_train.csv"
+  X_test: "data/cleaned/X_test.csv"
+  y_train: "data/cleaned/y_train.csv"
+  y_test: "data/cleaned/y_test.csv"
+
+artifacts:
+  model: "artifacts/model.joblib"
+  preprocessor: "artifacts/preprocessor.joblib"
+  metrics: "artifacts/metrics.json"
+# The above snippet was generated by chatGPT 5.1 at 10:20p at 11/20/25.
+
+api_url:
+  # FastAPI: "http://127.0.0.1:8000/predict_named"
+  FastAPI: "https://coffee-api-354131048216.us-central1.run.app/predict_named"
+
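How the frontend might pick its endpoint from the `api_url` block can be sketched as follows. This is an illustrative helper under stated assumptions, not the code from app.py: `resolve_api_url` is a hypothetical name, and the extra guards cover the cases where the config is empty or every entry under `api_url` is commented out (which YAML loads as null).

```python
from typing import Any, Dict, Optional

DEFAULT_API_URL = "http://127.0.0.1:8000/predict_named"


def resolve_api_url(cfg: Optional[Dict[str, Any]], default: str = DEFAULT_API_URL) -> str:
    """Return api_url.FastAPI from the parsed config, or the local default."""
    section = (cfg or {}).get("api_url") or {}
    url = section.get("FastAPI") if isinstance(section, dict) else None
    return url or default


cloud_cfg = {"api_url": {"FastAPI": "https://coffee-api-354131048216.us-central1.run.app/predict_named"}}
cloud_url = resolve_api_url(cloud_cfg)
local_url = resolve_api_url({"api_url": None})   # all entries commented out
```

Switching between the local server and Cloud Run then only requires changing which `FastAPI` line is commented out in config.yaml.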
requirements.txt
ADDED
@@ -0,0 +1,14 @@
+fastapi>=0.95
+uvicorn[standard]>=0.22.0
+pydantic>=1.10
+PyYAML==6.0
+joblib==1.3.2
+scikit-learn==1.7.2
+numpy==1.26.4
+pandas==2.2.2
+gradio==3.41.0
+requests==2.31.0
+wandb==0.23.0
+
+# test / dev tools
+pytest>=7.4