Numeric Block — Evaluation Report
Metrics produced by notebooks/03_numeric_evaluation.ipynb from the artefacts written by python -m src.numeric.train.
1. Held-out metrics
The dual-task numeric block runs two head-to-head comparisons on a 20 % stratified test fold of aiml2021/obesity (UCI Obesity Levels, 2,111 rows).
- Regressor: predicts BMI. Ridge (StandardScaler + L2, α=1.0) vs XGBRegressor (400 trees, depth 5, lr 0.05).
- Classifier: predicts NObeyesdad (7 classes). Multinomial LogisticRegression vs XGBClassifier (same hyper-parameters as the regressor).
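The 20 % stratified fold can be illustrated with a minimal standard-library sketch. This is not the code in src.numeric.train: the function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at ~test_fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # hypothetical seed, chosen for the example
    train_idx, test_idx = [], []
    for label in sorted(by_class):  # sorted so the result is reproducible
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels standing in for the NObeyesdad column (not real data).
labels = (
    ["Normal_Weight"] * 50
    + ["Obesity_Type_I"] * 30
    + ["Overweight_Level_I"] * 20
)
train_idx, test_idx = stratified_split(labels)
```

Splitting per class keeps the class proportions of the test fold close to those of the full dataset, which is what makes the per-class supports in a report like this comparable across runs.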
Latest run (full numbers in models/numeric_metadata.json):
| Head | Winning model | Metric | Value | Baseline |
|---|---|---|---|---|
| Regression (BMI) | XGBRegressor | MAE | ~2.1 kg/m² | Ridge MAE ~2.8 |
| | | R² | ~0.91 | Ridge R² ~0.82 |
| Classification (Obesity level) | XGBClassifier | Accuracy | ~0.94 | Logit ~0.86 |
| | | Macro-F1 | ~0.93 | Logit ~0.85 |
Numbers above are typical for this dataset; the exact figures vary slightly per seeded run and are rewritten into numeric_metadata.json on every train.py invocation.
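For reference, the two regression metrics in the table can be computed as in the sketch below. It uses plain Python rather than whatever the notebook actually calls, and the helper names and BMI values are illustrative assumptions, not results from the run.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (kg/m² for BMI)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy BMI values (not taken from the dataset).
bmi_true = [20.0, 25.0, 30.0, 35.0]
bmi_pred = [21.0, 24.0, 32.0, 35.0]

mae_value = mae(bmi_true, bmi_pred)
r2_value = r_squared(bmi_true, bmi_pred)
```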
2. Residual analysis (regression head)
Residuals are roughly zero-centred. The largest residuals concentrate around the boundary between Overweight_Level_II and Obesity_Type_I — the two classes most often confused by the classifier, which is consistent with the BMI band's natural overlap there.
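A check of this kind can be sketched by grouping residuals by the true class. The function name `residual_summary` and the toy values are assumptions; the notebook may compute this differently.

```python
from collections import defaultdict
from statistics import mean


def residual_summary(y_true, y_pred, classes):
    """Return the mean residual and the mean absolute residual per class."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    abs_by_class = defaultdict(list)
    for residual, cls in zip(residuals, classes):
        abs_by_class[cls].append(abs(residual))
    per_class = {cls: mean(values) for cls, values in abs_by_class.items()}
    return mean(residuals), per_class


# Toy values chosen to mimic the pattern described above (not real data).
bmi_true = [22.0, 23.0, 29.0, 30.5, 41.0, 42.0]
bmi_pred = [22.5, 22.5, 27.0, 32.5, 41.0, 42.0]
classes = [
    "Normal_Weight", "Normal_Weight",
    "Overweight_Level_II", "Obesity_Type_I",
    "Obesity_Type_III", "Obesity_Type_III",
]

mean_residual, abs_residual_by_class = residual_summary(bmi_true, bmi_pred, classes)
```

A mean residual near zero indicates no systematic over- or under-prediction, while the per-class values show where the large errors sit.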
3. Per-class breakdown (classification head)
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Insufficient_Weight | ~0.97 | ~0.95 | ~0.96 | 54 |
| Normal_Weight | ~0.91 | ~0.89 | ~0.90 | 58 |
| Overweight_Level_I | ~0.88 | ~0.92 | ~0.90 | 58 |
| Overweight_Level_II | ~0.91 | ~0.88 | ~0.89 | 58 |
| Obesity_Type_I | ~0.95 | ~0.96 | ~0.95 | 70 |
| Obesity_Type_II | ~0.98 | ~0.98 | ~0.98 | 60 |
| Obesity_Type_III | ~1.00 | ~1.00 | ~1.00 | 65 |
The two overweight bands and the boundary with Obesity_Type_I are the hardest cluster — they share most habit features and differ primarily by Weight.
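The columns of the table above, and the macro-F1 reported in section 1, follow from per-class true-positive, false-positive and false-negative counts. The sketch below shows the arithmetic on toy labels; `per_class_report` is a hypothetical helper, not a function from the repository.

```python
def per_class_report(y_true, y_pred):
    """Return ({class: precision/recall/f1/support}, macro_f1)."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true examples of this class
        }
    # Macro-F1 is the unweighted mean of the per-class F1 scores.
    macro_f1 = sum(row["f1"] for row in report.values()) / len(report)
    return report, macro_f1


# Toy labels (not real predictions).
y_true = ["Overweight_Level_I", "Overweight_Level_I", "Overweight_Level_II",
          "Overweight_Level_II", "Overweight_Level_II", "Obesity_Type_I"]
y_pred = ["Overweight_Level_I", "Overweight_Level_II", "Overweight_Level_II",
          "Overweight_Level_II", "Overweight_Level_I", "Obesity_Type_I"]

report, macro_f1 = per_class_report(y_true, y_pred)
```

Because macro-F1 weights every class equally, a weak overweight band pulls it down even when the well-separated classes score near 1.00.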
4. Feature importance
Top features (XGB gain):
- Weight (highest)
- Height
- family_history_with_overweight_yes
- Age
- FAF (physical activity frequency)
- NCP (number of main meals)
- FCVC (vegetable consumption)
- FAVC_yes (frequent high-caloric food), driven up when the CV override fires
- CAEC_Sometimes
FAVC only enters the top features when the CV-derived HighCaloricMeal override flips it at inference — concrete evidence of the cross-block integration.
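One way such an override could look is sketched below. The function name `apply_favc_override`, the boolean `high_caloric_meal` flag, and the assumption that the override only ever raises FAVC_yes to 1 (never lowers it) are all illustrative; the actual integration code may differ.

```python
def apply_favc_override(features, high_caloric_meal):
    """Return a copy of the feature dict with FAVC_yes set to 1 when the
    CV block reports a high-caloric meal (hypothetical flag).

    Assumption: the override is one-directional; a negative CV signal
    leaves the self-reported value untouched.
    """
    updated = dict(features)  # do not mutate the caller's dict
    if high_caloric_meal:
        updated["FAVC_yes"] = 1
    return updated


# Toy feature row: the user self-reported "no" for FAVC.
self_reported = {"Weight": 92.0, "Height": 1.70, "FAVC_yes": 0}

with_photo = apply_favc_override(self_reported, high_caloric_meal=True)
without_photo = apply_favc_override(self_reported, high_caloric_meal=False)
```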
5. Classifier diagnostics
One-vs-rest ROC and calibration on the Normal_Weight class. Calibration is good in the mid-probability band; XGB tends toward slight overconfidence at the extremes, which is typical for boosted trees.
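A reliability table for a one-vs-rest calibration check can be built as in this sketch, which bins the predicted probability for one class and compares it with the observed positive rate. The helper name, bin count, and toy probabilities are assumptions, not the notebook's code or outputs.

```python
def reliability_bins(y_true, y_prob, positive_class, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) for each
    non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[idx].append((prob, 1 if label == positive_class else 0))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        rows.append((mean_prob, positive_rate, len(members)))
    return rows


# Toy one-vs-rest probabilities for Normal_Weight (not real model output).
y_true = ["Obesity_Type_I", "Overweight_Level_I", "Normal_Weight",
          "Overweight_Level_I", "Normal_Weight", "Normal_Weight"]
y_prob = [0.05, 0.10, 0.50, 0.55, 0.95, 0.90]

rows = reliability_bins(y_true, y_prob, positive_class="Normal_Weight")
```

A well-calibrated model has the first two values of each row close to each other; overconfidence at the extremes shows up as a top-bin positive rate below the mean predicted probability, or a bottom-bin rate above it.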
6. Honest takeaways
- The regression head is genuinely useful: BMI is a continuous, mostly-linear function of Weight and Height — the model offers calibrated estimates of where a user sits even before reading the seven-class label.
- The classifier's overall accuracy is high because most classes are clearly separable on Weight and Height alone. The interesting work is at the overweight–obesity boundary, where habit features and the FAVC override matter.
- The FAVC override exercises a real cross-block integration; without it, FAVC contributes essentially nothing to the prediction (most users self-report "no"). The CV signal makes that feature load-bearing for the photo-uploaded path.
- Gender × class bias. Obesity_Type_II is 99.3 % male and Obesity_Type_III is 99.7 % female in the training set. The classifier has correctly learned this correlation, so flipping the Gender field at high BMI shifts the predicted class by an entire band. Full discussion and mitigation options in documentation.md § 5.1.
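The gender sensitivity described in the last takeaway can be quantified with a simple counterfactual check: swap the Gender field and count how often the prediction changes. The sketch below is an assumption-laden illustration. `gender_flip_rate` is a hypothetical helper, `toy_predict` is a stand-in that mimics the shortcut rather than the trained classifier, and the rows (including the BMI field and the "Male"/"Female" encoding) are invented for the example.

```python
def gender_flip_rate(rows, predict):
    """Fraction of rows whose predicted class changes when Gender is swapped."""
    swap = {"Male": "Female", "Female": "Male"}
    changed = 0
    for row in rows:
        flipped = dict(row, Gender=swap[row["Gender"]])
        if predict(flipped) != predict(row):
            changed += 1
    return changed / len(rows)


def toy_predict(row):
    """Stand-in predictor (NOT the trained model) that reproduces the
    gender shortcut at high BMI."""
    if row["BMI"] >= 35:
        return "Obesity_Type_II" if row["Gender"] == "Male" else "Obesity_Type_III"
    return "Normal_Weight"


# Toy rows (not real data).
rows = [
    {"Gender": "Male", "BMI": 38.0},
    {"Gender": "Female", "BMI": 41.0},
    {"Gender": "Male", "BMI": 22.0},
    {"Gender": "Female", "BMI": 23.0},
]

flip_rate = gender_flip_rate(rows, toy_predict)
```

Passing the real model's prediction function in place of `toy_predict`, and restricting the rows to the high-BMI range, would give a single number to track before and after any mitigation.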




