Numeric Block — Evaluation Report

Metrics produced by notebooks/03_numeric_evaluation.ipynb from the artefacts written by python -m src.numeric.train.

1. Held-out metrics

The dual-task numeric block runs two head-to-head comparisons on a 20 % stratified test fold of aiml2021/obesity (UCI Obesity Levels, 2,111 rows).

  • Regressor — predicts BMI. Ridge (StandardScaler + L2, α=1.0) vs XGBRegressor (400 trees, depth 5, lr 0.05).
  • Classifier — predicts NObeyesdad (7 classes). Multinomial LogisticRegression vs XGBClassifier (same hyper-parameters as the regressor).
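The 20 % stratified hold-out mentioned above can be illustrated with a minimal, standard-library sketch. The function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration; the project's actual split lives in `src.numeric.train` and may be implemented differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), holding out ~test_fraction of every class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels only; these are not the real class counts.
labels = ["Normal_Weight"] * 50 + ["Obesity_Type_I"] * 30 + ["Obesity_Type_III"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps the class proportions of the test fold close to those of the full dataset, which matters when per-class metrics are reported on a few dozen rows each.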

Latest run (full numbers in models/numeric_metadata.json):

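| Head | Winning model | Metric | Value | Baseline |
|---|---|---|---|---|
| Regression (BMI) | XGBRegressor | MAE | ~2.1 kg/m² | Ridge MAE ~2.8 |
| Regression (BMI) | XGBRegressor | R² | ~0.91 | Ridge R² ~0.82 |
| Classification (Obesity level) | XGBClassifier | Accuracy | ~0.94 | Logit ~0.86 |
| Classification (Obesity level) | XGBClassifier | Macro-F1 | ~0.93 | Logit ~0.85 |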

The numbers above are approximate and representative of a typical run on this dataset; exact figures vary slightly with the random seed and are rewritten into numeric_metadata.json on every train.py invocation.
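For readers who want the definitions behind the regression and accuracy figures, the sketch below computes MAE, R² and accuracy from scratch. It follows the standard textbook definitions; the helper names and the toy values are illustrative assumptions, not output from the notebook.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values only (BMI in kg/m²); not taken from numeric_metadata.json.
bmi_true = [20.0, 25.0, 30.0, 35.0]
bmi_pred = [21.0, 24.0, 32.0, 35.0]
print(mean_absolute_error(bmi_true, bmi_pred))
print(r2_score(bmi_true, bmi_pred))
```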

2. Residual analysis (regression head)

[Figure: residuals]

[Figure: residuals vs predicted]

Residuals are roughly zero-centred. The largest residuals concentrate around the boundary between Overweight_Level_II and Obesity_Type_I — the two classes most often confused by the classifier, which is consistent with the BMI band's natural overlap there.
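One way to check where the large residuals concentrate is to group them by true class. The sketch below is a hypothetical helper, not code from the notebook; `residual_summary` and the toy values are assumptions, and the residual is taken as observed minus predicted BMI.

```python
from collections import defaultdict
from statistics import mean


def residual_summary(bmi_true, bmi_pred, classes):
    """Per-class mean residual (bias) and mean absolute residual (spread)."""
    grouped = defaultdict(list)
    for observed, predicted, label in zip(bmi_true, bmi_pred, classes):
        grouped[label].append(observed - predicted)
    return {
        label: {
            "mean": mean(residuals),
            "mean_abs": mean([abs(r) for r in residuals]),
        }
        for label, residuals in grouped.items()
    }


# Toy values only; they do not come from the evaluation run.
summary = residual_summary(
    bmi_true=[27.0, 28.0, 30.0, 31.0],
    bmi_pred=[26.0, 30.0, 29.0, 34.0],
    classes=[
        "Overweight_Level_II", "Overweight_Level_II",
        "Obesity_Type_I", "Obesity_Type_I",
    ],
)
for label, stats in summary.items():
    print(label, stats)
```

A per-class mean near zero with a large mean absolute residual points to spread rather than systematic bias in that band.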

3. Per-class breakdown (classification head)

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Insufficient_Weight | ~0.97 | ~0.95 | ~0.96 | 54 |
| Normal_Weight | ~0.91 | ~0.89 | ~0.90 | 58 |
| Overweight_Level_I | ~0.88 | ~0.92 | ~0.90 | 58 |
| Overweight_Level_II | ~0.91 | ~0.88 | ~0.89 | 58 |
| Obesity_Type_I | ~0.95 | ~0.96 | ~0.95 | 70 |
| Obesity_Type_II | ~0.98 | ~0.98 | ~0.98 | 60 |
| Obesity_Type_III | ~1.00 | ~1.00 | ~1.00 | 65 |

The two overweight bands and the boundary with Obesity_Type_I are the hardest cluster — they share most habit features and differ primarily by Weight.
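The per-class columns follow the usual definitions of precision, recall, F1 and support, and macro-F1 is the unweighted mean of the per-class F1 scores. A minimal sketch, assuming plain lists of true and predicted labels (the helper names and the toy labels are illustrative, not the notebook's code):

```python
def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every label that appears."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true rows of this class
        }
    return report


def macro_f1(report):
    """Unweighted mean of per-class F1, so small classes count equally."""
    return sum(row["f1"] for row in report.values()) / len(report)


# Toy labels only, focused on the hardest boundary discussed above.
y_true = ["Overweight_Level_I", "Overweight_Level_I", "Overweight_Level_II",
          "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_I"]
y_pred = ["Overweight_Level_I", "Overweight_Level_II", "Overweight_Level_II",
          "Obesity_Type_I", "Obesity_Type_I", "Obesity_Type_I"]
report = per_class_report(y_true, y_pred)
print(round(macro_f1(report), 3))
```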

4. Feature importance

[Figure: feature importance]

Top features (XGB gain):

Weight                                 highest
Height
family_history_with_overweight_yes
Age
FAF (physical activity frequency)
NCP (number of main meals)
FCVC (vegetable consumption)
FAVC_yes (frequent high-caloric food)  ← driven up when the CV override fires
CAEC_Sometimes

FAVC only enters the top features when the CV-derived HighCaloricMeal override flips it at inference — concrete evidence of the cross-block integration.
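The override described above can be pictured as a small preprocessing step applied to the feature row before prediction. This is a sketch under stated assumptions: the function name `apply_favc_override` is hypothetical, the row is assumed to be a dict keyed by one-hot feature names, and the override is assumed to only ever switch `FAVC_yes` on, never off. The real integration code may differ on all three points.

```python
def apply_favc_override(features, high_caloric_meal):
    """Return a copy of the feature row with FAVC_yes forced to 1
    when the CV block reports a high-caloric meal.

    Assumption: the override is one-directional, so a negative or
    missing CV signal leaves the self-reported value untouched.
    """
    updated = dict(features)  # never mutate the caller's row
    if high_caloric_meal:
        updated["FAVC_yes"] = 1
    return updated


# Toy row: the user self-reported "no" for frequent high-caloric food.
row = {"Weight": 92.0, "Height": 1.75, "FAVC_yes": 0}
with_photo = apply_favc_override(row, high_caloric_meal=True)
without_photo = apply_favc_override(row, high_caloric_meal=False)
print(with_photo["FAVC_yes"], without_photo["FAVC_yes"])
```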

5. Classifier diagnostics

[Figure: ROC]

[Figure: calibration]

The plots show one-vs-rest ROC and calibration curves for the Normal_Weight class. Calibration is good in the mid-probability band; XGB tends toward slight overconfidence at the extremes, which is typical for boosted trees.
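A one-vs-rest calibration curve can be approximated by binning the predicted probability for the chosen class and comparing each bin's mean prediction with the observed positive rate. The sketch below uses equal-width bins; `reliability_bins`, the bin count, and the toy probabilities are assumptions for illustration, not the notebook's implementation.

```python
def reliability_bins(y_true, proba, positive_class, n_bins=5):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin.

    proba holds the predicted probability of positive_class per row;
    every other class is treated as the negative ("rest") outcome.
    """
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, proba):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, 1 if label == positive_class else 0))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        rows.append((mean_pred, observed, len(members)))
    return rows


# Toy data only; "other" stands for any class except Normal_Weight.
y_true = ["other", "other", "Normal_Weight", "other",
          "Normal_Weight", "Normal_Weight"]
proba = [0.05, 0.10, 0.50, 0.55, 0.90, 0.95]
for mean_pred, observed, count in reliability_bins(y_true, proba, "Normal_Weight"):
    print(f"predicted={mean_pred:.3f} observed={observed:.3f} n={count}")
```

A well-calibrated model has observed rates close to the mean predictions. Overconfidence at the extremes shows up as high-probability bins whose observed rate falls short of the prediction, and low-probability bins whose observed rate exceeds it.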

6. Honest takeaways

  • The regression head is genuinely useful: BMI is a continuous, mostly linear function of Weight and Height, so the model offers calibrated estimates of where a user sits even before reading the seven-class label.
  • The classifier's overall accuracy is high because most classes are clearly separable on Weight and Height alone. The interesting work is at the overweight–obesity boundary, where habit features and the FAVC override matter.
  • The FAVC override exercises a real cross-block integration; without it, FAVC contributes essentially nothing to the prediction (most users self-report "no"). The CV signal makes that feature load-bearing for the photo-uploaded path.
  • Gender × class bias. Obesity_Type_II is 99.3 % male and Obesity_Type_III is 99.7 % female in the training set. The classifier has correctly learned this correlation, so flipping the Gender field at high BMI shifts the predicted class by an entire band. Full discussion and mitigation options in documentation.md § 5.1.
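The Gender sensitivity noted in the last takeaway can be probed with a simple counterfactual test: swap the Gender field, keep everything else fixed, and record which predictions change. The sketch below is illustrative only. `gender_flip_changes` is a hypothetical helper, the `"Male"`/`"Female"` values and the `Gender` key are assumed to match the dataset's encoding, and `toy_predict` is a stand-in that deliberately mimics the reported pattern rather than the trained classifier.

```python
def gender_flip_changes(predict, rows, gender_key="Gender"):
    """Return (row, original_pred, flipped_pred) for every row whose
    prediction changes when only the Gender field is swapped."""
    swap = {"Male": "Female", "Female": "Male"}
    changed = []
    for row in rows:
        flipped = dict(row)
        flipped[gender_key] = swap[row[gender_key]]
        before, after = predict(row), predict(flipped)
        if before != after:
            changed.append((row, before, after))
    return changed


def toy_predict(row):
    """Stand-in predictor, NOT the real model: it hard-codes a gender
    split at high BMI so the probe has something to detect."""
    if row["BMI"] < 35:
        return "Obesity_Type_I"
    return "Obesity_Type_II" if row["Gender"] == "Male" else "Obesity_Type_III"


rows = [
    {"Gender": "Male", "BMI": 38.0},
    {"Gender": "Female", "BMI": 30.0},
]
for row, before, after in gender_flip_changes(toy_predict, rows):
    print(row, before, "->", after)
```

With the real classifier, `predict` would wrap the trained pipeline; the share of rows returned by the probe gives a rough measure of how much the predicted class depends on Gender alone.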