license: mit
language:
  - en
tags:
  - tabular-regression
  - scikit-learn
  - linear-regression
  - microsoft-fabric
  - mlflow
  - diabetes
  - healthcare
datasets:
  - azure-open-datasets/diabetes
metrics:
  - r2
  - mae
  - rmse
library_name: sklearn
pipeline_tag: tabular-regression

📉 Diabetes — Disease Progression Prediction (Linear Regression)

A Linear Regression model trained on the Diabetes dataset from Azure Open Datasets to predict Y (a quantitative measure of disease progression one year after baseline).

Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.

📊 Model Details

| Property | Value |
|---|---|
| Model Type | Linear Regression |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Y (disease progression, continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Train/Test Split | 70/30 (`random_state=0`) |

📝 Features (10)

| Feature | Type | Description |
|---|---|---|
| `AGE` | int | Age of patient |
| `SEX` | int | Gender |
| `BMI` | float | Body Mass Index |
| `BP` | float | Average Blood Pressure |
| `S1` | int | Total Serum Cholesterol (tc) |
| `S2` | float | Low-Density Lipoproteins (ldl) |
| `S3` | float | High-Density Lipoproteins (hdl) |
| `S4` | float | Total Cholesterol / HDL (tch) |
| `S5` | float | Log of Serum Triglycerides (ltg) |
| `S6` | int | Blood Sugar Level (glu) |

📈 Performance

Best Model: Linear Regression

| Metric | Score |
|---|---|
| R² (Coefficient of Determination) | 0.3929 |
| MAE (Mean Absolute Error) | 44.62 |
| RMSE (Root Mean Squared Error) | 55.65 |
| CV R² (5-fold) | 0.4823 ± 0.0493 |
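
For reference, the three held-out metrics above follow the standard definitions. The sketch below computes them in plain Python on toy numbers. The helper name `regression_metrics` and the sample values are illustrative and are not taken from the model or the dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy values for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are in the units of Y, so an MAE of 44.62 means predictions are off by about 45 progression points on average.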

All Models Compared

| Model | R² | MAE | RMSE |
|---|---|---|---|
| Linear Regression | 0.3929 | 44.62 | 55.65 |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |

ℹ️ Note: The modest test R² (~0.39) is plausible for clinical data, where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). Linear Regression also beats the tree-based models on this split. A likely reason is that, with only 442 rows, the more flexible models overfit, although a single 70/30 split cannot establish that on its own.

💻 Usage

```python
import pickle
import numpy as np

# Load the model (model.pkl must be in the working directory).
# Only unpickle files from a source you trust.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input order: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: a hypothetical patient with illustrative values
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# Predict disease progression
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
```
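
Because the model takes a positional array, a swapped column silently produces a wrong prediction. A small helper that builds the row from named values can guard against that. The sketch below is illustrative: `FEATURE_ORDER` mirrors the feature table, and `to_feature_row` is a hypothetical name that is not part of this repository.

```python
# Column order expected by the model, as listed in the feature table.
FEATURE_ORDER = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def to_feature_row(patient):
    """Turn a dict of named measurements into a correctly ordered list of floats."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(patient[name]) for name in FEATURE_ORDER]


# Same illustrative values as the usage example, supplied in arbitrary order.
patient = {
    "BMI": 28.5, "AGE": 50, "SEX": 1, "BP": 90.0, "S5": 5.2,
    "S1": 200, "S2": 120.5, "S3": 45.0, "S4": 4.5, "S6": 95,
}
row = to_feature_row(patient)
print(row)
```

The result can be passed to the model as `model.predict([row])`.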

🔍 Key Insights

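S5 (Log of Serum Triglycerides) has the largest raw coefficient by far (65.8). Because S5 is on a log scale with a narrow range, a large per-unit coefficient is expected, so this alone does not show that S5 is the strongest predictor.
SEX and BMI have the 2nd and 3rd largest raw coefficients. The features are unscaled, so these magnitudes depend on each feature's units and are not directly comparable.
Simplicity wins: Linear Regression outperforms the ensemble models on this small dataset (n=442). Simpler models often generalize better when data is limited.
Stability: Cross-validation gives R² = 0.48 ± 0.05 across 5 folds, so fold-to-fold variation is moderate. The held-out test R² (0.39) is lower than the CV mean, which suggests the single 70/30 split is a somewhat harder sample.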
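
To put the stability figures in context, the held-out score can be expressed in units of the fold-to-fold standard deviation. This is a rough comparison using only the numbers reported on this card. The fold spread is not a formal standard error for the test split.

```python
# Values copied from the Performance table.
test_r2 = 0.3929
cv_mean, cv_std = 0.4823, 0.0493

# Distance of the held-out score from the CV mean, in fold standard deviations.
gap_in_stds = (test_r2 - cv_mean) / cv_std
print(f"held-out R2 is {gap_in_stds:.2f} fold standard deviations from the CV mean")
```

The gap is about 1.8 standard deviations below the mean, so the test score is on the low side but not far outside what the folds show.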

⚖️ Feature Importance

Ranked by the absolute value of the raw (unstandardized) coefficients, so the ranking reflects each feature's units as well as its influence:

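| Rank | Feature | Abs. Coefficient | Impact |
|---|---|---|---|
| 1 | S5 | 65.807 | ⭐⭐⭐⭐⭐ |
| 2 | SEX | 18.445 | ⭐⭐⭐ |
| 3 | BMI | 6.246 | ⭐⭐ |
| 4 | S4 | 3.196 | ⭐ |
| 5 | BP | 0.938 | |
| 6 | S1 | 0.694 | |
| 7 | S2 | 0.378 | |
| 8 | S3 | 0.257 | |
| 9 | AGE | 0.191 | |
| 10 | S6 | 0.111 | |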
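
A common way to make linear coefficients comparable across features is to multiply each one by its feature's standard deviation, which gives the change in Y per one standard deviation of the feature. The sketch below shows the idea on toy numbers. The feature names and values are invented for illustration and are not from the Diabetes dataset.

```python
from statistics import pstdev


def standardized_coefficients(raw_coefs, columns):
    """Scale each raw coefficient by the standard deviation of its feature column."""
    return {name: raw_coefs[name] * pstdev(columns[name]) for name in raw_coefs}


# Toy numbers for illustration only (not taken from the Diabetes dataset).
raw_coefs = {"wide_feature": 0.5, "narrow_feature": 20.0}
columns = {
    "wide_feature": [100.0, 150.0, 200.0, 250.0, 300.0],
    "narrow_feature": [4.0, 4.5, 5.0, 5.5, 6.0],
}

std_coefs = standardized_coefficients(raw_coefs, columns)
ranking = sorted(std_coefs, key=lambda name: abs(std_coefs[name]), reverse=True)
print(std_coefs)
print(ranking)
```

Here the feature with the smaller raw coefficient ranks first once its wider spread is accounted for. With the fitted model, the equivalent step would be multiplying `model.coef_` by the per-column standard deviation of the training features.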

⚠️ Intended Use

Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
Not intended for: Clinical decision-making without further validation.

🙌 Acknowledgments

Microsoft Elevate and Dicoding — for organizing Offline Workshop Training.
Azure Open Datasets — for providing the Diabetes dataset.