metadata
license: mit
language:
- en
tags:
- tabular-regression
- scikit-learn
- linear-regression
- microsoft-fabric
- mlflow
- diabetes
- healthcare
datasets:
- azure-open-datasets/diabetes
metrics:
- r2
- mae
- rmse
library_name: sklearn
pipeline_tag: tabular-regression
Diabetes: Disease Progression Prediction (Linear Regression)
A Linear Regression model trained on the Diabetes dataset from Azure Open Datasets to predict Y (a quantitative measure of disease progression one year after baseline).
Built and deployed on Microsoft Fabric during an Offline Workshop Training organized by Microsoft Elevate and Dicoding.
Model Details
| Property | Value |
|---|---|
| Model Type | Linear Regression |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Y (disease progression, continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Train/Test Split | 70/30 (random_state=0) |
Features (10)
| Feature | Type | Description |
|---|---|---|
| AGE | int | Age of patient |
| SEX | int | Gender |
| BMI | float | Body Mass Index |
| BP | float | Average Blood Pressure |
| S1 | int | Total Serum Cholesterol (tc) |
| S2 | float | Low-Density Lipoproteins (ldl) |
| S3 | float | High-Density Lipoproteins (hdl) |
| S4 | float | Total Cholesterol / HDL (tch) |
| S5 | float | Log of Serum Triglycerides (ltg) |
| S6 | int | Blood Sugar Level (glu) |
Performance
Best Model: Linear Regression
| Metric | Score |
|---|---|
| R² (Coefficient of Determination) | 0.3929 |
| MAE (Mean Absolute Error) | 44.62 |
| RMSE (Root Mean Squared Error) | 55.65 |
| CV R² (5-fold) | 0.4823 ± 0.0493 |
All Models Compared
| Model | R² | MAE | RMSE |
|---|---|---|---|
| Linear Regression | 0.3929 | 44.62 | 55.65 |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |
ℹ️ Note: An R² of ~0.39 is typical for clinical datasets where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). The linear model outperforms the tree-based models here, most likely because the small sample size (442 rows) makes the more flexible models prone to overfitting.
💻 Usage
import pickle
import numpy as np
# Load the model (ensure model.pkl is in the working directory)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
# Input: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: Patient with average stats
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])
# Predict Disease Progression
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
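Because the model takes a positional array, the feature order matters. One way to guard against mixing up columns is a small helper that builds the input row from named values. `FEATURE_ORDER` and `make_input` are hypothetical names introduced for this sketch and are not part of the published model.

```python
import numpy as np

# Column order the model expects, taken from the Features table.
FEATURE_ORDER = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def make_input(patient: dict) -> np.ndarray:
    """Return a (1, 10) array with the values arranged in FEATURE_ORDER."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return np.array([[float(patient[name]) for name in FEATURE_ORDER]])


# Same example patient as above, given by name instead of by position.
patient = {
    "AGE": 50, "SEX": 1, "BMI": 28.5, "BP": 90.0, "S1": 200,
    "S2": 120.5, "S3": 45.0, "S4": 4.5, "S5": 5.2, "S6": 95,
}
sample = make_input(patient)
print(sample.shape)
```

The resulting `sample` can be passed to `model.predict(sample)` in the same way as the hand-written array.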
Key Insights
- S5 (Log of Serum Triglycerides) is the most important predictor by far (Coefficient: 65.8), indicating a strong correlation with disease progression.
- SEX and BMI are the 2nd and 3rd most influential features.
- Simplicity wins: Linear Regression outperforms complex ensemble models on this small dataset. Simpler models often generalize better when data is limited (n=442).
- Stability: Cross-validation shows moderate stability (CV R² = 0.48 ± 0.05), suggesting the model is robust within its performance range.
⚖️ Feature Importance
Ranked by the absolute value of coefficients:
| Rank | Feature | Coef (Abs) | Impact |
|---|---|---|---|
| 1 | S5 | 65.807 | █████ |
| 2 | SEX | 18.445 | ███ |
| 3 | BMI | 6.246 | ██ |
| 4 | S4 | 3.196 | █ |
| 5 | BP | 0.938 | |
| 6 | S1 | 0.694 | |
| 7 | S2 | 0.378 | |
| 8 | S3 | 0.257 | |
| 9 | AGE | 0.191 | |
| 10 | S6 | 0.111 | |
⚠️ Intended Use
- Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
- Not intended for: Clinical decision-making without further validation.
Acknowledgments
- Microsoft Elevate and Dicoding, for organizing the Offline Workshop Training.
- Azure Open Datasets, for providing the Diabetes dataset.