🩺 Diabetes — Risk Classification (Logistic Regression)

A logistic regression classifier trained on the Diabetes dataset from Azure Open Datasets to predict Risk: whether a patient is at high risk of diabetes progression, defined as a progression score Y above 211.5 (Y > 211.5).

Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.

📊 Model Details

Property	Value
Model Type	Logistic Regression
Framework	scikit-learn
Task	Binary Classification
Target Variable	Risk (0 = Low, 1 = High, threshold: Y > 211.5)
Training Platform	Microsoft Fabric + MLflow
Dataset	Diabetes (Azure Open Datasets)
Total Samples	442
Risk Prevalence	~22% (imbalanced)
Train/Test Split	70/30 (`random_state=0`)
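
The target definition and split listed above can be sketched as follows. This is a minimal illustration, not the workshop's code: the `to_risk` helper and the sample Y values are hypothetical, and the test-set size assumes scikit-learn's `train_test_split` convention of rounding a fractional `test_size` up.

```python
import math

RISK_THRESHOLD = 211.5  # from the model card: Risk = 1 when Y > 211.5


def to_risk(y: float) -> int:
    """Map a disease-progression score Y to the binary Risk label."""
    return 1 if y > RISK_THRESHOLD else 0


# Hypothetical Y values, for illustration only.
example_scores = [151.0, 75.0, 211.5, 212.0, 310.0]
labels = [to_risk(y) for y in example_scores]
print(labels)  # [0, 0, 0, 1, 1]

# 70/30 split of 442 rows (assumes the test size is rounded up).
n_samples = 442
n_test = math.ceil(0.30 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 309 133
```

Note that a score of exactly 211.5 falls in the Low class, because the rule uses a strict inequality.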

📝 Features (10)

Feature	Type	Description
`AGE`	int	Age of patient
`SEX`	int	Gender
`BMI`	float	Body Mass Index
`BP`	float	Average Blood Pressure
`S1`	int	Total Serum Cholesterol (tc)
`S2`	float	Low-Density Lipoproteins (ldl)
`S3`	float	High-Density Lipoproteins (hdl)
`S4`	float	Total Cholesterol / HDL (tch)
`S5`	float	Log of Serum Triglycerides (ltg)
`S6`	int	Blood Sugar Level (glu)

📈 Performance

Best Model: Logistic Regression (by ROC AUC)

Metric	Score
ROC AUC	0.7888
Accuracy	0.7444
Precision	0.4286
Recall	0.5172
F1	0.4688
Log Loss	0.5103
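
As a consistency check, the reported F1 can be recomputed from the reported precision and recall. The confusion-matrix counts in the second half are not from the source: they are one set of counts, inferred here, that reproduces the reported scores on a 133-row test set, and should be treated as an assumption.

```python
# Reported scores from the table above.
precision = 0.4286
recall = 0.5172

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.4688

# Inferred (not reported) counts that are consistent with those scores.
tp, fp, fn, tn = 15, 20, 14, 84
total = tp + fp + fn + tn

metrics = {
    "precision": tp / (tp + fp),
    "recall": tp / (tp + fn),
    "accuracy": (tp + tn) / total,
    "f1": 2 * tp / (2 * tp + fp + fn),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```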

All Models Compared

Model	ROC AUC	F1	Accuracy
Logistic Regression	0.7888	0.4688	0.7444
Random Forest	0.7850	0.5161	0.7744
Gradient Boosting	0.7795	0.5625	0.7895
XGBoost	0.7732	0.5075	0.7519

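⚠️ Note on metrics: Precision and F1 are moderate largely because of class imbalance (~22% positive class). A ROC AUC of 0.79 indicates reasonably good discriminative ability. Logistic Regression has the highest AUC, but the spread across models is narrow (0.7732 to 0.7888) and the dataset is small (442 rows), so the ranking should be read with caution; a simple linear model holding its own is plausible on data of this size.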
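
To make the imbalance point concrete, the sketch below compares each model's reported accuracy with a majority-class baseline that always predicts Low risk. The 0.22 prevalence is the approximate figure from the model card, so the baseline is approximate as well.

```python
# Approximate share of High-risk cases, as stated in the model card.
prevalence = 0.22

# A classifier that always predicts Low risk is right for every Low case.
baseline_accuracy = 1 - prevalence

# Accuracy values copied from the comparison table.
reported_accuracy = {
    "Logistic Regression": 0.7444,
    "Random Forest": 0.7744,
    "Gradient Boosting": 0.7895,
    "XGBoost": 0.7519,
}

for model_name, accuracy in reported_accuracy.items():
    print(f"{model_name}: {accuracy - baseline_accuracy:+.4f} vs. baseline")

above_baseline = [
    name for name, acc in reported_accuracy.items() if acc > baseline_accuracy
]
```

Under this approximation only Gradient Boosting clears the baseline, which is one reason ROC AUC and F1 are more informative than accuracy for this task.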

💻 Usage

import pickle
import numpy as np

# Load model (ensure model.pkl is in your working directory and scikit-learn is installed).
# Only unpickle files from sources you trust.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input sample: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# 1. Predict class
prediction = model.predict(sample)
print(f"Risk Prediction: {'HIGH RISK' if prediction[0] == 1 else 'LOW RISK'}")

# 2. Predict probability
proba = model.predict_proba(sample)
print(f"Risk Probability: {proba[0][1]:.2%}")

🔍 Key Insights

S5 (log of serum triglycerides) has the largest coefficient magnitude in the model (|coef|: 0.892), making it the strongest predictor of diabetes risk here.
SEX and BMI rank 2nd and 3rd, consistent with medical literature.
All 4 models have similar ROC AUC (~0.77-0.79), suggesting that the feature set, rather than the choice of algorithm, limits performance on this task.
Logistic Regression wins on AUC; Gradient Boosting wins on F1 and Accuracy. Model choice therefore depends on the use case: ranking patients by predicted probability favors the former, hard classification at a fixed threshold favors the latter.
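
The distinction between probabilities and hard classifications can be illustrated with a small sketch. The `classify` helper and the probability values are hypothetical; in practice the probabilities would come from `model.predict_proba(...)[:, 1]`, and scikit-learn's `predict` for a binary logistic regression corresponds to a cutoff of 0.5.

```python
def classify(probabilities, threshold=0.5):
    """Turn High-risk probabilities into 0/1 labels at a chosen cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities of High risk.
probs = [0.12, 0.35, 0.48, 0.61, 0.83]

default_labels = classify(probs)        # cutoff 0.5
sensitive_labels = classify(probs, 0.3)  # lower cutoff flags more patients

print(default_labels)    # [0, 0, 0, 1, 1]
print(sensitive_labels)  # [0, 1, 1, 1, 1]
```

Lowering the cutoff trades precision for recall, which may suit a screening setting where missed High-risk cases are costlier than false alarms.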

⚖️ Feature Importance

Ranked by the absolute value of coefficients:

Rank	Feature	Coef (Abs)	Impact
1	S5	0.892	⭐⭐⭐⭐⭐
2	SEX	0.560	⭐⭐⭐
3	BMI	0.236	⭐⭐
4	S3	0.180	⭐
5	S2	0.099
6	S1	0.098
7	S4	0.096
8	BP	0.061
9	AGE	0.013
10	S6	0.006
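
The ranking above can be reproduced by sorting the absolute coefficient values. The numbers below are copied from the table; with a fitted scikit-learn `LogisticRegression` they would typically be read from `model.coef_[0]`, which is an assumption about how the table was produced. Coefficient magnitudes are directly comparable only when the features share a common scale.

```python
# Absolute coefficient values copied from the table above.
abs_coef = {
    "AGE": 0.013, "SEX": 0.560, "BMI": 0.236, "BP": 0.061, "S1": 0.098,
    "S2": 0.099, "S3": 0.180, "S4": 0.096, "S5": 0.892, "S6": 0.006,
}

# Sort features from largest to smallest magnitude.
ranked = sorted(abs_coef.items(), key=lambda item: item[1], reverse=True)

for rank, (feature, value) in enumerate(ranked, start=1):
    print(f"{rank}\t{feature}\t{value:.3f}")
```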

⚠️ Intended Use

Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
Not intended for: Clinical decision-making without further validation.

🙌 Acknowledgments

Microsoft Elevate and Dicoding: for organizing the Offline Workshop Training.
Azure Open Datasets: for providing the Diabetes dataset.
