Diabetes Risk Classification (Logistic Regression)
A Logistic Regression classifier trained on the Diabetes dataset from Azure Open Datasets to predict Risk: whether a patient has a high risk of diabetes progression (Y > 211.5).
Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.
Model Details
| Property | Value |
|---|---|
| Model Type | Logistic Regression |
| Framework | scikit-learn |
| Task | Binary Classification |
| Target Variable | Risk (0 = Low, 1 = High; High when Y > 211.5) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Risk Prevalence | ~22% (imbalanced) |
| Train/Test Split | 70/30 (random_state=0) |
Features (10)
| Feature | Type | Description |
|---|---|---|
| AGE | int | Age of patient |
| SEX | int | Gender |
| BMI | float | Body Mass Index |
| BP | float | Average Blood Pressure |
| S1 | int | Total Serum Cholesterol (tc) |
| S2 | float | Low-Density Lipoproteins (ldl) |
| S3 | float | High-Density Lipoproteins (hdl) |
| S4 | float | Total Cholesterol / HDL (tch) |
| S5 | float | Log of Serum Triglycerides (ltg) |
| S6 | int | Blood Sugar Level (glu) |
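Because the model takes a plain array, the column order in the table matters at prediction time. The helper below is a hypothetical convenience (`make_row` and `FEATURES` are not part of the published model) that builds a correctly ordered input row from named values and fails loudly when a feature is missing.

```python
import numpy as np

# Column order from the feature table; inputs must follow this order.
FEATURES = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def make_row(patient: dict) -> np.ndarray:
    """Return a (1, 10) float array ordered like FEATURES (hypothetical helper)."""
    missing = [name for name in FEATURES if name not in patient]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return np.array([[float(patient[name]) for name in FEATURES]])


# Illustrative values only, not a real patient record.
example = {"AGE": 50, "SEX": 1, "BMI": 28.5, "BP": 90.0, "S1": 200,
           "S2": 120.5, "S3": 45.0, "S4": 4.5, "S5": 5.2, "S6": 95}
row = make_row(example)
print(row.shape)  # (1, 10)
```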
Performance
Best Model: Logistic Regression (by ROC AUC)
| Metric | Score |
|---|---|
| ROC AUC | 0.7888 |
| Accuracy | 0.7444 |
| Precision | 0.4286 |
| Recall | 0.5172 |
| F1 | 0.4688 |
| Log Loss | 0.5103 |
All Models Compared
| Model | ROC AUC | F1 | Accuracy |
|---|---|---|---|
| Logistic Regression | 0.7888 | 0.4688 | 0.7444 |
| Random Forest | 0.7850 | 0.5161 | 0.7744 |
| Gradient Boosting | 0.7795 | 0.5625 | 0.7895 |
| XGBoost | 0.7732 | 0.5075 | 0.7519 |
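Note on metrics: F1 and Precision are moderate, largely because of class imbalance (~22% positive class). A ROC AUC of 0.79 indicates reasonably good discriminative ability. Logistic Regression achieves the best AUC, plausibly because simpler models tend to hold up well on a dataset this small (442 rows), although its lead over the other models is narrow (at most about 0.016).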
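To make the relationship between these metrics concrete, the sketch below recomputes them from confusion-matrix counts. The counts are not reported in the source: they are inferred as one set of values on a 133-row test set (30% of 442) that reproduces the Logistic Regression scores to four decimals, and should be read as an illustration rather than the actual confusion matrix.

```python
# Counts inferred for illustration (not reported in the source).
tp, fp, fn, tn = 15, 20, 14, 84

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of predicted-high cases that are truly high
recall = tp / (tp + fn)             # share of truly high cases that were caught
f1 = 2 * precision * recall / (precision + recall)

prevalence = (tp + fn) / total      # share of high-risk cases in the test set
# Accuracy of a trivial model that always predicts the majority (low-risk) class.
majority_baseline = 1 - prevalence

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"prevalence={prevalence:.4f} majority_baseline={majority_baseline:.4f}")
```

Under these inferred counts, always predicting low risk would score roughly 0.78 accuracy, which is why accuracy alone is a weak guide on imbalanced data and why ROC AUC, precision, and recall are reported alongside it.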
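Usage
```python
import pickle
import numpy as np

# Load the trained classifier (only unpickle files from a trusted source).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# One patient, features in table order: AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6.
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# Hard class label: 1 = high risk, 0 = low risk.
prediction = model.predict(sample)
print(f"Risk Prediction: {'HIGH RISK' if prediction[0] == 1 else 'LOW RISK'}")

# Probability of the positive (high-risk) class.
proba = model.predict_proba(sample)
print(f"Risk Probability: {proba[0][1]:.2%}")
```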
Key Insights
- S5 (log of serum triglycerides) is the strongest predictor of diabetes risk (|coef|: 0.892).
- SEX and BMI rank 2nd and 3rd, consistent with medical literature.
- All 4 models have similar ROC AUC (~0.77-0.79), suggesting the feature set imposes a ceiling on this task.
- Logistic Regression wins on AUC; Gradient Boosting wins on F1 and Accuracy. Model choice depends on the use case (probability ranking vs. hard classification).
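The difference between probability output and hard classification comes down to a decision threshold. For a binary scikit-learn `LogisticRegression`, `predict` is equivalent to a 0.5 cut on the positive-class probability; lowering the cut flags more patients as high risk, trading precision for recall. The probabilities and the 0.3 threshold below are made up for illustration, and `classify` is a hypothetical helper.

```python
import numpy as np


def classify(proba_high, threshold=0.5):
    """Label 1 (high risk) when the positive-class probability exceeds threshold."""
    return (np.asarray(proba_high) > threshold).astype(int)


# Made-up positive-class probabilities, e.g. model.predict_proba(X)[:, 1].
proba_high = np.array([0.08, 0.35, 0.49, 0.62, 0.91])

default_labels = classify(proba_high)           # 0.5 cut-off, like predict()
screening_labels = classify(proba_high, 0.3)    # hypothetical lower cut-off

print(default_labels.tolist())    # [0, 0, 0, 1, 1]
print(screening_labels.tolist())  # [0, 1, 1, 1, 1]
```

A model with a higher ROC AUC ranks patients better across all such thresholds, while F1 and Accuracy describe performance at one particular threshold.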
Feature Importance
Ranked by the absolute value of coefficients:
| Rank | Feature | Coef (Abs) | Impact |
|---|---|---|---|
| 1 | S5 | 0.892 | █████ |
| 2 | SEX | 0.560 | ███ |
| 3 | BMI | 0.236 | ██ |
| 4 | S3 | 0.180 | █ |
| 5 | S2 | 0.099 | |
| 6 | S1 | 0.098 | |
| 7 | S4 | 0.096 | |
| 8 | BP | 0.061 | |
| 9 | AGE | 0.013 | |
| 10 | S6 | 0.006 | |
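A ranking like this can be derived from a fitted model's `coef_` attribute. The sketch below uses synthetic stand-in data and a hypothetical helper, `rank_by_abs_coef`, so it shows the mechanics only and does not reproduce the table. Coefficient magnitudes are comparable across features only when the features share a scale; whether the inputs were standardized before fitting is not stated here, so the ranking is best read with that in mind.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def rank_by_abs_coef(model, feature_names):
    """Return (name, |coef|) pairs sorted from largest to smallest magnitude."""
    # For a binary task, coef_ has shape (1, n_features).
    magnitudes = np.abs(model.coef_[0])
    order = np.argsort(magnitudes)[::-1]
    return [(feature_names[i], float(magnitudes[i])) for i in order]


# Synthetic stand-in data: only the second column drives the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 1] + 0.1 * rng.normal(size=400) > 0).astype(int)

demo_model = LogisticRegression().fit(X, y)
ranking = rank_by_abs_coef(demo_model, ["f0", "f1", "f2"])
for rank, (name, magnitude) in enumerate(ranking, start=1):
    print(rank, name, round(magnitude, 3))
```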
Intended Use
- Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
- Not intended for: Clinical decision-making without further validation.
Acknowledgments
- Microsoft Elevate and Dicoding, for organizing the Offline Workshop Training.
- Azure Open Datasets, for providing the Diabetes dataset.