🩺 Diabetes: Risk Classification (Logistic Regression)

A Logistic Regression classifier trained on the Diabetes dataset from Azure Open Datasets to predict Risk: whether a patient has a high risk of diabetes progression, defined as Y > 211.5, where Y is the dataset's quantitative disease-progression measure.

Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.

📊 Model Details

| Property | Value |
|---|---|
| Model Type | Logistic Regression |
| Framework | scikit-learn |
| Task | Binary Classification |
| Target Variable | Risk (0 = Low, 1 = High; threshold: Y > 211.5) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Risk Prevalence | ~22% (imbalanced) |
| Train/Test Split | 70/30 (random_state=0) |
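
The target definition and split sizes in the table can be sketched in a few lines. This is a minimal illustration, not the workshop's actual code: `RISK_THRESHOLD`, `to_risk_label`, and the sample `y_values` are hypothetical names and made-up numbers, and the split sizes assume the test count for a fractional test size is rounded up, as scikit-learn's `train_test_split` does.

```python
import math

RISK_THRESHOLD = 211.5  # threshold on Y taken from the table above


def to_risk_label(y: float) -> int:
    """Return 1 (High) when Y exceeds the threshold, else 0 (Low)."""
    return int(y > RISK_THRESHOLD)


# Made-up progression scores, only to show the labelling rule.
y_values = [97.0, 211.5, 212.0, 310.0]
risk_labels = [to_risk_label(y) for y in y_values]

# Split sizes for 442 rows at 70/30, assuming the test count is rounded up.
n_samples = 442
n_test = math.ceil(0.30 * n_samples)
n_train = n_samples - n_test

print(risk_labels)
print(n_train, n_test)
```

Note that a value exactly equal to 211.5 is labelled Low, because the rule is a strict inequality.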

Features (10)

| Feature | Type | Description |
|---|---|---|
| AGE | int | Age of patient |
| SEX | int | Gender |
| BMI | float | Body Mass Index |
| BP | float | Average Blood Pressure |
| S1 | int | Total Serum Cholesterol (tc) |
| S2 | float | Low-Density Lipoproteins (ldl) |
| S3 | float | High-Density Lipoproteins (hdl) |
| S4 | float | Total Cholesterol / HDL (tch) |
| S5 | float | Log of Serum Triglycerides (ltg) |
| S6 | int | Blood Sugar Level (glu) |

📈 Performance

Best Model: Logistic Regression (by ROC AUC)

| Metric | Score |
|---|---|
| ROC AUC | 0.7888 |
| Accuracy | 0.7444 |
| Precision | 0.4286 |
| Recall | 0.5172 |
| F1 | 0.4688 |
| Log Loss | 0.5103 |
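
As a sanity check, the threshold-based scores above are mutually consistent. The sketch below recomputes them from one set of confusion-matrix counts that reproduces the reported values on a 133-row test set. The counts (`tp`, `fp`, `fn`, `tn`) are an assumption inferred from the metrics, not figures published with the model.

```python
# Assumed confusion-matrix counts (inferred, not published with the model).
tp, fp, fn, tn = 15, 20, 14, 84

total = tp + fp + fn + tn
precision = tp / (tp + fp)          # share of predicted High that are truly High
recall = tp / (tp + fn)             # share of true High that were caught
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
accuracy = (tp + tn) / total
prevalence = (tp + fn) / total      # share of High-risk rows in this split

for name, value in [
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("accuracy", accuracy),
    ("prevalence", prevalence),
]:
    print(f"{name:<10} {value:.4f}")
```

ROC AUC and log loss depend on the predicted probabilities, so they cannot be recovered from these counts alone.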

All Models Compared

| Model | ROC AUC | F1 | Accuracy |
|---|---|---|---|
| Logistic Regression | 0.7888 | 0.4688 | 0.7444 |
| Random Forest | 0.7850 | 0.5161 | 0.7744 |
| Gradient Boosting | 0.7795 | 0.5625 | 0.7895 |
| XGBoost | 0.7732 | 0.5075 | 0.7519 |
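
The comparison can also be read programmatically to see which model leads on each metric. This is a small sketch using only the numbers above; the `results` dictionary and the `best_by` helper are illustrative names, not part of the released artifacts.

```python
results = {
    "Logistic Regression": {"roc_auc": 0.7888, "f1": 0.4688, "accuracy": 0.7444},
    "Random Forest": {"roc_auc": 0.7850, "f1": 0.5161, "accuracy": 0.7744},
    "Gradient Boosting": {"roc_auc": 0.7795, "f1": 0.5625, "accuracy": 0.7895},
    "XGBoost": {"roc_auc": 0.7732, "f1": 0.5075, "accuracy": 0.7519},
}


def best_by(metric: str) -> str:
    """Return the model name with the highest score for the given metric."""
    return max(results, key=lambda model: results[model][metric])


leaders = {metric: best_by(metric) for metric in ("roc_auc", "f1", "accuracy")}

# Gap between the best and worst ROC AUC across the four models.
auc_values = [scores["roc_auc"] for scores in results.values()]
auc_spread = max(auc_values) - min(auc_values)

print(leaders)
print(f"ROC AUC spread: {auc_spread:.4f}")
```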

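⚠️ Note on metrics: F1 and Precision are moderate largely because of class imbalance (~22% positive class). A ROC AUC of 0.79 indicates reasonable discriminative ability. Logistic Regression has the highest AUC, though its margin over the other models is small (0.7888 vs. 0.7732-0.7850); on a dataset this small (442 rows), a simple linear model can match or outperform more flexible ones.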
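
To put the imbalance in context, it helps to compare the reported scores against trivial baselines. The sketch below assumes the ~22% prevalence quoted above holds for the evaluation data (an assumption; the card does not say which split the figure refers to) and uses hypothetical variable names.

```python
prevalence = 0.22         # approximate share of High-risk cases (from the note above)
model_accuracy = 0.7444   # reported Logistic Regression accuracy
model_precision = 0.4286  # reported Logistic Regression precision

# Always predicting "Low" is correct whenever the true label is Low.
majority_baseline_accuracy = 1.0 - prevalence

# Flagging patients at random yields an expected precision equal to the prevalence.
random_baseline_precision = prevalence
precision_lift = model_precision / random_baseline_precision

print(f"Always-Low accuracy:   {majority_baseline_accuracy:.2f}")
print(f"Model accuracy:        {model_accuracy:.4f}")
print(f"Precision lift vs. random flagging: {precision_lift:.2f}x")
```

Under these assumptions the always-Low baseline reaches about 0.78 accuracy, slightly above the model's 0.7444, while the model's precision is roughly twice the random baseline. This is why ROC AUC, precision, and recall say more about this model than accuracy does.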

💻 Usage

import pickle
import numpy as np

# Load model (ensure model.pkl is in your path)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input sample: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# 1. Predict class
prediction = model.predict(sample)
print(f"Risk Prediction: {'HIGH RISK' if prediction[0] == 1 else 'LOW RISK'}")

# 2. Predict probability
proba = model.predict_proba(sample)
print(f"Risk Probability: {proba[0][1]:.2%}")
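
For a binary scikit-learn logistic regression, `predict` corresponds to a 0.5 cut-off on the positive-class probability. With an imbalanced target, a different cut-off may suit some uses better. Below is a minimal sketch of applying a custom threshold to the probabilities returned by `predict_proba`. The `classify_with_threshold` helper and the example probabilities are hypothetical, and any threshold should be chosen on validation data rather than taken from here.

```python
def classify_with_threshold(high_risk_probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a custom cut-off."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    return [int(p > threshold) for p in high_risk_probs]


# Made-up positive-class probabilities; with the loaded model these would
# come from model.predict_proba(X)[:, 1].
example_probs = [0.12, 0.35, 0.48, 0.71]

default_labels = classify_with_threshold(example_probs)
sensitive_labels = classify_with_threshold(example_probs, threshold=0.3)

print(default_labels)
print(sensitive_labels)
```

Lowering the threshold flags more patients as High risk, which generally trades precision for recall.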

Key Insights

  • S5 (log of serum triglycerides) has the largest absolute coefficient (|coef| = 0.892), making it the strongest predictor of diabetes risk in this model.
  • SEX and BMI rank 2nd and 3rd, consistent with medical literature.
  • All 4 models have similar ROC AUC (~0.77-0.79), suggesting the feature set imposes a ceiling for this task.
  • Logistic Regression wins on AUC; Gradient Boosting wins on F1 and Accuracy. Model choice depends on the use case (probability ranking vs. hard classification).

⚖️ Feature Importance

Ranked by the absolute value of coefficients:

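| Rank | Feature | Coef (Abs) | Impact |
|---|---|---|---|
| 1 | S5 | 0.892 | ⭐⭐⭐⭐⭐ |
| 2 | SEX | 0.560 | ⭐⭐⭐ |
| 3 | BMI | 0.236 | ⭐⭐ |
| 4 | S3 | 0.180 | ⭐ |
| 5 | S2 | 0.099 | |
| 6 | S1 | 0.098 | |
| 7 | S4 | 0.096 | |
| 8 | BP | 0.061 | |
| 9 | AGE | 0.013 | |
| 10 | S6 | 0.006 | |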
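
The ranking above can be reproduced by sorting features on coefficient magnitude. The sketch below uses the absolute values from the table rather than a loaded model. With the fitted estimator one would typically start from `model.coef_[0]`, assuming the pickled object is a plain scikit-learn `LogisticRegression`, which the card does not confirm. Coefficient magnitudes are only comparable across features when the inputs share a common scale, and the card does not state whether the features were standardized.

```python
# Absolute coefficients from the table, listed in model input order.
abs_coefs = {
    "AGE": 0.013,
    "SEX": 0.560,
    "BMI": 0.236,
    "BP": 0.061,
    "S1": 0.098,
    "S2": 0.099,
    "S3": 0.180,
    "S4": 0.096,
    "S5": 0.892,
    "S6": 0.006,
}

# Sort by magnitude, largest first.
ranked = sorted(abs_coefs.items(), key=lambda item: item[1], reverse=True)

for rank, (feature, magnitude) in enumerate(ranked, start=1):
    print(f"{rank:>2}  {feature:<4} {magnitude:.3f}")

top3 = [feature for feature, _ in ranked[:3]]
```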

⚠️ Intended Use

  • Primary: Educational / demonstration of ML workflow on Microsoft Fabric.
  • Not intended for: Clinical decision-making without further validation.

🙌 Acknowledgments

  • Microsoft Elevate and Dicoding, for organizing the Offline Workshop Training.
  • Azure Open Datasets, for providing the Diabetes dataset.
