---
license: mit
language:
- en
tags:
- tabular-regression
- scikit-learn
- linear-regression
- microsoft-fabric
- mlflow
- diabetes
- healthcare
datasets:
- azure-open-datasets/diabetes
metrics:
- r2
- mae
- rmse
library_name: sklearn
pipeline_tag: tabular-regression
---

# 📉 Diabetes: Disease Progression Prediction (Linear Regression)

A **Linear Regression** model trained on the **Diabetes dataset** from Azure Open Datasets to predict **Y** (a quantitative measure of disease progression one year after baseline).

Built and deployed on **Microsoft Fabric** during the **Offline Workshop Training** organized by **Microsoft Elevate** and **Dicoding**.

## 📊 Model Details

| Property | Value |
| :--- | :--- |
| **Model Type** | Linear Regression |
| **Framework** | scikit-learn |
| **Task** | Tabular Regression |
| **Target Variable** | Y (disease progression, continuous) |
| **Training Platform** | Microsoft Fabric + MLflow |
| **Dataset** | Diabetes (Azure Open Datasets) |
| **Total Samples** | 442 |
| **Train/Test Split** | 70/30 (`random_state=0`) |

## 📝 Features (10)

| Feature | Type | Description |
| :--- | :--- | :--- |
| `AGE` | int | Age of patient |
| `SEX` | int | Gender |
| `BMI` | float | Body Mass Index |
| `BP` | float | Average Blood Pressure |
| `S1` | int | Total Serum Cholesterol (tc) |
| `S2` | float | Low-Density Lipoproteins (ldl) |
| `S3` | float | High-Density Lipoproteins (hdl) |
| `S4` | float | Total Cholesterol / HDL (tch) |
| `S5` | float | Log of Serum Triglycerides (ltg) |
| `S6` | int | Blood Sugar Level (glu) |

## 📈 Performance

### Best Model: Linear Regression

Unless noted otherwise, scores are measured on the held-out 30% test split.

| Metric | Score |
| :--- | :--- |
| **R² (Coefficient of Determination)** | **0.3929** |
| **MAE (Mean Absolute Error)** | 44.62 |
| **RMSE (Root Mean Squared Error)** | 55.65 |
| **CV R² (5-fold)** | 0.4823 ± 0.0493 |

### All Models Compared

| Model | R² | MAE | RMSE |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | **0.3929** | **44.62** | **55.65** |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |

> **ℹ️ Note:** An R² of **~0.39** is not unusual for clinical datasets where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). Here the **linear model outperforms the tree-based models**, plausibly because the small sample size (442 rows) makes the more flexible models prone to overfitting.

## 💻 Usage

```python
import pickle

import numpy as np

# Load the model (ensure model.pkl is in the working directory)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input order: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: a patient with roughly average values
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# Predict disease progression
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
```

## 🔍 Key Insights

* **S5 (Log of Serum Triglycerides)** has by far the largest absolute coefficient (`65.8`), meaning the predicted Y changes most per unit increase in S5. Coefficients are expressed in each feature's own units, so this ranking depends on feature scale as well as on predictive strength.
* **SEX** and **BMI** have the 2nd and 3rd largest absolute coefficients.
* **Simplicity wins:** Linear Regression outperforms the more complex ensemble models on this small dataset. Simpler models often generalize better when data is limited (`n=442`).
* **Stability:** The 5-fold cross-validation score (`CV R² = 0.48 ± 0.05`) varies only moderately across folds. The held-out test R² (`0.39`) is lower than the cross-validation mean.

## ⚖️ Feature Importance

Ranked by the absolute value of the model coefficients:

| Rank | Feature | Coef (Abs) | Impact |
| --- | --- | --- | --- |
| 1 | **S5** | 65.807 | ⭐⭐⭐⭐⭐ |
| 2 | **SEX** | 18.445 | ⭐⭐⭐ |
| 3 | **BMI** | 6.246 | ⭐⭐ |
| 4 | **S4** | 3.196 | ⭐ |
| 5 | **BP** | 0.938 | |
| 6 | **S1** | 0.694 | |
| 7 | **S2** | 0.378 | |
| 8 | **S3** | 0.257 | |
| 9 | **AGE** | 0.191 | |
| 10 | **S6** | 0.111 | |

## ⚠️ Intended Use

* **Primary:** Educational / demonstration of ML workflow on Microsoft Fabric.
* **Not intended for:** Clinical decision-making without further validation.

## 🙌 Acknowledgments

* **Microsoft Elevate** and **Dicoding**: for organizing the Offline Workshop Training.
* **Azure Open Datasets**: for providing the Diabetes dataset.
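
## 🧮 How the Metrics Are Defined (Sketch)

The Performance section reports R², MAE, and RMSE without saying how they were computed. The block below is a minimal pure-Python sketch of the standard definitions. The helper names (`mae`, `rmse`, `r2`) and the toy values are made up for illustration and do not come from the Diabetes dataset or the original training notebook. scikit-learn's `r2_score` and `mean_absolute_error` compute the same quantities. This card does not state which functions were actually used.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: like MAE, but penalizes large errors more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """R²: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration only (NOT taken from the dataset).
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]

scores = {
    "r2": r2(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
}
print(scores)
```

MAE and RMSE are in the same units as Y, so the reported MAE of 44.62 means predictions are off by about 45 progression units on average. R² is unitless and compares the model against always predicting the mean of Y.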