---
license: mit
language:
- en
tags:
- tabular-regression
- scikit-learn
- linear-regression
- microsoft-fabric
- mlflow
- diabetes
- healthcare
datasets:
- azure-open-datasets/diabetes
metrics:
- r2
- mae
- rmse
library_name: sklearn
pipeline_tag: tabular-regression
---

# 📉 Diabetes: Disease Progression Prediction (Linear Regression)

A **Linear Regression** model trained on the **Diabetes dataset** from Azure Open Datasets to predict **Y**, a quantitative measure of disease progression one year after baseline.

Built and deployed on **Microsoft Fabric** during the **Offline Workshop Training**, organized by **Microsoft Elevate** and **Dicoding**.

## 📊 Model Details

| Property | Value |
| :--- | :--- |
| **Model Type** | Linear Regression |
| **Framework** | scikit-learn |
| **Task** | Tabular Regression |
| **Target Variable** | Y (disease progression, continuous) |
| **Training Platform** | Microsoft Fabric + MLflow |
| **Dataset** | Diabetes (Azure Open Datasets) |
| **Total Samples** | 442 |
| **Train/Test Split** | 70/30 (`random_state=0`) |

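The 70/30 split with `random_state=0` listed in the table can be reproduced with scikit-learn's `train_test_split`. The sketch below is only an illustration under stated assumptions: it uses synthetic stand-in data with the same shape as the dataset (442 rows, 10 features) instead of the real Azure Open Datasets values, and the variable names are hypothetical, not taken from the workshop notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): same shape as the Diabetes dataset,
# NOT the real values.
rng = np.random.default_rng(0)
X = rng.normal(size=(442, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=442)

# 70/30 split with the seed stated in the Model Details table.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit an ordinary least-squares model on the training portion only.
model = LinearRegression().fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(f"Test R² on synthetic data: {model.score(X_test, y_test):.3f}")
```

With 442 rows, a 30% test fraction yields 133 test rows and 309 training rows. The printed R² describes the synthetic data only and says nothing about the scores reported in this card.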
## 📝 Features (10)

| Feature | Type | Description |
| :--- | :--- | :--- |
| `AGE` | int | Age of patient |
| `SEX` | int | Gender |
| `BMI` | float | Body Mass Index |
| `BP` | float | Average Blood Pressure |
| `S1` | int | Total Serum Cholesterol (tc) |
| `S2` | float | Low-Density Lipoproteins (ldl) |
| `S3` | float | High-Density Lipoproteins (hdl) |
| `S4` | float | Total Cholesterol / HDL (tch) |
| `S5` | float | Log of Serum Triglycerides (ltg) |
| `S6` | int | Blood Sugar Level (glu) |

## 📈 Performance

### Best Model: Linear Regression

| Metric | Score |
| :--- | :--- |
| **R² (Coefficient of Determination)** | **0.3929** |
| **MAE (Mean Absolute Error)** | 44.62 |
| **RMSE (Root Mean Squared Error)** | 55.65 |
| **CV R² (5-fold)** | 0.4823 ± 0.0493 |

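For readers unfamiliar with the three metrics, the sketch below computes R², MAE, and RMSE from their standard definitions in plain Python. The helper name `regression_metrics` and the sample values are hypothetical; they are not the model's test-set predictions.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R², MAE, and RMSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Hypothetical values for illustration, not taken from the model's test set.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are expressed in the units of Y, while R² is unitless. Because RMSE squares the errors before averaging, it is never smaller than MAE, which is consistent with the reported 55.65 versus 44.62.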
### All Models Compared

| Model | R² | MAE | RMSE |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | **0.3929** | **44.62** | **55.65** |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |

> **ℹ️ Note:** An R² of **~0.39** is not unusual for clinical datasets where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). **Linear Regression outperforms the tree-based models** here, most likely because the small sample size (442 rows) makes the more flexible models prone to overfitting.

## 💻 Usage

```python
import pickle
import numpy as np

# Load model (ensure model.pkl is in the working directory)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input order: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: Patient with average stats
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# Predict disease progression (returns an array with one value per input row)
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
```

## 🔍 Key Insights

* **S5 (Log of Serum Triglycerides)** has by far the largest coefficient (`65.8`), which points to a strong association with disease progression. Coefficient magnitudes are only directly comparable when features share a scale, so if the inputs were not standardized, part of this gap reflects the narrow numeric range of a log-transformed feature.
* **SEX** and **BMI** have the 2nd and 3rd largest coefficients.
* **Simplicity wins:** Linear Regression outperforms the ensemble models on this small dataset. Simpler models often generalize better when data is limited (`n=442`).
* **Stability:** Cross-validation shows moderate stability (`CV R² = 0.48 ± 0.05`). The held-out test R² (0.39) sits below the cross-validation mean, so a single 70/30 split on a dataset this small should be read with some caution.

## ⚖️ Feature Importance

Ranked by the absolute value of coefficients:

| Rank | Feature | Coef (Abs) | Impact |
| --- | --- | --- | --- |
| 1 | **S5** | 65.807 | ⭐⭐⭐⭐⭐ |
| 2 | **SEX** | 18.445 | ⭐⭐⭐ |
| 3 | **BMI** | 6.246 | ⭐⭐ |
| 4 | **S4** | 3.196 | ⭐ |
| 5 | **BP** | 0.938 | |
| 6 | **S1** | 0.694 | |
| 7 | **S2** | 0.378 | |
| 8 | **S3** | 0.257 | |
| 9 | **AGE** | 0.191 | |
| 10 | **S6** | 0.111 | |

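A ranking like the one in the table can be produced by sorting features on the absolute value of their coefficients. The sketch below is a minimal illustration: `rank_by_abs_coef` is a hypothetical helper, and the numbers are the absolute values copied from the table, since this card does not report coefficient signs. With a fitted scikit-learn model, `model.coef_` would be passed instead.

```python
def rank_by_abs_coef(names, coefs):
    """Sort (name, |coef|) pairs from largest to smallest magnitude."""
    pairs = [(name, abs(c)) for name, c in zip(names, coefs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Column order used by the model input.
feature_names = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]

# Absolute coefficients copied from the table; the signs are not reported
# in this card, so positive values are used as placeholders.
coefs = [0.191, 18.445, 6.246, 0.938, 0.694, 0.378, 0.257, 3.196, 65.807, 0.111]

ranking = rank_by_abs_coef(feature_names, coefs)
for rank, (name, value) in enumerate(ranking, start=1):
    print(f"{rank:>2}  {name:<4} {value:.3f}")
```

This ordering depends on the units of each feature. Standardizing the inputs before fitting, or multiplying each coefficient by its feature's standard deviation, would give a ranking that is less sensitive to scale.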
## ⚠️ Intended Use

* **Primary:** Educational / demonstration of ML workflow on Microsoft Fabric.
* **Not intended for:** Clinical decision-making without further validation.

## 🙌 Acknowledgments

* **Microsoft Elevate** and **Dicoding**, for organizing the Offline Workshop Training.
* **Azure Open Datasets**, for providing the Diabetes dataset.