---
license: mit
language:
- en
tags:
- tabular-regression
- scikit-learn
- linear-regression
- microsoft-fabric
- mlflow
- diabetes
- healthcare
datasets:
- azure-open-datasets/diabetes
metrics:
- r2
- mae
- rmse
library_name: sklearn
pipeline_tag: tabular-regression
---
# 📉 Diabetes — Disease Progression Prediction (Linear Regression)
A **Linear Regression** model trained on the **Diabetes dataset** from Azure Open Datasets to predict **Y** (a quantitative measure of disease progression one year after baseline).
Built and deployed on **Microsoft Fabric** during the **Offline Workshop Training** organized by **Microsoft Elevate** and **Dicoding**.
## 📊 Model Details
| Property | Value |
| :--- | :--- |
| **Model Type** | Linear Regression |
| **Framework** | scikit-learn |
| **Task** | Tabular Regression |
| **Target Variable** | Y (disease progression, continuous) |
| **Training Platform** | Microsoft Fabric + MLflow |
| **Dataset** | Diabetes (Azure Open Datasets) |
| **Total Samples** | 442 |
| **Train/Test Split** | 70/30 (`random_state=0`) |
## 📝 Features (10)
| Feature | Type | Description |
| :--- | :--- | :--- |
| `AGE` | int | Age of patient |
| `SEX` | int | Gender |
| `BMI` | float | Body Mass Index |
| `BP` | float | Average Blood Pressure |
| `S1` | int | Total Serum Cholesterol (tc) |
| `S2` | float | Low-Density Lipoproteins (ldl) |
| `S3` | float | High-Density Lipoproteins (hdl) |
| `S4` | float | Total Cholesterol / HDL (tch) |
| `S5` | float | Log of Serum Triglycerides (ltg) |
| `S6` | int | Blood Sugar Level (glu) |
## 📈 Performance
### Best Model: Linear Regression
| Metric | Score |
| :--- | :--- |
| **R² (Coefficient of Determination)** | **0.3929** |
| **MAE (Mean Absolute Error)** | 44.62 |
| **RMSE (Root Mean Squared Error)** | 55.65 |
| **CV R² (5-fold)** | 0.4823 ± 0.0493 |
### All Models Compared
| Model | R² | MAE | RMSE |
| :--- | :--- | :--- | :--- |
| **Linear Regression** | **0.3929** | **44.62** | **55.65** |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |
> **ℹ️ Note:** An R² of about **0.39** means the model explains roughly 39% of the variance in Y on the test split. Modest scores like this are common for clinical data, where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). The **linear model also outperforms the tree-based models** here, most likely because 442 rows are too few for the more flexible ensembles to fit without overfitting.
## 💻 Usage
```python
import pickle
import numpy as np
# Load model (ensure model.pkl is in the directory)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
# Input: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: Patient with average stats
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])
# Predict Disease Progression
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
```
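
Because the model takes a plain array, the column order matters. One way to guard against ordering mistakes is to build the row from named values. The helper below is a hypothetical convenience (`to_model_input` and `FEATURE_ORDER` are not part of this repository); it only assumes the column order given in the Features table.

```python
import numpy as np

# Column order the model expects, taken from the Features table.
FEATURE_ORDER = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def to_model_input(patient: dict) -> np.ndarray:
    """Return a (1, 10) float array with values in the model's column order."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return np.array([[float(patient[name]) for name in FEATURE_ORDER]])


# Keys can be given in any order; the helper reorders them.
patient = {
    "BMI": 28.5, "AGE": 50, "SEX": 1, "BP": 90.0, "S1": 200,
    "S2": 120.5, "S3": 45.0, "S4": 4.5, "S5": 5.2, "S6": 95,
}
sample = to_model_input(patient)
print(sample.shape)
```

The resulting `sample` can be passed to `model.predict(sample)` exactly like the hand-written array in the usage example.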
## 🔍 Key Insights
* **S5 (Log of Serum Triglycerides)** has the largest coefficient by far (`65.8`). Because the features are not standardized, part of that magnitude reflects S5's narrow log-scale range rather than predictive strength alone.
* **SEX** and **BMI** have the 2nd and 3rd largest absolute coefficients.
* **Simplicity wins:** Linear Regression outperforms complex ensemble models on this small dataset. Simpler models often generalize better when data is limited (`n=442`).
* **Stability:** 5-fold cross-validation gives `CV R² = 0.48 ± 0.05`, so fold-to-fold variation is moderate. The single test-split R² (`0.39`) sits below the CV mean, which shows that the score depends noticeably on how the data is split.
## ⚖️ Feature Importance
Ranked by the absolute value of coefficients:
| Rank | Feature | Coef (Abs) | Impact |
| --- | --- | --- | --- |
| 1 | **S5** | 65.807 | ⭐⭐⭐⭐⭐ |
| 2 | **SEX** | 18.445 | ⭐⭐⭐ |
| 3 | **BMI** | 6.246 | ⭐⭐ |
| 4 | **S4** | 3.196 | ⭐ |
| 5 | **BP** | 0.938 | |
| 6 | **S1** | 0.694 | |
| 7 | **S2** | 0.378 | |
| 8 | **S3** | 0.257 | |
| 9 | **AGE** | 0.191 | |
| 10 | **S6** | 0.111 | |
## ⚠️ Intended Use
* **Primary:** Educational / demonstration of ML workflow on Microsoft Fabric.
* **Not intended for:** Clinical decision-making without further validation.
## 🙌 Acknowledgments
* **Microsoft Elevate** and **Dicoding** — for organizing Offline Workshop Training.
* **Azure Open Datasets** — for providing the Diabetes dataset.