---
license: mit
language:
- en
tags:
- tabular-regression
- scikit-learn
- linear-regression
- microsoft-fabric
- mlflow
- diabetes
- healthcare
datasets:
- azure-open-datasets/diabetes
metrics:
- r2
- mae
- rmse
library_name: sklearn
pipeline_tag: tabular-regression
---

# 📉 Diabetes: Disease Progression Prediction (Linear Regression)

A linear regression model trained on the Diabetes dataset from Azure Open Datasets. It predicts `Y`, a quantitative measure of disease progression one year after baseline.

Built and deployed on Microsoft Fabric during the Offline Workshop Training organized by Microsoft Elevate and Dicoding.

## 📊 Model Details

| Property | Value |
| :--- | :--- |
| Model Type | Linear Regression |
| Framework | scikit-learn |
| Task | Tabular Regression |
| Target Variable | Y (disease progression, continuous) |
| Training Platform | Microsoft Fabric + MLflow |
| Dataset | Diabetes (Azure Open Datasets) |
| Total Samples | 442 |
| Train/Test Split | 70/30 (`random_state=0`) |
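To make the 70/30 split concrete, here is a minimal sketch of a seeded shuffle split. It assumes the test-set size is rounded up, as scikit-learn's `train_test_split` does for a fractional `test_size`; the card does not say which splitting function was used. The helper name `shuffle_split` is hypothetical, and its shuffle does not reproduce scikit-learn's exact row assignment for `random_state=0`.

```python
import math
import random


def shuffle_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) for a seeded shuffle split."""
    # Round the test size up, so 0.3 * 442 = 132.6 becomes 133 rows.
    n_test = math.ceil(test_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(442)
print(len(train_idx), len(test_idx))
```

Under that rounding assumption, 442 samples give 309 training rows and 133 test rows.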

## 📝 Features (10)

| Feature | Type | Description |
| :--- | :--- | :--- |
| `AGE` | int | Age of patient |
| `SEX` | int | Gender |
| `BMI` | float | Body Mass Index |
| `BP` | float | Average Blood Pressure |
| `S1` | int | Total Serum Cholesterol (tc) |
| `S2` | float | Low-Density Lipoproteins (ldl) |
| `S3` | float | High-Density Lipoproteins (hdl) |
| `S4` | float | Total Cholesterol / HDL (tch) |
| `S5` | float | Log of Serum Triglycerides (ltg) |
| `S6` | int | Blood Sugar Level (glu) |

## 📈 Performance

### Best Model: Linear Regression

| Metric | Score |
| :--- | :--- |
| R² (Coefficient of Determination) | 0.3929 |
| MAE (Mean Absolute Error) | 44.62 |
| RMSE (Root Mean Squared Error) | 55.65 |
| CV R² (5-fold) | 0.4823 ± 0.0493 |
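For readers less familiar with these metrics, the sketch below spells out their standard definitions in plain Python. The function name `regression_metrics` is hypothetical and the numbers are toy values, not drawn from the diabetes data. In practice `sklearn.metrics` provides `r2_score` and `mean_absolute_error` for the same purpose.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², MAE and RMSE from two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Toy values for illustration only.
toy = regression_metrics([100, 150, 200], [110, 140, 205])
print(toy)
```

MAE and RMSE are in the units of `Y`, while R² is unitless and compares the model against always predicting the mean.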

### All Models Compared

| Model | R² | MAE | RMSE |
| :--- | :--- | :--- | :--- |
| Linear Regression | 0.3929 | 44.62 | 55.65 |
| Random Forest | 0.3011 | 47.86 | 59.71 |
| XGBoost | 0.2026 | 48.93 | 63.78 |
| Gradient Boosting | 0.1823 | 51.44 | 64.59 |

> ℹ️ Note: An R² of about 0.39 is not unusual for clinical data, where disease progression depends on many unmeasured factors (genetics, lifestyle, diet). The linear model outperforms the tree-based models here, most likely because the dataset is small (442 rows) and the more flexible models overfit.

## 💻 Usage

```python
import pickle
import numpy as np

# Load the model (model.pkl must be in the working directory).
# Only unpickle files from a source you trust.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Input order: [AGE, SEX, BMI, BP, S1, S2, S3, S4, S5, S6]
# Example: one patient with roughly average values
sample = np.array([[50, 1, 28.5, 90.0, 200, 120.5, 45.0, 4.5, 5.2, 95]])

# Predict disease progression
prediction = model.predict(sample)
print(f"Predicted Disease Progression (Y): {prediction[0]:.2f}")
```
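Because the model takes a positional array, a value in the wrong column silently produces a wrong prediction. One way to guard against that is to build the row from named measurements, as in the sketch below. `FEATURE_ORDER` and `to_feature_row` are hypothetical helpers, not part of the published model.

```python
# Column order the model expects, as listed in the Features table.
FEATURE_ORDER = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]


def to_feature_row(patient):
    """Turn a dict of named measurements into a row in model input order."""
    missing = [name for name in FEATURE_ORDER if name not in patient]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(patient[name]) for name in FEATURE_ORDER]


patient = {"AGE": 50, "SEX": 1, "BMI": 28.5, "BP": 90.0, "S1": 200,
           "S2": 120.5, "S3": 45.0, "S4": 4.5, "S5": 5.2, "S6": 95}
row = to_feature_row(patient)
# model.predict([row]) would then receive the values in the expected order.
```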

## 🔍 Key Insights

* S5 (log of serum triglycerides) has by far the largest coefficient (`65.8`). Coefficient size depends on each feature's units, so if the features were not standardized before fitting, this reflects scale as well as predictive strength.
* SEX and BMI have the 2nd and 3rd largest coefficients.
* Simplicity wins: Linear Regression outperforms the ensemble models on this small dataset (`n=442`). Simpler models often generalize better when data is limited.
* Stability: 5-fold cross-validation gives `CV R² = 0.48 ± 0.05`, so fold-to-fold variation is modest. The held-out test R² (0.39) is lower than the CV mean, so performance still depends noticeably on the split.
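The cross-validation figure is a mean and spread over five fold scores. The sketch below shows how 442 rows divide into five folds and how such a summary is computed. It assumes scikit-learn's `KFold` convention (the first `n % k` folds get one extra row) and a population standard deviation for the ± term; the card confirms neither. The fold scores in the example are hypothetical.

```python
import statistics


def kfold_sizes(n_samples, n_splits=5):
    """Fold sizes when the first n_samples % n_splits folds get one extra row."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


def summarize(scores):
    """Mean and population standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


sizes = kfold_sizes(442)
print(sizes)

# Hypothetical per-fold R² values, purely to show the calculation.
mean_r2, std_r2 = summarize([0.40, 0.50, 0.60])
print(f"{mean_r2:.2f} ± {std_r2:.2f}")
```

With fewer than 90 rows per validation fold, some fold-to-fold variation in R² is expected.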

## ⚖️ Feature Importance

Features ranked by the absolute value of their coefficients:

| Rank | Feature | Coef (Abs) | Impact |
| --- | --- | --- | --- |
| 1 | S5 | 65.807 | ⭐⭐⭐⭐⭐ |
| 2 | SEX | 18.445 | ⭐⭐⭐ |
| 3 | BMI | 6.246 | ⭐⭐ |
| 4 | S4 | 3.196 | ⭐ |
| 5 | BP | 0.938 | |
| 6 | S1 | 0.694 | |
| 7 | S2 | 0.378 | |
| 8 | S3 | 0.257 | |
| 9 | AGE | 0.191 | |
| 10 | S6 | 0.111 | |
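The ranking above can be reproduced by sorting coefficients by magnitude. The sketch below uses the absolute values from the table, since the card does not list signs. With a fitted scikit-learn `LinearRegression`, the same dictionary could presumably be built from its `coef_` attribute paired with the feature names; that step is an assumption about how the table was produced.

```python
# Absolute coefficient values copied from the table above; signs are not listed there.
abs_coefs = {
    "AGE": 0.191, "SEX": 18.445, "BMI": 6.246, "BP": 0.938, "S1": 0.694,
    "S2": 0.378, "S3": 0.257, "S4": 3.196, "S5": 65.807, "S6": 0.111,
}

# Sort by magnitude, largest first. abs() keeps this correct for signed inputs too.
ranking = sorted(abs_coefs.items(), key=lambda item: abs(item[1]), reverse=True)

for rank, (name, value) in enumerate(ranking, start=1):
    print(f"{rank:>2}  {name:<4} {value:.3f}")
```

Sorting raw coefficients compares features measured in different units, so standardizing the inputs first would give a fairer ranking.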

## ⚠️ Intended Use

* Primary: Educational demonstration of an ML workflow on Microsoft Fabric.
* Not intended for: Clinical decision-making without further validation.

## 🙌 Acknowledgments

* Microsoft Elevate and Dicoding, for organizing the Offline Workshop Training.
* Azure Open Datasets, for providing the Diabetes dataset.