🩺 Diabetes & Lifestyle -- Regression and Classification Machine Learning Project
🎥 **Presentation Video:**
My video presentation for Assignment 2
1. Project Overview
This project presents a full end-to-end machine learning pipeline applied to a large-scale Diabetes and Lifestyle dataset. The project was developed in two major stages:
- Regression: Predicting a continuous insulin level value.
- Classification: Converting the regression problem into a multi-class classification task for insulin risk prediction.
The workflow includes:
- Data cleaning
- Exploratory data analysis (EDA)
- Feature engineering
- Clustering
- Model training and evaluation
- Model selection
- Model export for deployment
2. Dataset Description
- Source: Kaggle -- Diabetes and Lifestyle Dataset
- Number of Rows: 10,000+
- Number of Features: 15+
- Target Variable: `insulin_level` (numeric)
- Feature Types:
  - Numerical: age, health indicators, physical measurements
  - Categorical: gender, ethnicity, education level, income, employment status, smoking status, diabetes stage, diabetes diagnosis
Research Question
Can lifestyle, demographic, and health-related features accurately predict insulin levels and classify individuals into meaningful insulin risk groups?
I have raised several questions that strike me as interesting topics:
- How does insulin_level differ between diabetes_stage groups?
- What is the relationship between Age and insulin_level?
- Does smoking_status relate to insulin_level?
- How does physical_activity_minutes_per_week relate to insulin_level?
3. Exploratory Data Analysis (EDA)
EDA was conducted in order to understand the structure, distribution, and relationships within the dataset.
Key Analysis Components
- Distribution analysis of insulin levels
- Correlation analysis between features
- Outlier detection using the IQR method
- Lifestyle comparisons using boxplots and scatter plots
Key Visualizations
- Insulin Level Distribution -- Histogram
- Insulin Level by Diabetes Stage -- Boxplot
- Age vs Insulin Level -- Scatter Plot
- Feature Correlation Heatmap
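The four visualizations above can be sketched as follows. This is a minimal, self-contained example on synthetic data; the column names `age`, `diabetes_stage`, and `insulin_level` are assumptions about the Kaggle dataset's schema and should be adjusted to the real CSV.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle dataframe (column names assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, 300),
    "diabetes_stage": rng.choice(["none", "pre", "type2"], 300),
    "insulin_level": rng.normal(15, 5, 300).clip(1),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Insulin level distribution -- histogram
axes[0, 0].hist(df["insulin_level"], bins=30)
axes[0, 0].set_title("Insulin Level Distribution")

# 2. Insulin level by diabetes stage -- boxplot
df.boxplot(column="insulin_level", by="diabetes_stage", ax=axes[0, 1])

# 3. Age vs insulin level -- scatter plot
axes[1, 0].scatter(df["age"], df["insulin_level"], s=8)
axes[1, 0].set_xlabel("age")
axes[1, 0].set_ylabel("insulin_level")

# 4. Feature correlation heatmap (numeric columns only)
corr = df.select_dtypes("number").corr()
im = axes[1, 1].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr)), corr.columns, rotation=45)
axes[1, 1].set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
fig.savefig("eda_overview.png")
```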
Categorical Columns
Here is what each categorical column's values and their distributions look like:
Numerical Columns
For each numeric column I checked the summary statistics (mean, median, standard deviation). Moreover, for each column I counted the outliers according to the IQR rule.
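The per-column IQR outlier check can be sketched like this (a minimal example on toy data; the 1.5×IQR fence is the standard choice):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy column: 95 is far outside the IQR fence and should be flagged.
df = pd.DataFrame({"insulin_level": [10, 12, 11, 13, 12, 95]})
stats = df["insulin_level"].agg(["mean", "median", "std"])
n_outliers = int(iqr_outliers(df["insulin_level"]).sum())
print(stats.round(2))
print(f"outliers: {n_outliers}")  # → outliers: 1
```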
Key Insights
- Insulin level increases significantly with diabetes stage.
- Age and lifestyle habits show meaningful correlations with insulin.
- Nonlinear relationships justify tree-based models.
4. Data Cleaning & Preprocessing
- No missing values appeared in the dataframe, so no imputation was needed.
- Categorical variables were label-encoded using scikit-learn's `LabelEncoder`.
- Numeric columns were feature-scaled using `MinMaxScaler`.
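A minimal sketch of this preprocessing step, on a toy frame (the column names here are assumptions, not the dataset's exact schema):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "smoking_status": ["never", "current", "former", "never"],
    "age": [25, 40, 61, 33],
    "insulin_level": [8.0, 14.5, 22.1, 11.3],
})

# Label-encode each categorical column; keep one encoder per column so the
# codes can be inverse-transformed later.
encoders = {}
for col in ["gender", "smoking_status"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Min-max scale the numeric columns to the [0, 1] range.
num_cols = ["age", "insulin_level"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```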
5. Part 3 -- Baseline Regression Model
A Linear Regression model was trained using all original features as a baseline.
As expected, the model is quite naive; it serves as our baseline.
Evaluation Metrics
- MAE
- MSE
- RMSE
- R²
Feature importance was analyzed using the model coefficients.
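The baseline fit and the four metrics above can be sketched as follows, using synthetic data in place of the cleaned dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R²={r2:.3f}")

# Feature importance from the coefficients: larger |coef| means a stronger
# linear effect on the target.
importance = sorted(enumerate(model.coef_), key=lambda t: -abs(t[1]))
```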
6. Part 4 -- Feature Engineering & Clustering
Feature Engineering
Here are four features I constructed from scratch:
New Feature #1:
New Feature #2:
New Feature #3:
New Feature #4:
After applying KMeans (k=3), I created two more features:
New Feature #5:
cluster_id
New Feature #6:
cluster_distance_min
- Clusters: 3
- PCA for visualization
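The two cluster-derived features and the PCA step can be sketched like this (random data stands in for the engineered feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))  # stand-in for the scaled feature matrix

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# New Feature #5: the cluster each row was assigned to.
cluster_id = km.labels_

# New Feature #6: distance to the nearest centroid (min over the 3 clusters);
# km.transform gives each row's distance to every centroid.
cluster_distance_min = km.transform(X).min(axis=1)

# PCA down to 2 components, used only to plot the clusters.
X_2d = PCA(n_components=2).fit_transform(X)
print(cluster_id[:5], cluster_distance_min[:5].round(2), X_2d.shape)
```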
7. Part 5 -- Improved Regression Models
| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Linear Regression (FE) | – | – | – | – |
| Random Forest Regressor | – | – | – | – |
| Decision Tree Regressor | – | – | – | – |
✅ **Winning Regression Model:** Random Forest Regressor
Exported as winning_model.pkl
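A minimal sketch of the export step with `joblib` (a common choice for pickling scikit-learn models; whether the project used `joblib` or the `pickle` module directly is an assumption — only the filename `winning_model.pkl` comes from the README):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a small stand-in model on random data.
rng = np.random.default_rng(0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(rng.normal(size=(100, 4)), rng.normal(size=100))

# Dump to disk, then reload and predict as a sanity check.
joblib.dump(model, "winning_model.pkl")
reloaded = joblib.load("winning_model.pkl")
print(reloaded.predict(rng.normal(size=(1, 4))))
```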
8. Part 7 -- Regression to Classification
Insulin levels were converted into three classes using quantile binning:
- Low
- Medium
- High
Macro F1-score was chosen due to class imbalance.
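Quantile binning into three (near-)balanced classes is a one-liner with `pd.qcut`, which splits at the 33rd and 66th percentiles:

```python
import numpy as np
import pandas as pd

# Synthetic insulin values standing in for the real target column.
rng = np.random.default_rng(1)
insulin = pd.Series(rng.normal(15, 5, 900))

# Three quantile bins -> three classes of equal size.
risk = pd.qcut(insulin, q=3, labels=["Low", "Medium", "High"])
print(risk.value_counts())
```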
9. Part 8 -- Classification Models
- Logistic Regression
- Random Forest Classifier
- KNN
Here is the confusion matrix I got:
The metric I used here was accuracy.
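The model comparison above can be sketched as follows, reporting both accuracy and the macro F1 chosen earlier, plus a confusion matrix (synthetic data stands in for the binned dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Three-class synthetic data standing in for the Low/Medium/High bins.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    acc = accuracy_score(y_te, pred)
    macro_f1 = f1_score(y_te, pred, average="macro")
    print(f"{name}: accuracy={acc:.3f} macro-F1={macro_f1:.3f}")

# Confusion matrix for the winning model (rows = true class, cols = predicted).
best_pred = models["Logistic Regression"].predict(X_te)
print(confusion_matrix(y_te, best_pred))
```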
✅ **Winning Classification Model:** Logistic Regression
Exported as winning_classification_model.pkl