🩺 Diabetes & Lifestyle -- Regression and Classification Machine Learning Project

🎥 **Presentation Video:**
My video presentation - for assignment2

1. Project Overview

This project presents a full end-to-end machine learning pipeline applied to a large-scale Diabetes and Lifestyle dataset. The project was developed in two major stages:

  • Regression: Predicting a continuous insulin level value.
  • Classification: Converting the regression problem into a multi-class classification task for insulin risk prediction.

The workflow includes:

  • Data cleaning
  • Exploratory data analysis (EDA)
  • Feature engineering
  • Clustering
  • Model training and evaluation
  • Model selection
  • Model export for deployment

2. Dataset Description

  • Source: Kaggle -- Diabetes and Lifestyle Dataset
  • Number of Rows: 10,000+
  • Number of Features: 15+
  • Target Variable: insulin_level (numeric)
  • Feature Types:
    • Numerical: Age, health indicators, physical measurements
    • Categorical: Gender, ethnicity, education level, income, employment status, smoking status, diabetes stage, diabetes diagnosis

Research Question

Can lifestyle, demographic, and health-related features accurately predict insulin levels and classify individuals into meaningful insulin risk groups?

I raised several questions that struck me as interesting:

  • How does insulin_level differ between diabetes_stage groups?
  • What is the relationship between Age and insulin_level?
  • Does smoking_status relate to insulin_level?
  • How does physical_activity_minutes_per_week relate to insulin_level?

3. Exploratory Data Analysis (EDA)

EDA was conducted to understand the structure, distributions, and relationships within the dataset.

Key Analysis Components

  • Distribution analysis of insulin levels
  • Correlation analysis between features
  • Outlier detection using the IQR method
  • Lifestyle comparisons using boxplots and scatter plots

Key Visualizations

  • Insulin Level Distribution -- Histogram
  • Insulin Level by Diabetes Stage -- Boxplot
  • Age vs Insulin Level -- Scatter Plot
  • Feature Correlation Heatmap

Categorical Columns

Here are the values of the categorical columns and their distributions:

Numerical Columns

For each numeric column I checked the summary statistics (mean, median, and standard deviation) and collected the outliers identified by the IQR method.
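The per-column check described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual notebook code: the helper name `iqr_outlier_counts`, the multiplier `k = 1.5`, and the toy values are assumptions.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return mask.sum()

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 120],
    "insulin_level": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
})

summary = toy.agg(["mean", "median", "std"])  # one row per statistic
outliers = iqr_outlier_counts(toy)            # one count per column
```

In this toy frame only the `Age` value of 120 falls outside the IQR fences.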

Key Insights

  • Insulin level increases significantly with diabetes stage.
  • Age and lifestyle habits show meaningful correlations with insulin.
  • Nonlinear relationships justify tree-based models.

4. Data Cleaning & Preprocessing

  • No missing values appeared in the DataFrame, so no imputation was needed.
  • Categorical variables were label-encoded using scikit-learn's LabelEncoder.
  • Numeric columns were scaled using MinMaxScaler.
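A minimal sketch of these two preprocessing steps, assuming a toy frame in place of the real data. The column name `gender` and all values are hypothetical; `smoking_status` and `Age` follow the names used elsewhere in this card.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "smoking_status": ["never", "current", "former", "never"],
    "Age": [20.0, 40.0, 60.0, 80.0],
})

encoded = toy.copy()
encoders = {}
for col in ["gender", "smoking_status"]:
    enc = LabelEncoder()
    # Each category is mapped to an integer code (classes are sorted).
    encoded[col] = enc.fit_transform(encoded[col])
    encoders[col] = enc  # kept so the mapping can be inverted later

scaler = MinMaxScaler()
# Rescale the numeric column to the [0, 1] range.
encoded[["Age"]] = scaler.fit_transform(encoded[["Age"]])
```

Keeping the fitted encoders and scaler around is a design choice that lets the same transformations be applied to new data at prediction time. Note that label encoding assigns integer codes in sorted order, which imposes an ordering on categories that linear models will treat as meaningful.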

5. Part 3 -- Baseline Regression Model

A Linear Regression model was trained using all original features as a baseline.

As expected, this model is quite naive; it serves as our baseline.

Evaluation Metrics

  • MAE
  • MSE
  • RMSE
  • R²

Feature importance was analyzed using the model coefficients.
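The baseline evaluation can be sketched along these lines. The data here is synthetic and the split ratio and random seeds are assumptions, so the numbers it produces say nothing about the project's real results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target standing in for the real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)

# With features on a comparable scale, coefficient magnitudes give a
# rough ranking of feature influence in a linear model.
importance = np.abs(model.coef_)
```

Coefficient magnitudes are only comparable when the features share a scale, which is one reason scaling before fitting the baseline matters.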


6. Part 4 -- Feature Engineering & Clustering

Feature Engineering

Here are four features I constructed from scratch:

New Feature #1:

New Feature #2:

New Feature #3:

New Feature #4:

After applying K-Means (3 clusters), I created two more features:

New Feature #5:

cluster_id

New Feature #6:

cluster_distance_min

  • Clusters: 3
  • PCA for visualization
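A rough sketch of how the two cluster-based features could be derived. It assumes that `cluster_distance_min` means the distance from each row to its nearest cluster centroid, which is inferred from the feature name rather than stated in this card. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic, already-scaled feature matrix with three loose groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(50, 4))
    for center in (0.0, 3.0, 6.0)
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Feature #5: the cluster each row was assigned to.
cluster_id = kmeans.labels_

# Feature #6 (assumed meaning): distance to the nearest centroid.
# KMeans.transform returns the distance from each row to every centroid.
cluster_distance_min = kmeans.transform(X).min(axis=1)

# Two principal components, used only for plotting the clusters.
coords_2d = PCA(n_components=2).fit_transform(X)
```

The 2D coordinates can then be passed to a scatter plot colored by `cluster_id`.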


7. Part 5 -- Improved Regression Models

| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Linear Regression (FE) | ✔ | ✔ | ✔ | ✔ |
| Random Forest Regressor | ✔ | ✔ | ✔ | ✔ |
| Decision Tree Regressor | ✔ | ✔ | ✔ | ✔ |

✅ Winning Regression Model: Random Forest Regressor
Exported as winning_model.pkl
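Exporting and reloading a fitted model might look like the sketch below. It assumes the standard-library `pickle` module was used, which the `.pkl` extension suggests but this card does not confirm. The hyperparameters and data are placeholders, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the engineered feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "winning_model.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)         # serialize the fitted model
    with open(path, "rb") as f:
        restored = pickle.load(f)     # load it back for inference

# The restored model should reproduce the original predictions.
same_predictions = bool(np.allclose(model.predict(X), restored.predict(X)))
```

A pickled model should be loaded with the same scikit-learn version it was saved with, and only from a trusted source, since unpickling can execute arbitrary code.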


8. Part 7 -- Regression to Classification

Insulin levels were converted into three classes using quantile binning:

  • Low
  • Medium
  • High

Macro F1-score was chosen due to class imbalance.
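Quantile binning of the regression target can be sketched with `pandas.qcut`. This is an assumption about the tooling, since the card does not name the function used, and the insulin values below are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed values standing in for insulin_level.
rng = np.random.default_rng(0)
insulin_level = pd.Series(
    rng.gamma(shape=2.0, scale=5.0, size=300), name="insulin_level"
)

# q=3 places the cut points at the 1/3 and 2/3 quantiles.
insulin_class = pd.qcut(insulin_level, q=3, labels=["Low", "Medium", "High"])

counts = insulin_class.value_counts()
```

With `q=3` the cut points are the tertiles, so the three classes come out close to equal in size when the values are mostly distinct.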


9. Part 8 -- Classification Models

  • Logistic Regression
  • Random Forest Classifier
  • KNN

Here is the confusion matrix I got:

The metric I used was accuracy.

✅ **Winning Classification Model:** Logistic Regression
Exported as winning_classification_model.pkl
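A comparison of the three classifiers could be set up roughly as follows. The data is a synthetic three-class problem and every hyperparameter shown is an assumption, so the scores do not reflect the project's results. Both accuracy and macro F1 are computed here, since both are mentioned in this card.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the binned insulin target.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(
        n_estimators=100, random_state=42
    ),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, clf in models.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Macro F1 averages the per-class F1 scores with equal weight.
        "macro_f1": f1_score(y_test, pred, average="macro"),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
```

The winning model can then be picked by whichever metric is treated as primary, for example `max(results, key=lambda n: results[n]["accuracy"])`.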

