🩺 Diabetes & Lifestyle -- Regression and Classification Machine Learning Project
🎥 **Presentation Video:**
My video presentation for Assignment 2
1. Project Overview
This project presents a full end-to-end machine learning pipeline applied to a large-scale Diabetes and Lifestyle dataset. The project was developed in two major stages:
- Regression: Predicting a continuous insulin level value.
- Classification: Converting the regression problem into a multi-class classification task for insulin risk prediction.
The workflow includes:
- Data cleaning
- Exploratory data analysis (EDA)
- Feature engineering
- Clustering
- Model training and evaluation
- Model selection
- Model export for deployment
2. Dataset Description
- Source: Kaggle -- Diabetes and Lifestyle Dataset
- Number of Rows: 10,000+
- Number of Features: 15+
- Target Variable: `insulin_level` (numeric)
- Feature Types:
  - Numerical: age, health indicators, physical measurements
  - Categorical: gender, ethnicity, education level, income, employment status, smoking status, diabetes stage, diabetes diagnosis
Research Question
Can lifestyle, demographic, and health-related features accurately predict insulin levels and classify individuals into meaningful insulin risk groups?
I have raised several questions that strike me as interesting topics:
- How does insulin_level differ between diabetes_stage groups?
- What is the relationship between Age and insulin_level?
- Does smoking_status relate to insulin_level?
- How does physical_activity_minutes_per_week relate to insulin_level?
3. Exploratory Data Analysis (EDA)
EDA was conducted in order to understand the structure, distribution, and relationships within the dataset.
Key Analysis Components
- Distribution analysis of insulin levels
- Correlation analysis between features
- Outlier detection using the IQR method
- Lifestyle comparisons using boxplots and scatter plots
Key Visualizations
- Insulin Level Distribution -- Histogram
- Insulin Level by Diabetes Stage -- Boxplot
- Age vs Insulin Level -- Scatter Plot
- Feature Correlation Heatmap
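The four visualizations above can be sketched as follows. This is a minimal, self-contained example on synthetic data; the column names `age`, `diabetes_stage`, and `insulin_level` are assumptions about the Kaggle dataset's schema and should be adjusted to the real CSV.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle dataframe (column names assumed).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 80, 300),
    "diabetes_stage": rng.choice(["none", "pre", "type2"], 300),
    "insulin_level": rng.normal(15, 5, 300).clip(1),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# 1. Insulin level distribution -- histogram
axes[0, 0].hist(df["insulin_level"], bins=30)
axes[0, 0].set_title("Insulin Level Distribution")

# 2. Insulin level by diabetes stage -- boxplot
df.boxplot(column="insulin_level", by="diabetes_stage", ax=axes[0, 1])

# 3. Age vs insulin level -- scatter plot
axes[1, 0].scatter(df["age"], df["insulin_level"], s=8)
axes[1, 0].set_xlabel("age")
axes[1, 0].set_ylabel("insulin_level")

# 4. Feature correlation heatmap (numeric columns only)
corr = df.select_dtypes("number").corr()
im = axes[1, 1].imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(corr)), corr.columns, rotation=45)
axes[1, 1].set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
fig.savefig("eda_overview.png")
```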
Categorical Columns
Here is what each categorical column's values and their distributions look like:
Numerical Columns
For each numeric column I checked the summary statistics (mean, median, standard deviation). Moreover, for each column I counted the outliers according to the IQR rule.
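The per-column IQR outlier check can be sketched like this (a minimal example on toy data; the 1.5×IQR fence is the standard choice):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy column: 95 is far outside the IQR fence and should be flagged.
df = pd.DataFrame({"insulin_level": [10, 12, 11, 13, 12, 95]})
stats = df["insulin_level"].agg(["mean", "median", "std"])
n_outliers = int(iqr_outliers(df["insulin_level"]).sum())
print(stats.round(2))
print(f"outliers: {n_outliers}")  # → outliers: 1
```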
Key Insights
- Insulin level increases significantly with diabetes stage.
- Age and lifestyle habits show meaningful correlations with insulin.
- Nonlinear relationships justify tree-based models.
4. Data Cleaning & Preprocessing
- No missing values appeared in the dataframe, so no imputation was needed.
- Categorical variables were label-encoded using scikit-learn's `LabelEncoder`.
- Numeric columns were feature-scaled using `MinMaxScaler`.
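A minimal sketch of this preprocessing step, on a toy frame (the column names here are assumptions, not the dataset's exact schema):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "smoking_status": ["never", "current", "former", "never"],
    "age": [25, 40, 61, 33],
    "insulin_level": [8.0, 14.5, 22.1, 11.3],
})

# Label-encode each categorical column; keep one encoder per column so the
# codes can be inverse-transformed later.
encoders = {}
for col in ["gender", "smoking_status"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Min-max scale the numeric columns to the [0, 1] range.
num_cols = ["age", "insulin_level"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```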
5. Part 3 -- Baseline Regression Model
A Linear Regression model was trained using all original features as a baseline.
As expected, the model is quite naive; it serves as our baseline.
Evaluation Metrics
- MAE
- MSE
- RMSE
- R²
Feature importance was analyzed using the model coefficients.
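The baseline fit and the four metrics above can be sketched as follows, using synthetic data in place of the cleaned dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the preprocessed features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R²={r2:.3f}")

# Feature importance from the coefficients: larger |coef| means a stronger
# linear effect on the target.
importance = sorted(enumerate(model.coef_), key=lambda t: -abs(t[1]))
```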
6. Part 4 -- Feature Engineering & Clustering
Feature Engineering
Here are four features I constructed from scratch:
New Feature #1:
New Feature #2:
New Feature #3:
New Feature #4:
After applying KMeans (k=3), I created two more features:
New Feature #5:
cluster_id
New Feature #6:
cluster_distance_min
- Clusters: 3
- PCA for visualization
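The two cluster-derived features and the PCA step can be sketched like this (random data stands in for the engineered feature matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))  # stand-in for the scaled feature matrix

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# New Feature #5: the cluster each row was assigned to.
cluster_id = km.labels_

# New Feature #6: distance to the nearest centroid (min over the 3 clusters);
# km.transform gives each row's distance to every centroid.
cluster_distance_min = km.transform(X).min(axis=1)

# PCA down to 2 components, used only to plot the clusters.
X_2d = PCA(n_components=2).fit_transform(X)
print(cluster_id[:5], cluster_distance_min[:5].round(2), X_2d.shape)
```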
7. Part 5 -- Improved Regression Models
| Model | MAE | MSE | RMSE | R² |
|---|---|---|---|---|
| Linear Regression (FE) | – | – | – | – |
| Random Forest Regressor | – | – | – | – |
| Decision Tree Regressor | – | – | – | – |
✅ **Winning Regression Model:** Random Forest Regressor
Exported as winning_model.pkl
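A minimal sketch of the export step with `joblib` (a common choice for pickling scikit-learn models; whether the project used `joblib` or the `pickle` module directly is an assumption — only the filename `winning_model.pkl` comes from the README):

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a small stand-in model on random data.
rng = np.random.default_rng(0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(rng.normal(size=(100, 4)), rng.normal(size=100))

# Dump to disk, then reload and predict as a sanity check.
joblib.dump(model, "winning_model.pkl")
reloaded = joblib.load("winning_model.pkl")
print(reloaded.predict(rng.normal(size=(1, 4))))
```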
8. Part 7 -- Regression to Classification
Insulin levels were converted into three classes using quantile binning:
- Low
- Medium
- High
Macro F1-score was chosen due to class imbalance.
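Quantile binning into three (near-)balanced classes is a one-liner with `pd.qcut`, which splits at the 33rd and 66th percentiles:

```python
import numpy as np
import pandas as pd

# Synthetic insulin values standing in for the real target column.
rng = np.random.default_rng(1)
insulin = pd.Series(rng.normal(15, 5, 900))

# Three quantile bins -> three classes of equal size.
risk = pd.qcut(insulin, q=3, labels=["Low", "Medium", "High"])
print(risk.value_counts())
```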
9. Part 8 -- Classification Models
- Logistic Regression
- Random Forest Classifier
- KNN
Here is the confusion matrix I got:
The metric I used here was accuracy.
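The model comparison above can be sketched as follows, reporting both accuracy and the macro F1 chosen earlier, plus a confusion matrix (synthetic data stands in for the binned dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Three-class synthetic data standing in for the Low/Medium/High bins.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    acc = accuracy_score(y_te, pred)
    macro_f1 = f1_score(y_te, pred, average="macro")
    print(f"{name}: accuracy={acc:.3f} macro-F1={macro_f1:.3f}")

# Confusion matrix for the winning model (rows = true class, cols = predicted).
best_pred = models["Logistic Regression"].predict(X_te)
print(confusion_matrix(y_te, best_pred))
```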
✅ **Winning Classification Model:** Logistic Regression
Exported as winning_classification_model.pkl