Update README.md
README.md
CHANGED
|
@@ -31,7 +31,7 @@ Classification: Categorizing a workout as Low, Medium, or High intensity.
|
|
| 31 |
|
| 32 |
---
|
| 33 |
|
| 34 |
-
**Part 2: Exploratory Data Analysis (EDA)**
|
| 35 |
|
| 36 |
**Data Cleaning:** There were no missing values or duplicates, allowing for robust modeling without data imputation.
|
| 37 |
|
|
@@ -87,7 +87,9 @@ Secondary Predictors: Workout_Frequency, Workout_Type, Experience_Level.
|
|
| 87 |
Surprisingly, context variables such as Age, Weight, Height, and BMI did not show strong correlations.
|
| 88 |
Null: Diet_Type (confirmed as non-predictive, but kept for context).
|
| 89 |
|
| 90 |
-
|
|
|
|
|
|
|
| 91 |
|
| 92 |
**1. Feature Selection & Preprocessing**
|
| 93 |
Based on the insights gathered during the EDA phase, I selected a comprehensive set of features to predict Calories_Burned:
|
|
@@ -138,7 +140,9 @@ Secondary Drivers: Experience_Level, Weight, and BMI had meaningful but smaller
|
|
| 138 |
|
| 139 |
In conclusion, Duration and Activity Type are the dominant predictors, validating the initial EDA findings.
|
| 140 |
|
| 141 |
-
|
|
|
|
|
|
|
| 142 |
|
| 143 |
While the baseline Linear Regression model performed well ($R^2 \approx 0.967$), the residual analysis revealed curved patterns, suggesting that the relationship between predictors and calorie burn is non-linear. To address this and better capture complex workout behaviors, I engineered six new features before training advanced models.
|
| 144 |
|
|
@@ -210,7 +214,9 @@ Physique_Dist (numeric)
|
|
| 210 |
|
| 211 |
Distance from each user to their cluster centroid, indicating how typical or atypical they are within their body type.
|
| 212 |
|
| 213 |
-
|
|
|
|
|
|
|
| 214 |
|
| 215 |
After building the baseline regression model, the next step was to retrain and compare multiple models using the fully engineered feature set created in Part 4. This improved dataset included:
|
| 216 |
|
|
@@ -244,13 +250,15 @@ To understand how the Random Forest makes predictions, I examined its top featur
|
|
| 244 |
|
| 245 |
Session_Duration and its engineered non-linear variant (Session_Duration_sq) were the strongest predictors. Workout_Type (specifically Yoga and HIIT) played a major role, confirming that different activities have distinct metabolic profiles encoded in the data. The engineered feature Fitness_Maturity was also influential, suggesting that combining demographic and behavioral data improves accuracy. The Heart Rate Anomaly: heart-rate features showed relatively low importance compared to duration. This is consistent with my earlier EDA finding that this dataset prioritizes Duration and Activity Type over physiological features such as heart rate, BMI, and Weight.
|
| 246 |
|
| 247 |
-
|
| 248 |
|
| 249 |
-
|
| 250 |
|
|
|
|
| 251 |
|
|
|
|
| 252 |
|
| 253 |
-
**Part 7: Regression to Classification**
|
| 254 |
|
| 255 |
7.1 Converting the Regression Target into Classes
|
| 256 |
|
|
@@ -302,7 +310,9 @@ Accuracy is a reliable metric for evaluating model performance
|
|
| 302 |
|
| 303 |
I will still examine Precision, Recall, and F1-scores to confirm the model performs consistently for all classes (e.g., ensuring Medium is not confused with High).
|
| 304 |
|
| 305 |
-
|
|
|
|
|
|
|
| 306 |
**8.1: Precision vs Recall + Error Type Analysis**
|
| 307 |
|
| 308 |
In this task, the goal is to classify workouts into Low, Medium, or High calorie burn.
|
|
@@ -382,7 +392,9 @@ The trained Random Forest pipeline, including the scaler, clustering model, thre
|
|
| 382 |
|
| 383 |
[Download Winning Classification Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_classification_pipeline.pkl)
|
| 384 |
|
| 385 |
-
|
|
|
|
|
|
|
| 386 |
|
| 387 |
The analysis identified the Random Forest algorithm as the superior model, achieving near-perfect performance for both regression ($R^2 = 0.9999$) and classification (99.88% accuracy). While these metrics demonstrate exceptional predictive power, the remarkably high accuracy, combined with the unexpectedly low feature importance of Heart Rate and Weight, suggests the underlying dataset is likely synthetic. In real-world physiology, heart rate and body mass are critical drivers of energy expenditure; their low importance here suggests the data was likely generated using a deterministic formula heavily weighted toward Duration and Activity Type.
|
| 388 |
|
|
|
|
| 31 |
|
| 32 |
---
|
| 33 |
|
| 34 |
+
# **Part 2: Exploratory Data Analysis (EDA)**
|
| 35 |
|
| 36 |
**Data Cleaning:** There were no missing values or duplicates, allowing for robust modeling without data imputation.
|
| 37 |
|
|
|
|
| 87 |
Surprisingly, context variables such as Age, Weight, Height, and BMI did not show strong correlations.
|
| 88 |
Null: Diet_Type (confirmed as non-predictive, but kept for context).
|
| 89 |
|
| 90 |
+
---
|
| 91 |
+
|
| 92 |
+
# **Part 3: Baseline Model**
|
| 93 |
|
| 94 |
**1. Feature Selection & Preprocessing**
|
| 95 |
Based on the insights gathered during the EDA phase, I selected a comprehensive set of features to predict Calories_Burned:
|
|
|
|
| 140 |
|
| 141 |
In conclusion, Duration and Activity Type are the dominant predictors, validating the initial EDA findings.
|
| 142 |
|
| 143 |
+
---
|
| 144 |
+
|
| 145 |
+
# **Part 4: Feature Engineering**
|
| 146 |
|
| 147 |
While the baseline Linear Regression model performed well ($R^2 \approx 0.967$), the residual analysis revealed curved patterns, suggesting that the relationship between predictors and calorie burn is non-linear. To address this and better capture complex workout behaviors, I engineered six new features before training advanced models.
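As a rough illustration of why curved residuals motivate a squared term such as Session_Duration_sq, the sketch below fits a straight line to toy data and inspects the residuals. The data, the generating formula, and the helper `fit_line` are all hypothetical and are not taken from the project's notebook.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical toy data: calories grow slightly faster than linearly with duration.
durations = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]      # hours (made up)
calories = [300 * d + 80 * d ** 2 for d in durations]   # made-up formula

slope, intercept = fit_line(durations, calories)
residuals = [c - (slope * d + intercept) for d, c in zip(durations, calories)]

# With a purely linear fit the residuals bend: positive at both ends of the
# duration range and negative in the middle. That is the "curved pattern".
# Adding the squared term gives a linear model a column that can absorb it.
session_duration_sq = [d ** 2 for d in durations]
```

On real data the pattern is noisier, so a residual plot against the predictor is the usual way to spot it.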
|
| 148 |
|
|
|
|
| 214 |
|
| 215 |
Distance from each user to their cluster centroid, indicating how typical or atypical they are within their body type.
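A minimal sketch of how a centroid-distance feature like Physique_Dist could be computed, assuming the cluster centroids have already been fitted on scaled body measurements. The function name `physique_features`, the centroid values, and the two-dimensional points are hypothetical; the actual clustering model in the pipeline may differ.

```python
import math


def physique_features(point, centroids):
    """Return (index of the nearest centroid, Euclidean distance to it)."""
    distances = [math.dist(point, c) for c in centroids]
    cluster = min(range(len(centroids)), key=distances.__getitem__)
    return cluster, distances[cluster]


# Hypothetical centroids in scaled feature space (not the project's values).
centroids = [(0.0, 0.0), (3.0, 4.0)]

# A user close to a centroid is "typical" for that body type; a large
# distance marks an atypical user within the same cluster.
cluster, physique_dist = physique_features((3.0, 1.0), centroids)
```

Because Euclidean distance is sensitive to units, this only makes sense when the inputs are on a common scale, which is one reason to keep the scaler and the clustering model in the same pipeline.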
|
| 216 |
|
| 217 |
+
---
|
| 218 |
+
|
| 219 |
+
# **Part 5: Train and Evaluate Three Improved Models**
|
| 220 |
|
| 221 |
After building the baseline regression model, the next step was to retrain and compare multiple models using the fully engineered feature set created in Part 4. This improved dataset included:
|
| 222 |
|
|
|
|
| 250 |
|
| 251 |
Session_Duration and its engineered non-linear variant (Session_Duration_sq) were the strongest predictors. Workout_Type (specifically Yoga and HIIT) played a major role, confirming that different activities have distinct metabolic profiles encoded in the data. The engineered feature Fitness_Maturity was also influential, suggesting that combining demographic and behavioral data improves accuracy. The Heart Rate Anomaly: heart-rate features showed relatively low importance compared to duration. This is consistent with my earlier EDA finding that this dataset prioritizes Duration and Activity Type over physiological features such as heart rate, BMI, and Weight.
|
| 252 |
|
| 253 |
+
---
|
| 254 |
|
| 255 |
+
# **Part 6: Pickle file of winning model**
|
| 256 |
|
| 257 |
+
[Download Winning Regression Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_regression_pipeline.pkl)
|
| 258 |
|
| 259 |
+
---
|
| 260 |
|
| 261 |
+
# **Part 7: Regression to Classification**
|
| 262 |
|
| 263 |
7.1 Converting the Regression Target into Classes
|
| 264 |
|
|
|
|
| 310 |
|
| 311 |
I will still examine Precision, Recall, and F1-scores to confirm the model performs consistently for all classes (e.g., ensuring Medium is not confused with High).
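To make the per-class check concrete, here is a small dependency-free sketch that computes Precision, Recall, and F1 for each of the Low, Medium, and High classes. The helper `per_class_report` and the label lists are hypothetical illustrations, not the project's evaluation code or results.

```python
def per_class_report(y_true, y_pred, labels):
    """Compute precision, recall and F1 for each label (one-vs-rest)."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up predictions in which one Medium workout is mistaken for High.
y_true = ["Low", "Low", "Medium", "Medium", "High", "High"]
y_pred = ["Low", "Low", "Medium", "High", "High", "High"]
report = per_class_report(y_true, y_pred, ["Low", "Medium", "High"])
```

In this toy case overall accuracy is 5 out of 6, but the single Medium-to-High confusion shows up as reduced recall for Medium and reduced precision for High, which is the kind of class-specific error that accuracy alone hides.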
|
| 312 |
|
| 313 |
+
---
|
| 314 |
+
|
| 315 |
+
# **Part 8: Train & Evaluate Classification Models**
|
| 316 |
**8.1: Precision vs Recall + Error Type Analysis**
|
| 317 |
|
| 318 |
In this task, the goal is to classify workouts into Low, Medium, or High calorie burn.
|
|
|
|
| 392 |
|
| 393 |
[Download Winning Classification Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_classification_pipeline.pkl)
|
| 394 |
|
| 395 |
+
---
|
| 396 |
+
|
| 397 |
+
# **Conclusion**
|
| 398 |
|
| 399 |
The analysis identified the Random Forest algorithm as the superior model, achieving near-perfect performance for both regression ($R^2 = 0.9999$) and classification (99.88% accuracy). While these metrics demonstrate exceptional predictive power, the remarkably high accuracy, combined with the unexpectedly low feature importance of Heart Rate and Weight, suggests the underlying dataset is likely synthetic. In real-world physiology, heart rate and body mass are critical drivers of energy expenditure; their low importance here suggests the data was likely generated using a deterministic formula heavily weighted toward Duration and Activity Type.
|
| 400 |
|