uleeberber committed on
Commit
6658224
·
verified ·
1 Parent(s): 657eeef

Update README.md

  1. README.md +21 -9
README.md CHANGED
@@ -31,7 +31,7 @@ Classification: Categorizing a workout as Low, Medium, or High intensity.
31
 
32
  ---
33
 
34
- **Part 2: Exploratory Data Analysis (EDA)**
35
 
36
  **Data Cleaning:** There were no missing values or duplicates, allowing for robust modeling without data imputation.
37
 
@@ -87,7 +87,9 @@ Secondary Predictors: Workout_Frequency, Workout_Type, Experience_Level.
87
  Surprisingly, context variables such as Age, Weight, Height, and BMI did not show strong correlations.
88
  Null: Diet_Type (confirmed as non-predictive, but kept for context).
89
 
90
- **Part 3: Baseline Model**
 
 
91
 
92
  **1. Feature Selection & Preprocessing**
93
  Based on the insights gathered during the EDA phase, I selected a comprehensive set of features to predict Calories_Burned:
@@ -138,7 +140,9 @@ Secondary Drivers: Experience_Level, Weight, and BMI had meaningful but smaller
138
 
139
  In conclusion, Duration and Activity Type are the dominant predictors, validating the initial EDA findings.
140
 
141
- **Part 4: Feature Engineering**
 
 
142
 
143
  While the baseline Linear Regression model performed well ($R^2 \approx 0.967$), the residual analysis revealed curved patterns, suggesting that the relationship between predictors and calorie burn is non-linear. To address this and capture complex workout behaviors better, I engineered 6 new features before training advanced models.
144
 
@@ -210,7 +214,9 @@ Physique_Dist (numeric)
210
 
211
  Distance from each user to their cluster centroid, indicating how typical or atypical they are within their body type.
212
 
213
- **Part 5:Train and Evaluate Three Improved Models**
 
 
214
 
215
  After building the baseline regression model, the next step was to retrain and compare multiple models using the fully engineered feature set created in Part 4. This improved dataset included:
216
 
@@ -244,13 +250,15 @@ To understand how the Random Forest makes predictions, I examined its top featur
244
 
245
  Session_Duration and its engineered non-linear variant (Session_Duration_sq) were the strongest predictors. Workout_Type (specifically Yoga and HIIT) played a major role, confirming that different activities have distinct metabolic profiles encoded in the data. The engineered feature Fitness_Maturity was also influential, suggesting that combining demographic and behavioral data improves accuracy. Heart-rate anomaly: heart-rate features showed relatively low importance compared to duration, which is consistent with my earlier EDA finding that this dataset prioritizes Duration and Activity Type over physiological features such as heart rate, BMI, and Weight.
246
 
247
- **Part 6: Pickle file of winning model**
248
 
249
- [Download Winning Regression Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_regression_pipeline.pkl)
250
 
 
251
 
 
252
 
253
- **Part 7: Regression to Classification**
254
 
255
  7.1 Converting the Regression Target into Classes
256
 
@@ -302,7 +310,9 @@ Accuracy is a reliable metric for evaluating model performance
302
 
303
  I will still examine Precision, Recall, and F1-scores to confirm the model performs consistently for all classes (e.g., ensuring Medium is not confused with High).
304
 
305
- **Part 8: Train & Evaluate Classification Models**
 
 
306
  **8.1: Precision vs Recall + Error Type Analysis**
307
 
308
  In this task, the goal is to classify workouts into Low, Medium, or High calorie burn.
@@ -382,7 +392,9 @@ The trained Random Forest pipeline, including the scaler, clustering model, thre
382
 
383
  [Download Winning Classification Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_classification_pipeline.pkl)
384
 
385
- **Conclusion**
 
 
386
 
387
  The analysis identified the Random Forest algorithm as the superior model, achieving near-perfect performance for both regression ($R^2 = 0.9999$) and classification (99.88% accuracy). While these metrics demonstrate exceptional predictive power, the remarkably high accuracy, combined with the unexpectedly low feature importance of Heart Rate and Weight, suggests the underlying dataset is likely synthetic. In real-world physiology, heart rate and body mass are critical drivers of energy expenditure; their lower correlation here indicates the data was likely generated using a deterministic formula heavily weighted toward Duration and Activity Type.
388
 
 
31
 
32
  ---
33
 
34
+ # **Part 2: Exploratory Data Analysis (EDA)**
35
 
36
  **Data Cleaning:** There were no missing values or duplicates, allowing for robust modeling without data imputation.
37
 
 
87
  Surprisingly, context variables such as Age, Weight, Height, and BMI did not show strong correlations.
88
  Null: Diet_Type (confirmed as non-predictive, but kept for context).
89
 
90
+ ---
91
+
92
+ # **Part 3: Baseline Model**
93
 
94
  **1. Feature Selection & Preprocessing**
95
  Based on the insights gathered during the EDA phase, I selected a comprehensive set of features to predict Calories_Burned:
 
140
 
141
  In conclusion, Duration and Activity Type are the dominant predictors, validating the initial EDA findings.
142
 
143
+ ---
144
+
145
+ # **Part 4: Feature Engineering**
146
 
147
  While the baseline Linear Regression model performed well ($R^2 \approx 0.967$), the residual analysis revealed curved patterns, suggesting that the relationship between predictors and calorie burn is non-linear. To address this and capture complex workout behaviors better, I engineered 6 new features before training advanced models.
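A curved residual pattern is the usual cue for adding a polynomial term. The following is only a minimal sketch of that idea in plain Python, not the notebook's actual code: the toy data, the quadratic relationship, and the helpers `fit_line` and `r_squared` are hypothetical, and only the feature name Session_Duration_sq comes from this README. For brevity the second fit uses the squared term alone rather than both terms together.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y):
    """Coefficient of determination of the one-predictor least-squares fit."""
    slope, intercept = fit_line(x, y)
    my = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data (hypothetical): calories grow quadratically with duration in hours.
duration = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0]
calories = [100 + 300 * d ** 2 for d in duration]

# Engineered feature, mirroring the Session_Duration_sq idea.
duration_sq = [d ** 2 for d in duration]

r2_linear = r_squared(duration, calories)      # straight line in duration
r2_squared = r_squared(duration_sq, calories)  # straight line in duration^2
```

On this toy data the fit on the squared feature explains the target almost exactly, while the fit on raw duration leaves systematic, curved residuals.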
148
 
 
214
 
215
  Distance from each user to their cluster centroid, indicating how typical or atypical they are within their body type.
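To make the Physique_Dist idea concrete, here is a minimal sketch using only the standard library. It assumes the cluster centroids are already known and that the inputs are scaled; the centroid values, the three-feature layout, and the helper `nearest_centroid_distance` are hypothetical and not taken from the project. With scikit-learn's `KMeans`, the `transform` method returns comparable distances to every centroid.

```python
import math


def nearest_centroid_distance(point, centroids):
    """Return (cluster_index, distance) for the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    idx = min(range(len(centroids)), key=distances.__getitem__)
    return idx, distances[idx]


# Hypothetical, already-scaled body-measurement centroids (three features each).
centroids = [(-1.0, -0.5, -1.0), (0.0, 0.0, 0.0), (1.2, 0.4, 1.1)]

typical_user = (0.1, 0.0, -0.1)    # sits close to the middle centroid
atypical_user = (0.6, 1.5, -0.9)   # same nearest centroid, but far from it

cluster_a, dist_a = nearest_centroid_distance(typical_user, centroids)
cluster_b, dist_b = nearest_centroid_distance(atypical_user, centroids)
```

Both users fall into the same cluster, but the larger distance for the second user marks them as atypical for that body type.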
216
 
217
+ ---
218
+
219
+ # **Part 5: Train and Evaluate Three Improved Models**
220
 
221
  After building the baseline regression model, the next step was to retrain and compare multiple models using the fully engineered feature set created in Part 4. This improved dataset included:
222
 
 
250
 
251
  Session_Duration and its engineered non-linear variant (Session_Duration_sq) were the strongest predictors. Workout_Type (specifically Yoga and HIIT) played a major role, confirming that different activities have distinct metabolic profiles encoded in the data. The engineered feature Fitness_Maturity was also influential, suggesting that combining demographic and behavioral data improves accuracy. Heart-rate anomaly: heart-rate features showed relatively low importance compared to duration, which is consistent with my earlier EDA finding that this dataset prioritizes Duration and Activity Type over physiological features such as heart rate, BMI, and Weight.
252
 
253
+ ---
254
 
255
+ # **Part 6: Pickle file of winning model**
256
 
257
+ [Download Winning Regression Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_regression_pipeline.pkl)
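Once the file has been downloaded, it can be loaded with the standard `pickle` module. This is a sketch under assumptions: the helper name `load_pipeline` and the local file name are hypothetical, and unpickling the real pipeline would presumably require the libraries it was built with to be installed. Because unpickling can execute arbitrary code, it should only be done with files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path


def load_pipeline(path):
    """Unpickle a saved model pipeline from a local file."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Stand-in round trip: a plain dict takes the place of the real pipeline
# so the sketch runs without the downloaded file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_pipeline.pkl"
    demo_path.write_bytes(pickle.dumps({"model": "placeholder"}))
    restored = load_pipeline(demo_path)
```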
258
 
259
+ ---
260
 
261
+ # **Part 7: Regression to Classification**
262
 
263
  7.1 Converting the Regression Target into Classes
264
 
 
310
 
311
  I will still examine Precision, Recall, and F1-scores to confirm the model performs consistently for all classes (e.g., ensuring Medium is not confused with High).
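The per-class check described above can be sketched in plain Python as follows. The labels and predictions are made-up illustration data, and `per_class_report` is a hypothetical helper, not the project's code; scikit-learn's `classification_report` produces the same kind of breakdown.

```python
def per_class_report(y_true, y_pred, labels):
    """Compute one-vs-rest precision, recall, and F1 for each class label."""
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up example: one Medium workout predicted as High, one High as Medium.
y_true = ["Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High"]
y_pred = ["Low", "Low", "Medium", "High", "Medium", "High", "High", "Medium"]

report = per_class_report(y_true, y_pred, ["Low", "Medium", "High"])
```

In this example Low is classified perfectly, while the Medium/High confusion pulls precision and recall for both of those classes down to 2/3, which overall accuracy alone would not reveal.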
312
 
313
+ ---
314
+
315
+ # **Part 8: Train & Evaluate Classification Models**
316
  **8.1: Precision vs Recall + Error Type Analysis**
317
 
318
  In this task, the goal is to classify workouts into Low, Medium, or High calorie burn.
 
392
 
393
  [Download Winning Classification Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_classification_pipeline.pkl)
394
 
395
+ ---
396
+
397
+ # **Conclusion**
398
 
399
  The analysis identified the Random Forest algorithm as the superior model, achieving near-perfect performance for both regression ($R^2 = 0.9999$) and classification (99.88% accuracy). While these metrics demonstrate exceptional predictive power, the remarkably high accuracy, combined with the unexpectedly low feature importance of Heart Rate and Weight, suggests the underlying dataset is likely synthetic. In real-world physiology, heart rate and body mass are critical drivers of energy expenditure; their lower correlation here indicates the data was likely generated using a deterministic formula heavily weighted toward Duration and Activity Type.
400