Update README.md
README.md
CHANGED
|
@@ -34,7 +34,7 @@ Classification: Categorizing a workout as Low, Medium, or High intensity.
|
|
| 34 |
|
| 35 |
**Data Cleaning:** There were no missing values or duplicates, allowing for robust modeling without data imputation.
|
| 36 |
|
| 37 |
-
**
|
| 38 |
|
| 39 |
Sessions last on average ~1.26 hours (SD ≈ 0.34), so most sessions are between ~0.9 and 1.6 hours.
|
| 40 |
|
|
@@ -135,7 +135,7 @@ Top Negative Driver: Workout_Type_Yoga (lower intensity).
|
|
| 135 |
|
| 136 |
Secondary Drivers: Experience_Level, Weight, and BMI had meaningful but smaller impacts.
|
| 137 |
|
| 138 |
-
In
|
| 139 |
|
| 140 |
**Part 4: Feature Engineering**
|
| 141 |
|
|
@@ -213,7 +213,7 @@ Distance from each user to their cluster centroid, indicating how typical or aty
|
|
| 213 |
|
| 214 |
After building the baseline regression model, the next step was to retrain and compare multiple models using the fully engineered feature set created in Part 4. This improved dataset included:
|
| 215 |
|
| 216 |
-
The six engineered numeric
|
| 217 |
|
| 218 |
This richer feature representation allows more advanced models to detect complex, nonlinear relationships that Linear Regression cannot capture.
|
| 219 |
|
|
@@ -381,3 +381,7 @@ The trained Random Forest pipeline, including the scaler, clustering model, thre
|
|
| 381 |
|
| 382 |
[Download Winning Classification Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_classification_pipeline.pkl)
|
| 383 |
|
|
|
| 34 |
|
| 35 |
**Data Cleaning:** There were no missing values or duplicates, allowing for robust modeling without data imputation.
|
| 36 |
|
| 37 |
+
**Descriptive statistics:**
|
| 38 |
|
| 39 |
Sessions last on average ~1.26 hours (SD ≈ 0.34), so most sessions are between ~0.9 and 1.6 hours.
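
The "between ~0.9 and 1.6 hours" range is simply the mean plus or minus one standard deviation. A minimal sketch of that arithmetic, using only the two figures quoted above (the function name `one_sd_range` is illustrative, not taken from the project):

```python
# Figures quoted in the README: mean session duration and its standard deviation.
mean_hours = 1.26
sd_hours = 0.34


def one_sd_range(mean, sd):
    """Return the interval covering one standard deviation either side of the mean."""
    return mean - sd, mean + sd


low, high = one_sd_range(mean_hours, sd_hours)
print(f"{low:.2f} to {high:.2f} hours")
```

If session durations are roughly normally distributed, about 68% of sessions fall inside this interval, which is what "most sessions" amounts to here.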
|
| 40 |
|
|
|
|
| 135 |
|
| 136 |
Secondary Drivers: Experience_Level, Weight, and BMI had meaningful but smaller impacts.
|
| 137 |
|
| 138 |
+
In conclusion, Duration and Activity Type are the dominant predictors, validating the initial EDA findings.
|
| 139 |
|
| 140 |
**Part 4: Feature Engineering**
|
| 141 |
|
|
|
|
| 213 |
|
| 214 |
After building the baseline regression model, the next step was to retrain and compare multiple models using the fully engineered feature set created in Part 4. This improved dataset included:
|
| 215 |
|
| 216 |
+
The six engineered numeric features, Physique_Cluster (one-hot encoded), Physique_Dist, and all original encoded and scaled variables.
|
| 217 |
|
| 218 |
This richer feature representation allows more advanced models to detect complex, nonlinear relationships that Linear Regression cannot capture.
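
As an illustration of the two cluster-derived features, the sketch below shows one way Physique_Cluster (one-hot encoded) and Physique_Dist (distance to the assigned centroid) could be computed for a single user. It assumes Euclidean distance and nearest-centroid assignment; the centroids, the input point, and the function name `physique_features` are hypothetical and not taken from the project.

```python
import math


def physique_features(point, centroids):
    """Return (one-hot cluster vector, distance to the nearest centroid).

    Assumes Euclidean distance; the project's clustering model may differ.
    """
    distances = [math.dist(point, c) for c in centroids]
    cluster = min(range(len(centroids)), key=distances.__getitem__)
    one_hot = [1 if i == cluster else 0 for i in range(len(centroids))]
    return one_hot, distances[cluster]


# Hypothetical centroids in a two-dimensional scaled feature space.
centroids = [(0.0, 0.0), (3.0, 4.0)]
one_hot, dist = physique_features((3.0, 1.0), centroids)
print(one_hot, dist)
```

A larger distance marks a user as less typical of their cluster, which is the information a tree-based model can then exploit alongside the cluster label.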
|
| 219 |
|
|
|
|
| 381 |
|
| 382 |
[Download Winning Classification Model (.pkl)](https://huggingface.co/uleeberber/models_assignment_2/resolve/main/the_winning_classification_pipeline.pkl)
|
| 383 |
|
| 384 |
+
**Conclusion**
|
| 385 |
+
|
| 386 |
+
The analysis identified Random Forest as the best-performing model, achieving near-perfect performance for both regression (R² = 0.9999) and classification (99.88% accuracy). While these metrics demonstrate exceptional predictive power, the remarkably high accuracy, combined with the unexpectedly low feature importance of Heart Rate and Weight, suggests the underlying dataset is likely synthetic. In real-world physiology, heart rate and body mass are critical drivers of energy expenditure; their weak influence here indicates the data was likely generated using a deterministic formula heavily weighted toward Duration and Activity Type.
|
| 387 |
+
|