Tomertg committed on
Commit
f941c6b
·
verified ·
1 Parent(s): e1364cb

Update README.md

Files changed (1)
  1. README.md +38 -33
README.md CHANGED
@@ -2,31 +2,36 @@
2
 
3
  ## Overview
4
 
5
- This project analyzes a large dataset of athlete strength metrics to understand patterns in deadlift performance and build predictive and classification models.
6
 
7
- The work includes:
8
 
9
  - Exploratory Data Analysis (EDA)
10
  - Feature engineering
11
- - Regression modeling
12
- - Classification modeling
13
  - Clustering
14
  - Model selection and export
15
 
16
- The final goal was to classify athletes into performance categories and evaluate which model performs best.
17
 
18
  ---
19
 
20
  ## Dataset
21
 
22
- The dataset includes:
23
 
24
  - Body weight
25
  - Height
26
  - Age
27
  - Strength metrics: deadlift, back squat, snatch
28
 
29
- After cleaning, outliers were removed and missing values handled.
 
 
 
 
 
30
 
31
  ---
32
 
@@ -35,27 +40,27 @@ After cleaning, outliers were removed and missing values handled.
35
  ### Average Deadlift by Body Weight
36
  ![img11](img11.png)
37
 
38
- Heavier weight categories generally show higher deadlift performance.
39
 
40
  ### Average Deadlift by Height
41
  ![img12](img12.png)
42
 
43
- Taller athletes tend to lift more, with increasing variance at higher height ranges.
44
 
45
  ### Average Deadlift by Age
46
  ![img13](img13.png)
47
 
48
- Performance peaks around ages 25–34 and gradually decreases afterward.
49
 
50
  ### Body Ratio and Deadlift
51
  ![img14](img14.png)
52
 
53
- Higher strength-to-body weight ratios correlate with higher deadlift results.
54
 
55
  ### Strength Metric Correlations
56
  ![img15](img15.png)
57
 
58
- Deadlift and back squat show a strong positive correlation, while snatch is weakly correlated.
59
 
60
  ---
61
 
@@ -66,24 +71,24 @@ A baseline linear regression model was trained to predict deadlift performance.
66
  ### Actual vs Predicted Deadlift
67
  ![img16](img16.png)
68
 
69
- The model follows the general trend but shows noise due to variability between athletes.
70
 
71
  ---
72
 
73
  ## Clustering
74
 
75
- K-Means clustering was applied to identify athlete groups based on performance metrics.
76
 
77
  ### Cluster Visualization (PCA)
78
  ![img17](img17.png)
79
 
80
- Three clear performance clusters were identified, separating athletes by overall strength level.
81
 
82
  ---
83
 
84
  ## Classification Modeling
85
 
86
- Athletes were categorized into three balanced deadlift performance classes:
87
 
88
  - Low
89
  - Medium
@@ -110,27 +115,25 @@ Gradient Boosting:
110
 
111
  ## Model Evaluation
112
 
113
- All models achieved high accuracy, precision, recall, and F1-score.
114
-
115
- However:
116
 
117
- - Random Forest made fewer critical misclassifications
118
- - It showed better separation between High and Low classes
119
- - It achieved the highest F1-score
120
 
121
- Therefore, the Random Forest model was selected as the final classification model.
 
 
122
 
123
  ---
124
 
125
  ## Final Model
126
 
127
- The winning model was:
128
 
129
- Random Forest Classifier
130
 
131
- It was trained fully and exported as:
132
 
133
- `classification_winner.pkl`
134
 
135
  ---
136
 
@@ -139,28 +142,30 @@ It was trained fully and exported as:
139
  ```python
140
  import pickle
141
 
142
- with open("classification_winner.pkl", "rb") as f:
143
      model = pickle.load(f)
144
 
145
  prediction = model.predict(X_sample)
146
 
 
 
147
  ## Conclusion
148
 
149
  This project provided several key insights:
150
 
151
  - Weight, height, and body ratio strongly influence deadlift performance
152
- - Age shows a performance peak followed by decline
153
  - Deadlift and back squat are closely related
154
- - Classification models performed extremely well due to clear class separation
155
  - Random Forest proved to be the most reliable model
156
 
157
- This project demonstrates a full machine learning workflow, including:
158
 
159
  - Data exploration
160
  - Feature engineering
161
  - Model training
162
  - Evaluation
163
  - Model selection
164
- - Export and deployment
165
 
166
- The final Random Forest model offers strong predictive performance and can be used to classify athletes into performance categories based on their physical and strength metrics.
 
2
 
3
  ## Overview
4
 
5
+ This project explores a dataset of athlete strength metrics to understand patterns in deadlift performance and to build models that can predict and classify athletes based on strength.
6
 
7
+ The workflow includes:
8
 
9
  - Exploratory Data Analysis (EDA)
10
  - Feature engineering
11
+ - Regression models
12
+ - Classification models
13
  - Clustering
14
  - Model selection and export
15
 
16
+ The final objective was to classify athletes into performance categories and evaluate which model performs best.
17
 
18
  ---
19
 
20
  ## Dataset
21
 
22
+ The dataset contains:
23
 
24
  - Body weight
25
  - Height
26
  - Age
27
  - Strength metrics: deadlift, back squat, snatch
28
 
29
+ After cleaning:
30
+
31
+ - Duplicate rows were removed
32
+ - Placeholder values were replaced
33
+ - Unrealistic values were filtered
34
+ - Missing key fields were dropped
35
 
36
  ---
37
 
 
40
  ### Average Deadlift by Body Weight
41
  ![img11](img11.png)
42
 
43
+ Heavier weight groups generally show higher deadlift performance.
44
 
45
  ### Average Deadlift by Height
46
  ![img12](img12.png)
47
 
48
+ Taller athletes tend to lift more, with higher variability at the upper height ranges.
49
 
50
  ### Average Deadlift by Age
51
  ![img13](img13.png)
52
 
53
+ Performance peaks around ages 25–34 and gradually declines afterward.
54
 
55
  ### Body Ratio and Deadlift
56
  ![img14](img14.png)
57
 
58
+ Higher weight-to-height ratios are associated with stronger lifts.
59
 
60
  ### Strength Metric Correlations
61
  ![img15](img15.png)
62
 
63
+ Deadlift and back squat show a strong positive correlation, while snatch is only weakly related.
64
 
65
  ---
66
 
 
71
  ### Actual vs Predicted Deadlift
72
  ![img16](img16.png)
73
 
74
+ The model follows the general trend but shows noise due to differences between athletes.
75
 
76
  ---
77
 
78
  ## Clustering
79
 
80
+ K-Means clustering was used to group athletes based on strength metrics.
81
 
82
  ### Cluster Visualization (PCA)
83
  ![img17](img17.png)
84
 
85
+ Three performance clusters were identified, separating athletes by overall strength level.
86
 
87
  ---
88
 
89
  ## Classification Modeling
90
 
91
+ Athletes were grouped into three balanced performance classes:
92
 
93
  - Low
94
  - Medium
 
115
 
116
  ## Model Evaluation
117
 
118
+ All models performed well across accuracy, precision, recall, and F1-score.
 
 
119
 
120
+ Random Forest stood out because it:
 
 
121
 
122
+ - Made fewer major misclassifications
123
+ - Separated high and low performers better
124
+ - Achieved the highest F1-score
125
 
126
  ---
127
 
128
  ## Final Model
129
 
130
+ The final selected model:
131
 
132
+ **Random Forest Classifier**
133
 
134
+ It was trained on the full dataset and exported as:
135
 
136
+ `best_classifier.pkl`
137
 
138
  ---
139
 
 
142
  ```python
143
  import pickle
144
 
145
+ with open("best_classifier.pkl", "rb") as f:
146
      model = pickle.load(f)
147
 
148
  prediction = model.predict(X_sample)
149
 
150
+
151
+
152
  ## Conclusion
153
 
154
  This project provided several key insights:
155
 
156
  - Weight, height, and body ratio strongly influence deadlift performance
157
+ - Performance peaks in the late 20s and declines afterward
158
  - Deadlift and back squat are closely related
159
+ - Classification models performed very well due to clear class separation
160
  - Random Forest proved to be the most reliable model
161
 
162
+ The work demonstrates a full machine learning workflow, including:
163
 
164
  - Data exploration
165
  - Feature engineering
166
  - Model training
167
  - Evaluation
168
  - Model selection
169
+ - Export
170
 
171
+ The final Random Forest model delivers strong performance and can be used to classify athletes into strength categories based on their physical and strength metrics.
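
The README describes three balanced performance classes (Low, Medium, High) but this diff does not show how the class boundaries were chosen. One common approach, assumed here purely for illustration, is to cut the deadlift values at their tertiles so that each class receives roughly a third of the athletes. The function name `tertile_labels` and the sample values below are hypothetical and are not taken from the project.

```python
from statistics import quantiles


def tertile_labels(deadlifts):
    """Label each value Low / Medium / High using tertile cut points.

    Assumption: balanced classes are built from tertiles; the README
    does not state the actual method used.
    """
    # quantiles(..., n=3) returns the two cut points that split the
    # data into three groups of roughly equal size.
    low_cut, high_cut = quantiles(deadlifts, n=3, method="inclusive")

    labels = []
    for value in deadlifts:
        if value <= low_cut:
            labels.append("Low")
        elif value <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels


# Hypothetical deadlift values, for illustration only.
sample = [100, 120, 140, 160, 180, 200, 220, 240, 260]
labels = tertile_labels(sample)
print(list(zip(sample, labels)))
```

With ties or very small samples the groups may not come out exactly equal, so it is worth checking the class counts before training a classifier on labels produced this way.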