File size: 3,728 Bytes
c7dc25c
a30c2ac
c7dc25c
a30c2ac
f941c6b
a30c2ac
f941c6b
a30c2ac
c7dc25c
 
f941c6b
 
c7dc25c
 
 
f941c6b
c7dc25c
 
 
 
 
f941c6b
c7dc25c
 
 
 
 
 
f941c6b
 
 
 
 
 
c7dc25c
 
 
 
 
 
 
 
f941c6b
c7dc25c
 
 
 
f941c6b
c7dc25c
 
 
 
f941c6b
c7dc25c
 
 
 
f941c6b
c7dc25c
 
 
 
f941c6b
c7dc25c
 
 
 
 
 
 
 
 
 
f941c6b
c7dc25c
 
 
 
 
f941c6b
c7dc25c
 
 
 
f941c6b
c7dc25c
 
 
 
 
f941c6b
c7dc25c
 
 
 
 
 
 
 
a30c2ac
 
 
c7dc25c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f941c6b
c7dc25c
f941c6b
c7dc25c
f941c6b
 
 
c7dc25c
 
 
 
 
f941c6b
c7dc25c
f941c6b
c7dc25c
f941c6b
a30c2ac
f941c6b
a30c2ac
c7dc25c
a30c2ac
c7dc25c
a30c2ac
 
 
 
f941c6b
a30c2ac
 
c7dc25c
65c101f
f941c6b
76d8163
 
97bb51e
 
 
 
 
f941c6b
97bb51e
f941c6b
97bb51e
 
f941c6b
97bb51e
 
 
 
 
 
f941c6b
97bb51e
005d793
 
 
d88856f
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
# Strength Performance Analysis and Modeling

## Overview

This project explores a dataset of athlete strength metrics to understand patterns in deadlift performance and to build models that can predict and classify athletes based on strength.

The workflow includes:

- Exploratory Data Analysis (EDA)
- Feature engineering
- Regression models
- Classification models
- Clustering
- Model selection and export

The final objective was to classify athletes into performance categories and evaluate which model performs best.

---

## Dataset

The dataset contains:

- Body weight
- Height
- Age
- Strength metrics: deadlift, back squat, snatch

After cleaning:

- Duplicate rows were removed
- Placeholder values were replaced
- Unrealistic values were filtered
- Missing key fields were dropped

---

## Exploratory Data Analysis (EDA)

### Average Deadlift by Body Weight
![img11](img11.png)

Heavier weight groups generally show higher deadlift performance.

### Average Deadlift by Height
![img12](img12.png)

Taller athletes tend to lift more, with higher variability at the upper height ranges.

### Average Deadlift by Age
![img13](img13.png)

Performance peaks around ages 25–34 and gradually declines afterward.

### Body Ratio and Deadlift
![img14](img14.png)

Higher weight-to-height ratios are associated with stronger lifts.

### Strength Metric Correlations
![img15](img15.png)

Deadlift and back squat show a strong positive correlation, while snatch is only weakly related.

---

## Regression Modeling

A baseline linear regression model was trained to predict deadlift performance.

### Actual vs Predicted Deadlift
![img16](img16.png)

The model follows the general trend but shows noise due to differences between athletes.

---

## Clustering

K-Means clustering was used to group athletes based on strength metrics.

### Cluster Visualization (PCA)
![img17](img17.png)

Three performance clusters were identified, separating athletes by overall strength level.

---

## Classification Modeling

Athletes were grouped into three balanced performance classes:

- Low
- Medium
- High

Models trained:

- Logistic Regression
- Random Forest
- Gradient Boosting

### Confusion Matrices

Logistic Regression:
![img18](img18.png)

Random Forest:
![img19](img19.png)

Gradient Boosting:
![img20](img20.png)

---

## Model Evaluation

All models performed well across accuracy, precision, recall, and F1-score.

Random Forest stood out because it:

- Made fewer major misclassifications
- Separated high and low performers better
- Achieved the highest F1-score

---

## Final Model

The final selected model:

**Random Forest Classifier**

It was trained on the full dataset and exported as:

`best_classifier.pkl`

---

## How to Load the Model

```python
import pickle

with open("best_classifier.pkl", "rb") as f:
    model = pickle.load(f)

prediction = model.predict(X_sample)
```

---

## Conclusion

This project provided several key insights:

- Weight, height, and body ratio strongly influence deadlift performance
- Performance peaks in the late 20s and declines afterward
- Deadlift and back squat are closely related
- Classification models performed very well due to clear class separation
- Random Forest proved to be the most reliable model

The work demonstrates a full machine learning workflow, including:

- Data exploration
- Feature engineering
- Model training
- Evaluation
- Model selection
- Export

The final Random Forest model delivers strong performance and can be used to classify athletes into strength categories based on their physical and strength metrics.


## Presentation Video

[![Watch the video](https://img.youtube.com/vi/R0YGueMVqko/0.jpg)](https://youtu.be/R0YGueMVqko)