# Strength Performance Analysis and Modeling
## Overview
This project explores a dataset of athlete strength metrics to understand patterns in deadlift performance and to build models that can predict and classify athletes based on strength.
The workflow includes:
- Exploratory Data Analysis (EDA)
- Feature engineering
- Regression models
- Classification models
- Clustering
- Model selection and export
The final objective was to classify athletes into performance categories and evaluate which model performs best.
---
## Dataset
The dataset contains:
- Body weight
- Height
- Age
- Strength metrics: deadlift, back squat, snatch
Cleaning steps applied:
- Duplicate rows were removed
- Placeholder values were replaced
- Unrealistic values were filtered
- Rows with missing key fields were dropped
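The cleaning steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the rule that non-positive numbers are placeholders, and the plausibility bounds are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and bounds; the real dataset may differ.
KEY_COLUMNS = ["weight", "height", "age", "deadlift"]
BOUNDS = {"age": (14, 80), "deadlift": (20, 500)}

def clean_athletes(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, blank out placeholders, filter outliers, drop incomplete rows."""
    df = df.drop_duplicates().copy()
    # Assumption: zero or negative numbers are placeholder codes, so mark them missing.
    numeric = list(df.select_dtypes(include="number").columns)
    df[numeric] = df[numeric].where(df[numeric] > 0)
    # between() is False for NaN, so rows missing a bounded column are dropped here too.
    for col, (low, high) in BOUNDS.items():
        df = df[df[col].between(low, high)]
    # Drop rows still missing any key field.
    return df.dropna(subset=KEY_COLUMNS).reset_index(drop=True)

raw = pd.DataFrame({
    "weight": [80, 80, 0, 90, 70],
    "height": [180, 180, 175, 185, 170],
    "age": [25, 25, 30, 150, 40],
    "deadlift": [200, 200, 180, 220, np.nan],
})
clean = clean_athletes(raw)
```

In this toy table only the first row survives: the others are a duplicate, a placeholder weight, an implausible age, and a missing deadlift.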
---
## Exploratory Data Analysis (EDA)
### Average Deadlift by Body Weight
![img11](img11.png)
Heavier weight groups generally show higher deadlift performance.
### Average Deadlift by Height
![img12](img12.png)
Taller athletes tend to lift more, with higher variability at the upper height ranges.
### Average Deadlift by Age
![img13](img13.png)
Performance peaks around ages 25–34 and gradually declines afterward.
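A grouped average like the one plotted above could be computed along these lines. The bin edges and the `age` / `deadlift` column names are assumptions; only the 25–34 group is named in this README.

```python
import pandas as pd

# Assumed age bins; with integer ages, (24, 34] covers ages 25 through 34.
AGE_BINS = [17, 24, 34, 44, 54, 120]
AGE_LABELS = ["18-24", "25-34", "35-44", "45-54", "55+"]

def mean_deadlift_by_age(df: pd.DataFrame) -> pd.Series:
    """Average deadlift per age group (column names are hypothetical)."""
    groups = pd.cut(df["age"], bins=AGE_BINS, labels=AGE_LABELS)
    return df.groupby(groups, observed=True)["deadlift"].mean()

sample = pd.DataFrame({"age": [20, 25, 34, 40], "deadlift": [100, 200, 220, 180]})
by_age = mean_deadlift_by_age(sample)
```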
### Body Ratio and Deadlift
![img14](img14.png)
Higher weight-to-height ratios are associated with stronger lifts.
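The exact definition of the body ratio is not stated here. One plausible reading, shown as an assumption, is body weight divided by height, using hypothetical `weight` and `height` columns:

```python
import pandas as pd

def add_body_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with an assumed weight-to-height ratio feature."""
    out = df.copy()
    # Assumption: plain ratio; the project may have used different units or a different formula.
    out["body_ratio"] = out["weight"] / out["height"]
    return out

athletes = add_body_ratio(pd.DataFrame({"weight": [90.0, 60.0], "height": [180.0, 150.0]}))
```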
### Strength Metric Correlations
![img15](img15.png)
Deadlift and back squat show a strong positive correlation, while snatch is only weakly related.
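A correlation matrix of this kind can be produced with `DataFrame.corr`. The sketch below assumes Pearson correlation and hypothetical column names; the toy numbers are made up purely to show the call.

```python
import pandas as pd

# Hypothetical column names for the three strength metrics.
STRENGTH_COLUMNS = ["deadlift", "backsquat", "snatch"]

def strength_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between the strength metrics."""
    return df[STRENGTH_COLUMNS].corr(method="pearson")

toy = pd.DataFrame({
    "deadlift": [100, 150, 200, 250],
    "backsquat": [80, 120, 160, 200],   # exactly proportional to deadlift in this toy data
    "snatch": [60, 40, 70, 50],
})
corr = strength_correlations(toy)
```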
---
## Regression Modeling
A baseline linear regression model was trained to predict deadlift performance.
### Actual vs Predicted Deadlift
![img16](img16.png)
The model follows the general trend but shows noise due to differences between athletes.
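A baseline of this kind might look like the following sketch. The data is synthetic and the feature set (weight, height, age), split ratio, and metrics are assumptions, since the README does not list them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features and target come from the cleaned dataset.
rng = np.random.default_rng(0)
weight = rng.normal(80, 12, 300)
height = rng.normal(175, 8, 300)
age = rng.uniform(18, 55, 300)
X = np.column_stack([weight, height, age])
y = 1.8 * weight + 0.4 * height - 0.3 * age + rng.normal(0, 10, 300)

# Hold out 20% of the rows to compare actual and predicted deadlift.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
```

Plotting `y_test` against `y_pred` gives an actual-vs-predicted scatter like the one shown above.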
---
## Clustering
K-Means clustering was used to group athletes based on strength metrics.
### Cluster Visualization (PCA)
![img17](img17.png)
Three performance clusters were identified, separating athletes by overall strength level.
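The clustering and its 2D projection could be reproduced along these lines. The data below is synthetic, and standardizing the features before K-Means is an assumption rather than a confirmed step of this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three strength levels over (deadlift, back squat, snatch).
rng = np.random.default_rng(0)
centers = np.array([[120, 90, 50], [180, 140, 70], [240, 190, 95]])
X = np.vstack([rng.normal(c, 10, size=(100, 3)) for c in centers])

# Scale so that no single lift dominates the distance computation (assumed step).
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Project to two principal components for plotting only; clustering used all features.
coords = PCA(n_components=2).fit_transform(X_scaled)
```

A scatter plot of `coords` colored by `labels` yields a cluster view similar to the figure above.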
---
## Classification Modeling
Athletes were grouped into three balanced performance classes:
- Low
- Medium
- High
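How the classes were built is not described here. One common way to get three balanced classes, shown purely as an assumption, is to split the deadlift values at their tertiles with `pandas.qcut`:

```python
import pandas as pd

def strength_classes(deadlift: pd.Series) -> pd.Series:
    """Assign Low/Medium/High by deadlift tertile (assumed labeling rule)."""
    return pd.qcut(deadlift, q=3, labels=["Low", "Medium", "High"])

lifts = pd.Series(range(100, 400, 10))  # 30 evenly spaced toy values
classes = strength_classes(lifts)
counts = classes.value_counts()
```

Quantile-based cuts give near-equal class sizes by construction, which avoids the need for class reweighting later.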
Models trained:
- Logistic Regression
- Random Forest
- Gradient Boosting
### Confusion Matrices
Logistic Regression:
![img18](img18.png)
Random Forest:
![img19](img19.png)
Gradient Boosting:
![img20](img20.png)
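Confusion matrices for the three models can be computed as in the sketch below. The data is synthetic, and the hyperparameters, the stratified 75/25 split, and the scaling step for Logistic Regression are assumptions, not the project's recorded settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: classes 0=Low, 1=Medium, 2=High with shifted feature means.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(0, 1, size=(300, 4)) + y[:, None] * 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Rows are true classes, columns are predicted classes, in Low/Medium/High order.
matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    matrices[name] = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])
```

The corner cells `[0, 2]` and `[2, 0]` of each matrix count Low/High mix-ups, the most serious kind of error in this setup.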
---
## Model Evaluation
All models performed well across accuracy, precision, recall, and F1-score.
Random Forest stood out because it:
- Made fewer major misclassifications
- Separated high and low performers better
- Achieved the highest F1-score
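The four metrics can be computed from true and predicted labels as sketched below. Macro averaging is an assumption (it weights the three classes equally, which suits balanced classes), and the toy labels are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels: one Low athlete is misclassified as Medium.
y_true = ["Low", "Low", "Medium", "Medium", "High", "High"]
y_pred = ["Low", "Medium", "Medium", "Medium", "High", "High"]
scores = summarize(y_true, y_pred)
```

Calling `summarize` once per model on the same held-out set gives a like-for-like comparison.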
---
## Final Model
The final selected model:
**Random Forest Classifier**
It was trained on the full dataset and exported as:
`best_classifier.pkl`
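An export step of this kind can be as small as the sketch below. The helper name `export_model` is hypothetical; it simply wraps `pickle.dump`.

```python
import pickle

def export_model(model, path="best_classifier.pkl"):
    """Serialize a fitted model to disk (hypothetical helper around pickle.dump)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that created them, so it helps to record that version alongside the file.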
---
## How to Load the Model
```python
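import pickle

# Only unpickle files from a source you trust; pickle can execute arbitrary code.
with open("best_classifier.pkl", "rb") as f:
    model = pickle.load(f)

# X_sample: rows with the same feature columns, in the same order, as used in training.
prediction = model.predict(X_sample)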
```
---
## Conclusion
This project provided several key insights:
- Body weight, height, and weight-to-height ratio are all positively associated with deadlift performance
- Performance peaks around ages 25–34 and gradually declines afterward
- Deadlift and back squat are closely related
- Classification models performed very well due to clear class separation
- Random Forest proved to be the most reliable model
The work demonstrates a full machine learning workflow, including:
- Data exploration
- Feature engineering
- Model training
- Evaluation
- Model selection
- Export
The final Random Forest model delivers strong performance and can be used to classify athletes into strength categories based on their physical and strength metrics.
## Presentation Video
[![Watch the video](https://img.youtube.com/vi/R0YGueMVqko/0.jpg)](https://youtu.be/R0YGueMVqko)