Strength Performance Analysis and Modeling
Overview
This project analyzes a large dataset of athlete strength metrics to understand patterns in deadlift performance and build predictive and classification models.
The work includes:
- Exploratory Data Analysis (EDA)
- Feature engineering
- Regression modeling
- Classification modeling
- Clustering
- Model selection and export
The final goal was to classify athletes into performance categories and to determine which model performed best.
Dataset
The dataset includes:
- Body weight
- Height
- Age
- Strength metrics: deadlift, back squat, snatch
During cleaning, outliers were removed and missing values were handled.
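The exact cleaning rules are not stated here, so the following is only a minimal sketch of one common approach. It assumes pandas, a hypothetical `deadlift` column, median imputation for missing values, and the 1.5 × IQR rule for outliers; none of these choices are confirmed by this project.

```python
import numpy as np
import pandas as pd


def clean_strength_data(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Fill missing values with the column median, then drop IQR outliers.

    Assumption: median imputation and the 1.5 * IQR rule are illustrative
    choices, not necessarily the ones used in this project.
    """
    cleaned = df.copy()
    for col in columns:
        cleaned[col] = cleaned[col].fillna(cleaned[col].median())

    # Keep a row only if every listed column is inside its IQR fence.
    keep = pd.Series(True, index=cleaned.index)
    for col in columns:
        q1, q3 = cleaned[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= cleaned[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return cleaned[keep].reset_index(drop=True)


# Tiny made-up example: one missing value and one implausible entry.
raw = pd.DataFrame({"deadlift": [100.0, 110.0, np.nan, 120.0, 105.0, 900.0]})
clean = clean_strength_data(raw, ["deadlift"])
```

Imputing before filtering keeps rows that are only missing a value, while the 900 entry falls outside the fence and is dropped.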
Exploratory Data Analysis (EDA)
Average Deadlift by Body Weight
Heavier weight categories generally show higher deadlift performance.
Average Deadlift by Height
Taller athletes tend to lift more, with increasing variance at higher height ranges.
Average Deadlift by Age
Performance peaks around ages 25–34 and gradually decreases afterward.
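The three averages above are grouped means over binned variables. A minimal sketch of that kind of aggregation follows, using age as the example. The column names `age` and `deadlift`, the bin edges, and the rows themselves are assumptions made for illustration; the edges were chosen only so that one bin matches the 25–34 range mentioned above.

```python
import pandas as pd


def average_by_bin(df, value_col, by_col, edges, labels):
    """Mean of value_col within bins of by_col (bins are left-closed)."""
    bins = pd.cut(df[by_col], bins=edges, labels=labels, right=False)
    return df.groupby(bins, observed=True)[value_col].mean()


# Made-up rows for illustration only.
athletes = pd.DataFrame(
    {
        "age": [19, 22, 27, 31, 33, 40, 47],
        "deadlift": [120.0, 130.0, 160.0, 170.0, 165.0, 150.0, 140.0],
    }
)
age_means = average_by_bin(
    athletes,
    "deadlift",
    "age",
    edges=[18, 25, 35, 45, 55],
    labels=["18-24", "25-34", "35-44", "45-54"],
)
```

The same helper could be reused for body weight or height by passing a different `by_col` and set of edges.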
Body Ratio and Deadlift
Higher strength-to-body-weight ratios correlate with higher deadlift results.
Strength Metric Correlations
Deadlift and back squat show a strong positive correlation, while snatch is weakly correlated.
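A pairwise correlation table is one way to reach this kind of comparison. The sketch below assumes pandas and Pearson correlation, with hypothetical column names and made-up values that merely mimic the pattern described (they are not project data).

```python
import pandas as pd

# Made-up values for illustration; column names are assumptions.
lifts = pd.DataFrame(
    {
        "deadlift": [140, 160, 180, 200, 220],
        "back_squat": [120, 135, 155, 170, 190],
        "snatch": [60, 75, 58, 80, 62],
    }
)

# Pearson correlation between every pair of columns.
corr = lifts.corr(method="pearson")

# Correlation of each other lift with deadlift, strongest first.
deadlift_corr = corr["deadlift"].drop("deadlift").sort_values(ascending=False)
```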
Regression Modeling
A baseline linear regression model was trained to predict deadlift performance.
Actual vs Predicted Deadlift
The model follows the general trend but shows noise due to variability between athletes.
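For reference, a baseline of this kind might look like the sketch below. It assumes scikit-learn and uses synthetic data; the choice of body weight and back squat as features, the coefficients, and the noise level are all invented for the example and say nothing about the project's actual fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: deadlift as a noisy linear function of
# body weight and back squat (features and coefficients are assumptions).
rng = np.random.default_rng(0)
body_weight = rng.uniform(55, 110, size=300)
back_squat = rng.uniform(60, 200, size=300)
deadlift = 20 + 0.4 * body_weight + 0.9 * back_squat + rng.normal(0, 8, size=300)

X = np.column_stack([body_weight, back_squat])
X_train, X_test, y_train, y_test = train_test_split(
    X, deadlift, test_size=0.2, random_state=0
)

# Fit on the training split and predict the held-out athletes.
baseline = LinearRegression().fit(X_train, y_train)
y_pred = baseline.predict(X_test)

# Held-out metrics; plotting y_test against y_pred gives the
# actual-vs-predicted view.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
```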
Clustering
K-Means clustering was applied to identify athlete groups based on performance metrics.
Cluster Visualization (PCA)
Three clear performance clusters were identified, separating athletes by overall strength level.
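A rough sketch of this step, assuming scikit-learn, standardized inputs, and three synthetic strength levels in place of the real data. Whether the project scaled features before clustering is not stated, so the `StandardScaler` step is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three strength levels across
# deadlift, back squat, and snatch (values are made up).
rng = np.random.default_rng(0)
centers = np.array([[120, 100, 50], [170, 145, 70], [220, 190, 90]], dtype=float)
metrics = np.vstack([c + rng.normal(0, 5, size=(50, 3)) for c in centers])

# Standardize so no single lift dominates the distance computation.
scaled = StandardScaler().fit_transform(metrics)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_

# Project to two components purely for plotting the clusters.
coords = PCA(n_components=2).fit_transform(scaled)
```

PCA is used here only for the 2D view; the cluster assignment itself comes from K-Means on the scaled metrics.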
Classification Modeling
Athletes were categorized into three balanced deadlift performance classes:
- Low
- Medium
- High
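How the class boundaries were chosen is not described. One way to obtain three balanced classes is a tertile split, sketched here with `pandas.qcut` on made-up deadlift values. Treat the tertile approach as an assumption rather than the project's documented method.

```python
import pandas as pd

# Made-up deadlift values for illustration only.
deadlifts = pd.Series([100, 110, 120, 130, 140, 150, 160, 170, 180], dtype=float)

# Split at the 1/3 and 2/3 quantiles so each class has the same size.
performance_class = pd.qcut(deadlifts, q=3, labels=["Low", "Medium", "High"])
class_counts = performance_class.value_counts()
```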
Models trained:
- Logistic Regression
- Random Forest
- Gradient Boosting
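A minimal sketch of training these three model families side by side, assuming scikit-learn. The data comes from `make_classification` as a synthetic stand-in, and all hyperparameters (including scaling before logistic regression) are illustrative defaults, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in for the athlete features and labels.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the same split and record held-out accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)
```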
Confusion Matrices
Model Evaluation
All models achieved high accuracy, precision, recall, and F1-score.
However:
- Random Forest made fewer critical misclassifications
- It showed better separation between High and Low classes
- It achieved the highest F1-score
Therefore, the Random Forest model was selected as the final classification model.
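To make the selection criteria concrete, the sketch below computes macro F1 and counts Low/High confusions from a confusion matrix. It assumes scikit-learn, and it assumes that "critical misclassification" means confusing Low with High in either direction; that reading is an interpretation, and the labels are hand-made.

```python
from sklearn.metrics import confusion_matrix, f1_score

CLASSES = ["Low", "Medium", "High"]


def summarize(y_true, y_pred):
    """Return macro F1 and the number of Low<->High confusions."""
    cm = confusion_matrix(y_true, y_pred, labels=CLASSES)
    # Rows are true classes, columns are predicted classes.
    critical = int(cm[0, 2] + cm[2, 0])
    macro_f1 = f1_score(y_true, y_pred, labels=CLASSES, average="macro")
    return macro_f1, critical


# Hand-made labels for illustration only.
y_true = ["Low", "Low", "Medium", "Medium", "High", "High"]
pred_a = ["Low", "High", "Medium", "Medium", "High", "High"]    # one Low -> High error
pred_b = ["Low", "Medium", "Medium", "Medium", "High", "High"]  # one Low -> Medium error

f1_a, critical_a = summarize(y_true, pred_a)
f1_b, critical_b = summarize(y_true, pred_b)
```

In this toy case both prediction sets have the same macro F1, yet only the first contains a Low/High confusion, which is why counting such errors separately can matter when scores are close.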
Final Model
The winning model was:
Random Forest Classifier
It was trained fully and exported as:
classification_winner.pkl
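The export step could be sketched as follows, assuming the standard-library `pickle` module and a scikit-learn Random Forest fitted on synthetic stand-in data. The temporary directory is used only so this example does not overwrite a real export.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the full training data.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
final_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Write the fitted model to disk in binary mode.
export_path = os.path.join(tempfile.mkdtemp(), "classification_winner.pkl")
with open(export_path, "wb") as f:
    pickle.dump(final_model, f)

# Read it back to confirm the round trip preserves predictions.
with open(export_path, "rb") as f:
    restored = pickle.load(f)
```

Pickled estimators are tied to the library versions that produced them, so it is worth recording the scikit-learn version alongside the file.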
How to Load the Model
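import pickle

# Load the exported classifier (only unpickle files from a trusted source).
with open("classification_winner.pkl", "rb") as f:
    model = pickle.load(f)

# X_sample: feature rows in the same column order used during training.
prediction = model.predict(X_sample)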
## Conclusion
This project provided several key insights:
- Body weight, height, and body ratio are strongly associated with deadlift performance
- Age shows a performance peak followed by decline
- Deadlift and back squat are closely related
- Classification models performed extremely well due to clear class separation
- Random Forest proved to be the most reliable model
This project demonstrates a full machine learning workflow, including:
- Data exploration
- Feature engineering
- Model training
- Evaluation
- Model selection
- Export and deployment
The final Random Forest model offers strong predictive performance and can be used to classify athletes into performance categories based on their physical and strength metrics.