# Strength Performance Analysis and Modeling

## Overview

This project analyzes a large dataset of athlete strength metrics to understand patterns in deadlift performance and build predictive and classification models.

The work includes:

- Exploratory Data Analysis (EDA)
- Feature engineering
- Regression modeling
- Classification modeling
- Clustering
- Model selection and export

The final goal was to classify athletes into performance categories and evaluate which model performs best.

---

## Dataset

The dataset includes:

- Body weight
- Height
- Age
- Strength metrics: deadlift, back squat, snatch

During cleaning, outliers were removed and missing values were handled.
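
The exact cleaning rules are not documented here, so the following is only a minimal sketch of one common approach: drop rows with missing values, then drop rows outside the 1.5×IQR fences. The function name `clean_strength_data`, the column names, and the IQR rule itself are assumptions made for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def clean_strength_data(df, columns, k=1.5):
    """Drop rows with missing values, then rows outside the IQR fences."""
    out = df.dropna(subset=columns)
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Keep only rows inside [q1 - k*iqr, q3 + k*iqr] for this column
        out = out[out[col].between(q1 - k * iqr, q3 + k * iqr)]
    return out.reset_index(drop=True)


# Tiny synthetic example (hypothetical column names, not the project data)
demo = pd.DataFrame({
    "weight": [70, 80, 90, 85, 75, 82, np.nan, 78],
    "deadlift": [140, 160, 180, 170, 150, 165, 155, 5000],
})
cleaned = clean_strength_data(demo, ["weight", "deadlift"])
print(cleaned)
```

In this example the row with the missing weight and the row with the implausible 5000 deadlift are both dropped.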

---

## Exploratory Data Analysis (EDA)

### Average Deadlift by Body Weight
![img11](img11.png)

Heavier weight categories generally show higher deadlift performance.

### Average Deadlift by Height
![img12](img12.png)

Taller athletes tend to lift more, with increasing variance at higher height ranges.

### Average Deadlift by Age
![img13](img13.png)

Performance peaks around ages 25–34 and gradually decreases afterward.

### Body Ratio and Deadlift
![img14](img14.png)

Higher strength-to-body weight ratios correlate with higher deadlift results.

### Strength Metric Correlations
![img15](img15.png)

Deadlift and back squat show a strong positive correlation, while snatch is weakly correlated.
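
A correlation matrix like the one plotted above can be computed with pandas. This is a sketch on synthetic data built to mimic the pattern described; the column names `deadlift`, `backsq`, and `snatch` are assumed, not taken from the project code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic stand-in data: deadlift depends on back squat, snatch is independent
backsq = rng.normal(120, 25, n)
df = pd.DataFrame({
    "deadlift": 1.2 * backsq + rng.normal(0, 10, n),
    "backsq": backsq,
    "snatch": rng.normal(60, 15, n),
})

# Pairwise Pearson correlations between the strength metrics
corr = df.corr(method="pearson")
print(corr.round(2))
```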

---

## Regression Modeling

A baseline linear regression model was trained to predict deadlift performance.

### Actual vs Predicted Deadlift
![img16](img16.png)

The predictions follow the general trend, but individual points scatter around it because of variability between athletes.
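
A baseline of this kind might look like the sketch below, which uses scikit-learn's `LinearRegression`. The features (weight, height, age), the train/test split, and the synthetic data are assumptions for illustration; the project's actual feature set is not specified here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in features: body weight (kg), height (cm), age (years)
weight = rng.normal(80, 12, n)
height = rng.normal(175, 8, n)
age = rng.uniform(18, 55, n)
X = np.column_stack([weight, height, age])

# Synthetic target: a linear signal plus athlete-to-athlete noise
y = 1.5 * weight + 0.3 * height + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2: {r2:.2f}, MAE: {mae:.1f}")
```

Plotting `y_test` against `y_pred` gives an actual-vs-predicted chart of the kind shown above.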

---

## Clustering

K-Means clustering was applied to identify athlete groups based on performance metrics.

### Cluster Visualization (PCA)
![img17](img17.png)

Three clear performance clusters were identified, separating athletes by overall strength level.
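
The clustering step can be sketched as follows: scale the metrics, fit K-Means with three clusters, and project to two principal components for plotting. The scaling step, the chosen metrics, and the synthetic data are assumptions; only K-Means, PCA, and the three clusters come from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: three strength levels (deadlift, back squat, snatch)
centers = np.array([[100, 80, 40], [150, 120, 60], [200, 160, 80]])
X = np.vstack([rng.normal(c, 8, size=(100, 3)) for c in centers])

# Standardize so that no single metric dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)

# Assign each athlete to one of three clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Reduce to 2D for visualization only; clustering ran on all scaled features
coords = PCA(n_components=2).fit_transform(X_scaled)
print(coords[:3], labels[:3])
```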

---

## Classification Modeling

Athletes were categorized into three balanced deadlift performance classes:

- Low
- Medium
- High
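
One way to obtain three balanced classes is to cut the deadlift values at their tertiles. Whether the project used this exact rule is not stated, so treat the sketch below as an assumption; the data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the deadlift column
deadlift = pd.Series(rng.normal(160, 30, 300), name="deadlift")

# Quantile-based binning: each class receives roughly one third of the athletes
performance = pd.qcut(deadlift, q=3, labels=["Low", "Medium", "High"])

counts = performance.value_counts()
print(counts)
```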

Models trained:

- Logistic Regression
- Random Forest
- Gradient Boosting
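
Training the three models and collecting a confusion matrix for each could look roughly like this. The hyperparameters, the stratified split, the scaling pipeline for Logistic Regression, and the synthetic data are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 600

# Synthetic stand-in features: body weight, height, back squat
X = np.column_stack([
    rng.normal(80, 12, n),
    rng.normal(175, 8, n),
    rng.normal(120, 25, n),
])
score = 0.5 * X[:, 0] + 1.2 * X[:, 2] + rng.normal(0, 8, n)

# Tertile labels: 0 = Low, 1 = Medium, 2 = High
y = np.digitize(score, np.quantile(score, [1 / 3, 2 / 3]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Rows are true classes, columns are predicted classes
    matrices[name] = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])
    print(name)
    print(matrices[name])
```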

### Confusion Matrices

Logistic Regression:
![img18](img18.png)

Random Forest:
![img19](img19.png)

Gradient Boosting:
![img20](img20.png)

---

## Model Evaluation

All models achieved high accuracy, precision, recall, and F1-score.

However:

- Random Forest made fewer critical misclassifications
- It showed better separation between High and Low classes
- It achieved the highest F1-score

Therefore, the Random Forest model was selected as the final classification model.
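
The comparison criteria above can be computed from predictions as in the sketch below. "Critical misclassification" is not defined in this document; the sketch assumes it means a Low athlete predicted as High or the reverse. The helper name `summarize` and the toy labels are hypothetical.

```python
from sklearn.metrics import confusion_matrix, f1_score

LABELS = ["Low", "Medium", "High"]


def summarize(y_true, y_pred):
    """Return macro F1 and the number of Low<->High confusions."""
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    return {
        "macro_f1": f1_score(y_true, y_pred, labels=LABELS, average="macro"),
        # Assumed definition: opposite-extreme errors (Low as High, High as Low)
        "critical_errors": int(cm[0, 2] + cm[2, 0]),
    }


# Toy example, not project results
y_true = ["Low", "Low", "Medium", "Medium", "High", "High"]
y_pred = ["Low", "Medium", "Medium", "Medium", "High", "Low"]
summary = summarize(y_true, y_pred)
print(summary)
```

Running `summarize` on each model's test predictions gives a like-for-like basis for choosing between them.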

---

## Final Model

The selected model, a Random Forest classifier, was fully trained and exported as:

`classification_winner.pkl`
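
An export step of this kind can be sketched with the standard `pickle` module. The training data and hyperparameters below are placeholders, and the file is written to a temporary directory so the example does not overwrite a real model file.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Placeholder training data standing in for the real feature matrix and labels
X = rng.normal(size=(120, 4))
y = np.where(X[:, 0] > 0.5, "High", np.where(X[:, 0] < -0.5, "Low", "Medium"))

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Serialize the fitted model to disk
out_path = Path(tempfile.mkdtemp()) / "classification_winner.pkl"
with open(out_path, "wb") as f:
    pickle.dump(model, f)

# Round-trip check: the restored model should give identical predictions
with open(out_path, "rb") as f:
    restored = pickle.load(f)
same = np.array_equal(model.predict(X), restored.predict(X))
print(out_path, same)
```

A pickled scikit-learn model is generally only safe to load with the same library version it was saved with, so it helps to record that version alongside the file.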

---

## How to Load the Model

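```python
import pickle

# Only unpickle files from a trusted source: pickle can run arbitrary code on load.
with open("classification_winner.pkl", "rb") as f:
    model = pickle.load(f)

# X_sample must be a 2D array or DataFrame with the same feature columns,
# in the same order, as the data used for training.
prediction = model.predict(X_sample)
```
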
## Conclusion

This project provided several key insights:

- Weight, height, and body ratio are strongly associated with deadlift performance
- Age shows a performance peak followed by decline
- Deadlift and back squat are closely related
- Classification models performed extremely well due to clear class separation
- Random Forest proved to be the most reliable model

This project demonstrates a full machine learning workflow, including:

- Data exploration
- Feature engineering
- Model training
- Evaluation
- Model selection
- Export and deployment

The final Random Forest model offers strong predictive performance and can be used to classify athletes into performance categories based on their physical and strength metrics.