
Strength Performance Analysis and Modeling

Overview

This project analyzes a large dataset of athlete strength metrics to understand patterns in deadlift performance and build predictive and classification models.

The work includes:

  • Exploratory Data Analysis (EDA)
  • Feature engineering
  • Regression modeling
  • Classification modeling
  • Clustering
  • Model selection and export

The final goal was to classify athletes into performance categories and evaluate which model performs best.


Dataset

The dataset includes:

  • Body weight
  • Height
  • Age
  • Strength metrics: deadlift, back squat, snatch

During cleaning, outliers were removed and missing values were handled.
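The README does not say which cleaning rules were used. As a minimal sketch of one common approach, the snippet below fills missing values with the column median and drops rows outside 1.5 × IQR. The column names, the toy values, and both rules are assumptions for illustration, not the project's actual procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are assumptions, not the project's schema.
df = pd.DataFrame({
    "body_weight": [70.0, 82.0, np.nan, 95.0, 88.0, 76.0],
    "deadlift": [140.0, 180.0, 160.0, 900.0, 200.0, 150.0],
})

# Fill missing values with each column's median (one common choice).
df = df.fillna(df.median(numeric_only=True))

# Keep rows whose deadlift lies within 1.5 * IQR of the quartiles.
q1, q3 = df["deadlift"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["deadlift"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[mask].reset_index(drop=True)

print(clean)
```

Here the implausible 900 deadlift entry is dropped and the missing body weight is imputed. Other thresholds or imputation strategies may suit the real data better.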


Exploratory Data Analysis (EDA)

Average Deadlift by Body Weight

img11

Heavier weight categories generally show higher deadlift performance.

Average Deadlift by Height

img12

Taller athletes tend to lift more, with increasing variance at higher height ranges.

Average Deadlift by Age

img13

Performance peaks around ages 25–34 and gradually decreases afterward.

Body Ratio and Deadlift

img14

Higher strength-to-body weight ratios correlate with higher deadlift results.

Strength Metric Correlations

img15

Deadlift and back squat show a strong positive correlation, while snatch is weakly correlated.
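A correlation matrix like the one plotted above can be computed with NumPy. The sketch below uses synthetic data built so that back squat tracks deadlift while snatch is independent. The values are made up to illustrate the computation and are not the project's results.

```python
import numpy as np

# Synthetic stand-in data (kg); not taken from the project's dataset.
rng = np.random.default_rng(0)
n = 200
deadlift = rng.normal(180, 30, n)
back_squat = 0.8 * deadlift + rng.normal(0, 5, n)  # built to track deadlift
snatch = rng.normal(70, 10, n)                     # built to be independent

# Each row passed to corrcoef is treated as one variable.
corr = np.corrcoef([deadlift, back_squat, snatch])

print(np.round(corr, 2))
```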


Regression Modeling

A baseline linear regression model was trained to predict deadlift performance.
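A baseline of this kind might look like the following sketch with scikit-learn. The feature set (body weight, height, age), the synthetic data, and the 80/20 split are assumptions, since the README does not list the exact inputs or split used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real feature set is not specified in this README.
rng = np.random.default_rng(0)
n = 400
body_weight = rng.normal(85, 12, n)  # kg
height = rng.normal(178, 8, n)       # cm
age = rng.uniform(18, 55, n)         # years
deadlift = 1.5 * body_weight + 0.4 * height - 0.5 * age + rng.normal(0, 10, n)

X = np.column_stack([body_weight, height, age])
X_train, X_test, y_train, y_test = train_test_split(
    X, deadlift, test_size=0.2, random_state=0
)

# Fit on the training split, then score on held-out athletes.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"R^2 on held-out data: {r2:.3f}")
```

Plotting `y_test` against `y_pred` gives an actual-vs-predicted chart of the kind shown in this section.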

Actual vs Predicted Deadlift

img16

The model follows the general trend, but predictions scatter around it because of variability between athletes.


Clustering

K-Means clustering was applied to identify athlete groups based on performance metrics.

Cluster Visualization (PCA)

img17

Three clear performance clusters were identified, separating athletes by overall strength level.
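The clustering and its 2D view can be sketched as follows. The three synthetic strength levels, the scaling step, and the K-Means settings are assumptions for illustration, as the README does not state the features or parameters used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three synthetic strength levels (deadlift, back squat, snatch in kg).
rng = np.random.default_rng(0)
centers = np.array([[120, 90, 50], [180, 140, 75], [240, 190, 100]])
X = np.vstack([c + rng.normal(0, 8, size=(100, 3)) for c in centers])

# Scale features so no single lift dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Assign each athlete to one of three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Project to two principal components for plotting.
coords = PCA(n_components=2).fit_transform(X_scaled)

print(coords[:3], cluster_ids[:3])
```

A scatter plot of `coords` colored by `cluster_ids` produces a PCA cluster view similar to the figure above.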


Classification Modeling

Athletes were categorized into three balanced deadlift performance classes:

  • Low
  • Medium
  • High
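One way to obtain three balanced classes is to cut the deadlift values at their tertiles. The README does not state how the classes were defined, so the quantile split and the synthetic data below are assumptions.

```python
import numpy as np

# Synthetic deadlift values (kg); not taken from the project's dataset.
rng = np.random.default_rng(0)
deadlift = rng.normal(180, 30, size=300)

# Tertile cut points split the data into three roughly equal groups.
low_cut, high_cut = np.quantile(deadlift, [1 / 3, 2 / 3])

# np.digitize returns 0, 1, or 2 depending on which bin a value falls in.
codes = np.digitize(deadlift, [low_cut, high_cut])
labels = np.array(["Low", "Medium", "High"])[codes]

counts = {name: int((labels == name).sum()) for name in ["Low", "Medium", "High"]}
print(counts)
```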

Models trained:

  • Logistic Regression
  • Random Forest
  • Gradient Boosting
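Training the three model families side by side could look like the sketch below. The data come from scikit-learn's `make_classification` as a stand-in, and the hyperparameters are assumed defaults rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the Low/Medium/High labels.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

print(scores)
```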

Confusion Matrices

Logistic Regression: img18

Random Forest: img19

Gradient Boosting: img20


Model Evaluation

All models achieved high accuracy, precision, recall, and F1-score.

However:

  • Random Forest made fewer critical misclassifications
  • It showed better separation between High and Low classes
  • It achieved the highest F1-score

Therefore, the Random Forest model was selected as the final classification model.
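The comparison criteria above can be made concrete with a confusion matrix and a macro F1-score. In this sketch, "critical misclassification" is interpreted as a Low athlete predicted High or the reverse. That reading is an assumption, and the labels below are hand-made examples rather than project outputs.

```python
from sklearn.metrics import confusion_matrix, f1_score

labels = ["Low", "Medium", "High"]

# Hand-made example labels for one hypothetical model.
y_true = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High"]
y_pred = ["Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High", "Low"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Count Low<->High confusions (assumed meaning of "critical" errors).
critical = int(cm[0, 2] + cm[2, 0])

# Macro F1 averages the per-class F1-scores with equal weight.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(cm)
print("critical errors:", critical, "macro F1:", round(macro_f1, 3))
```

Running the same computation for each candidate model gives a like-for-like basis for choosing between them.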


Final Model

The winning model was:

Random Forest Classifier

It was trained fully and exported as:

classification_winner.pkl
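Exporting a fitted classifier with `pickle` can be sketched as below. The training data and hyperparameters are placeholders, and the file is written to a temporary directory here so the example leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the full training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 3, size=60)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classification_winner.pkl"
    # Serialize the fitted model to disk.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Read it back to confirm the round trip works.
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = bool(np.array_equal(model.predict(X), restored.predict(X)))
print("round trip preserved predictions:", same)
```

Pickle files can execute arbitrary code when loaded, so only load model files from sources you trust.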


How to Load the Model

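import pickle

# Load the exported Random Forest classifier
with open("classification_winner.pkl", "rb") as f:
    model = pickle.load(f)

# X_sample: feature rows in the same column order used during training
prediction = model.predict(X_sample)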


Conclusion

This project provided several key insights:

  • Weight, height, and body ratio strongly influence deadlift performance
  • Age shows a performance peak followed by decline
  • Deadlift and back squat are closely related
  • Classification models performed extremely well due to clear class separation
  • Random Forest proved to be the most reliable model

This project demonstrates a full machine learning workflow, including:

  • Data exploration
  • Feature engineering
  • Model training
  • Evaluation
  • Model selection
  • Export and deployment

The final Random Forest model offers strong predictive performance and can be used to classify athletes into performance categories based on their physical and strength metrics.