Strength Performance Analysis and Modeling

Overview

This project analyzes a large dataset of athlete strength metrics to understand patterns in deadlift performance and build predictive and classification models.

The work includes:

Exploratory Data Analysis (EDA)
Feature engineering
Regression modeling
Classification modeling
Clustering
Model selection and export

The final goal was to classify athletes into performance categories and evaluate which model performs best.

Dataset

The dataset includes:

Body weight
Height
Age
Strength metrics: deadlift, back squat, snatch

During cleaning, outliers were removed and missing values were handled.
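
The README does not say which cleaning rules were used. As a minimal sketch, assuming median imputation for missing values and a 1.5 × IQR rule for outliers (both are assumptions, as are the column names and the helper `clean_strength_data`), the step could look like this:

```python
import numpy as np
import pandas as pd


def clean_strength_data(df, columns):
    """Fill gaps with the column median, then drop rows outside 1.5 * IQR.

    Hypothetical helper: the project's actual cleaning rules are not documented.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())

    keep = pd.Series(True, index=out.index)
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[keep].reset_index(drop=True)


# Tiny synthetic frame (not the project's data): one missing weight, one implausible lift.
raw = pd.DataFrame({
    "weight": [70.0, 80.0, np.nan, 90.0, 85.0, 75.0],
    "deadlift": [140.0, 160.0, 150.0, 180.0, 2000.0, 155.0],
})
cleaned = clean_strength_data(raw, ["weight", "deadlift"])
print(cleaned)
```

Imputing before filtering keeps rows that only lack a single value; the opposite order is equally defensible.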

Exploratory Data Analysis (EDA)

Average Deadlift by Body Weight

Heavier weight categories generally show higher deadlift performance.

Average Deadlift by Height

Taller athletes tend to lift more, with increasing variance at higher height ranges.

Average Deadlift by Age

Performance peaks around ages 25–34 and gradually decreases afterward.
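
The binned averages in this section can be computed with `pandas.cut` and a group-by. The sketch below uses synthetic rows and hypothetical column names and bin edges; it illustrates the technique only and does not reproduce the project's numbers.

```python
import pandas as pd

# Synthetic rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "age": [19, 23, 27, 31, 36, 42, 48, 55],
    "deadlift": [150, 165, 185, 190, 175, 165, 150, 135],
})

# Hypothetical bin edges; right=False makes each bin [low, high).
age_bins = [18, 25, 35, 45, 60]
age_labels = ["18-24", "25-34", "35-44", "45-59"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels, right=False)

# Mean deadlift per age group.
avg_by_age = df.groupby("age_group", observed=True)["deadlift"].mean()
print(avg_by_age)
```

The same pattern applies to body weight and height by swapping the column and bin edges.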

Body Ratio and Deadlift

Higher strength-to-body-weight ratios correlate with higher deadlift results.
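
The exact definition of the body ratio is not given here. One plausible reading, shown below as an assumption, is deadlift divided by body weight; the column name `deadlift_to_weight` is hypothetical.

```python
import pandas as pd

# Synthetic rows for illustration only.
df = pd.DataFrame({
    "weight": [60.0, 75.0, 90.0, 105.0],
    "deadlift": [120.0, 165.0, 180.0, 200.0],
})

# Assumed feature: kilograms lifted per kilogram of body weight.
df["deadlift_to_weight"] = df["deadlift"] / df["weight"]
print(df)
```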

Strength Metric Correlations

Deadlift and back squat show a strong positive correlation, while snatch is weakly correlated.
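
A correlation matrix like the one described can be produced with `DataFrame.corr`. The data below is synthetic and deliberately built so that two lifts share a common factor while the third does not; the column names are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
shared = rng.normal(0, 1, n)  # common strength factor, synthetic

df = pd.DataFrame({
    "deadlift": 180 + 30 * shared + rng.normal(0, 8, n),
    "backsquat": 150 + 25 * shared + rng.normal(0, 8, n),
    "snatch": 70 + rng.normal(0, 12, n),  # independent of the shared factor
})

# Pairwise Pearson correlations between the three lifts.
corr = df.corr(method="pearson")
print(corr.round(2))
```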

Regression Modeling

A baseline linear regression model was trained to predict deadlift performance.

Actual vs Predicted Deadlift

The predictions follow the general trend, but with visible scatter caused by variability between athletes.
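
A baseline of this kind might be set up as follows. This is a sketch on synthetic data: the feature set (weight, height, age), the 80/20 split, and the generating formula are all assumptions, not the project's configuration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic athletes; the relationship below is invented for illustration.
weight = rng.normal(80, 12, n)
height = rng.normal(175, 8, n)
age = rng.uniform(18, 55, n)
deadlift = 1.5 * weight + 0.4 * height - 0.3 * age + rng.normal(0, 10, n)

X = np.column_stack([weight, height, age])
X_train, X_test, y_train, y_test = train_test_split(
    X, deadlift, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Held-out fit quality; plotting y_test against y_pred gives the actual-vs-predicted view.
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2 = {r2:.2f}, MAE = {mae:.1f}")
```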

Clustering

K-Means clustering was applied to identify athlete groups based on performance metrics.

Cluster Visualization (PCA)

Three clear performance clusters were identified, separating athletes by overall strength level.
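
The clustering and its 2-D projection can be sketched as below. The synthetic data, the standardization step, and the choice of `n_init` are assumptions; only the use of K-Means, three clusters, and PCA for visualization comes from the text above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three synthetic strength levels (deadlift, back squat, snatch), 100 athletes each.
centers = np.array([[120, 100, 45], [170, 140, 65], [220, 185, 85]])
X = np.vstack([c + rng.normal(0, 8, size=(100, 3)) for c in centers])

# Standardize so no single lift dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project to two principal components for a scatter plot colored by cluster label.
coords = PCA(n_components=2).fit_transform(X_scaled)
print(np.bincount(labels))
```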

Classification Modeling

Athletes were categorized into three balanced deadlift performance classes:

Low
Medium
High
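
One common way to obtain balanced classes is to cut the target at its terciles. Whether the project did this is not stated, so the quantile-based split below is an assumption, shown on synthetic values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
deadlift = pd.Series(rng.normal(180, 35, 300), name="deadlift")  # synthetic

# qcut places (roughly) equal numbers of athletes in each class.
performance = pd.qcut(deadlift, q=3, labels=["Low", "Medium", "High"])
print(performance.value_counts())
```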

Models trained:

Logistic Regression
Random Forest
Gradient Boosting
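
Training the three model families side by side might look like the sketch below. The synthetic features, the hyperparameters, the stratified 75/25 split, and the scaling step for logistic regression are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 600

# Synthetic features and a three-class target (0 = Low, 1 = Medium, 2 = High).
X = rng.normal(size=(n, 4))
score = X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 0.3, n)
y = np.digitize(score, np.quantile(score, [1 / 3, 2 / 3]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=7
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=7),
    "Gradient Boosting": GradientBoostingClassifier(random_state=7),
}

# Fit each model and record held-out accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```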

Confusion Matrices

Figures: confusion matrices for Logistic Regression, Random Forest, and Gradient Boosting.

Model Evaluation

All models achieved high accuracy, precision, recall, and F1-score.

However:

Random Forest made fewer critical misclassifications
It showed better separation between High and Low classes
It achieved the highest F1-score

Therefore, the Random Forest model was selected as the final classification model.
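
The selection criteria above can be made concrete with a confusion matrix and a macro-averaged F1-score. In the sketch below, the labels are hand-written toy values, and treating Low/High confusions as the "critical" errors is an assumption about what the comparison measured.

```python
from sklearn.metrics import confusion_matrix, f1_score

labels = ["Low", "Medium", "High"]

# Toy ground truth and predictions, for illustration only.
y_true = ["Low", "Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High", "High"]
y_pred = ["Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High", "Low", "High"]

# Rows are true classes, columns are predicted classes, both in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

# Assumed definition of a critical error: Low predicted as High, or High as Low.
critical = int(cm[0, 2] + cm[2, 0])

print(cm)
print(f"macro F1 = {macro_f1:.3f}, critical errors = {critical}")
```

Computing these two numbers for each candidate on the same held-out split gives a like-for-like comparison.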

Final Model

The winning model was:

Random Forest Classifier

It was trained fully and exported as:

classification_winner.pkl
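
For reference, an export along these lines can be done with `pickle.dump`. The sketch below trains a stand-in model on synthetic data and writes it to a temporary directory; the training data, hyperparameters, and output location are assumptions, and only the file name comes from this README.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in training data (synthetic); three classes derived from the first feature.
X = rng.normal(size=(150, 3))
y = np.digitize(X[:, 0], [-0.5, 0.5])

final_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Write to a temporary directory so the sketch leaves no files behind in the repo.
out_path = Path(tempfile.mkdtemp()) / "classification_winner.pkl"
with open(out_path, "wb") as f:
    pickle.dump(final_model, f)

# Round-trip check: the restored model should reproduce the original predictions.
with open(out_path, "rb") as f:
    restored = pickle.load(f)
same = bool(np.array_equal(restored.predict(X), final_model.predict(X)))
print(out_path.name, same)
```

A pickled scikit-learn model is generally only safe to load with the same library version that produced it.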

How to Load the Model

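import pickle

# Load the exported classifier (only unpickle files from a trusted source).
with open("classification_winner.pkl", "rb") as f:
    model = pickle.load(f)

# X_sample must contain the same feature columns, in the same order, used for training.
prediction = model.predict(X_sample)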

## Conclusion

This project provided several key insights:

- Body weight, height, and body ratio are strongly associated with deadlift performance
- Age shows a performance peak followed by decline
- Deadlift and back squat are closely related
- Classification models performed extremely well due to clear class separation
- Random Forest proved to be the most reliable model

This project demonstrates a full machine learning workflow, including:

- Data exploration
- Feature engineering
- Model training
- Evaluation
- Model selection
- Export and deployment

The final Random Forest model offers strong predictive performance and can be used to classify athletes into performance categories based on their physical and strength metrics.