# Strength Performance Analysis and Modeling

## Overview

This project explores a dataset of athlete strength metrics to understand patterns in deadlift performance and to build models that can predict and classify athletes based on strength.

The workflow includes:

- Exploratory Data Analysis (EDA)
- Feature engineering
- Regression models
- Classification models
- Clustering
- Model selection and export

The final objective was to classify athletes into performance categories and evaluate which model performs best.

---

## Dataset

The dataset contains:

- Body weight
- Height
- Age
- Strength metrics: deadlift, back squat, snatch

After cleaning:

- Duplicate rows were removed
- Placeholder values were replaced
- Unrealistic values were filtered
- Missing key fields were dropped

---

## Exploratory Data Analysis (EDA)

### Average Deadlift by Body Weight

![img11](img11.png)

Heavier weight groups generally show higher deadlift performance.

### Average Deadlift by Height

![img12](img12.png)

Taller athletes tend to lift more, with higher variability at the upper height ranges.

### Average Deadlift by Age

![img13](img13.png)

Performance peaks around ages 25–34 and gradually declines afterward.

### Body Ratio and Deadlift

![img14](img14.png)

Higher weight-to-height ratios are associated with stronger lifts.

### Strength Metric Correlations

![img15](img15.png)

Deadlift and back squat show a strong positive correlation, while snatch is only weakly related.

---

## Regression Modeling

A baseline linear regression model was trained to predict deadlift performance.

### Actual vs Predicted Deadlift

![img16](img16.png)

The model follows the general trend but shows noise due to differences between athletes.

---

## Clustering

K-Means clustering was used to group athletes based on strength metrics.

### Cluster Visualization (PCA)

![img17](img17.png)

Three performance clusters were identified, separating athletes by overall strength level.

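
The clustering step described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the column order (deadlift, back squat, snatch) and the standardization step are assumptions, and only the choice of K-Means with three clusters and a PCA projection comes from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (hypothetical values):
# columns are deadlift, back squat, snatch for three rough strength levels.
rng = np.random.default_rng(0)
centers = np.array([
    [100.0, 80.0, 40.0],
    [160.0, 130.0, 65.0],
    [220.0, 180.0, 90.0],
])
X = np.vstack([rng.normal(loc=c, scale=8.0, size=(50, 3)) for c in centers])

# Standardize so that no single lift dominates the distance computation
# (an assumed preprocessing step).
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with three clusters, matching the number reported above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project onto two principal components for a 2D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("cluster sizes:", np.bincount(labels))
print("projected shape:", X_2d.shape)
```

Plotting `X_2d` colored by `labels` would give a figure comparable in spirit to the PCA visualization above. Fixing `random_state` keeps the cluster assignment reproducible between runs.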

---

## Classification Modeling

Athletes were grouped into three balanced performance classes:

- Low
- Medium
- High

Models trained:

- Logistic Regression
- Random Forest
- Gradient Boosting

### Confusion Matrices

Logistic Regression:

![img18](img18.png)

Random Forest:

![img19](img19.png)

Gradient Boosting:

![img20](img20.png)

---

## Model Evaluation

All models performed well across accuracy, precision, recall, and F1-score.

Random Forest stood out because it:

- Made fewer major misclassifications
- Separated high and low performers better
- Achieved the highest F1-score

---

## Final Model

The final selected model: **Random Forest Classifier**

It was trained on the full dataset and exported as:

`best_classifier.pkl`

---

## How to Load the Model

```python
import pickle

# Only unpickle files from a trusted source: pickle can execute arbitrary code on load.
with open("best_classifier.pkl", "rb") as f:
    model = pickle.load(f)

# X_sample must be a 2D array or DataFrame with the same feature columns,
# in the same order, that the model was trained on.
prediction = model.predict(X_sample)
```

---

## Conclusion

This project provided several key insights:

- Weight, height, and body ratio are strongly associated with deadlift performance
- Performance peaks around ages 25–34 and declines afterward
- Deadlift and back squat are closely related
- Classification models performed very well due to clear class separation
- Random Forest proved to be the most reliable model

The work demonstrates a full machine learning workflow, including:

- Data exploration
- Feature engineering
- Model training
- Evaluation
- Model selection
- Export

The final Random Forest model delivers strong performance and can be used to classify athletes into strength categories based on their physical and strength metrics.

[![Project Presentation](https://img.youtube.com/vi/R0YGueMVqko/0.jpg)](https://youtu.be/R0YGueMVqko)
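
For readers who want to reproduce the kind of comparison described under Model Evaluation, the sketch below trains the three model families on synthetic data and ranks them by macro F1. Everything data-related here is an assumption made for illustration: the generated values, the two features used, and the rule that derives the Low / Medium / High labels from deadlift tertiles. Scores printed by this sketch say nothing about the project's real results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic athletes (hypothetical values, not the project's dataset).
rng = np.random.default_rng(42)
n = 600
body_weight = rng.normal(80.0, 12.0, n)
back_squat = 1.5 * body_weight + rng.normal(0.0, 15.0, n)
deadlift = 1.1 * back_squat + rng.normal(0.0, 10.0, n)

# Balanced labels from deadlift tertiles (assumed labeling rule):
# 0 = Low, 1 = Medium, 2 = High.
cuts = np.quantile(deadlift, [1 / 3, 2 / 3])
y = np.digitize(deadlift, cuts)

# Assumed feature set: body weight and back squat.
X = np.column_stack([body_weight, back_squat])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="macro")

# Print models from best to worst macro F1.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: macro F1 = {score:.3f}")
```

Macro averaging weights the three classes equally, which suits balanced classes like the ones described above. A stratified split keeps that balance in both the training and test sets.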