Tomertg/Gradient_Boosting


Tomertg committed on Nov 25, 2025

Commit f941c6b (verified)

1 Parent(s): e1364cb

Update README.md


README.md +38 -33

@@ -2,31 +2,36 @@
 ## Overview
-This project analyzes a large dataset of athlete strength metrics to understand patterns in deadlift performance and build predictive and classification models.
-The work includes:
 - Exploratory Data Analysis (EDA)
 - Feature engineering
-- Regression modeling
-- Classification modeling
 - Clustering
 - Model selection and export
-The final goal was to classify athletes into performance categories and evaluate which model performs best.
 ---
 ## Dataset
-The dataset includes:
 - Body weight
 - Height
 - Age
 - Strength metrics: deadlift, back squat, snatch
-After cleaning, outliers were removed and missing values handled.
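As a rough illustration of this kind of cleaning, the sketch below drops rows with missing key fields and filters implausible values. The field names, sample rows, and plausibility bounds are all hypothetical; the README does not state the thresholds that were actually used.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"weight_kg": 80, "height_cm": 180, "age": 28, "deadlift_kg": 180},
    {"weight_kg": 75, "height_cm": 175, "age": 31, "deadlift_kg": None},
    {"weight_kg": 90, "height_cm": 185, "age": 25, "deadlift_kg": 5000},
    {"weight_kg": 68, "height_cm": 170, "age": 40, "deadlift_kg": 140},
]

# Assumed inclusive plausibility bounds; placeholders, not the project's real cutoffs.
BOUNDS = {
    "weight_kg": (30, 250),
    "height_cm": (120, 230),
    "age": (14, 90),
    "deadlift_kg": (20, 500),
}


def is_clean(row):
    """True if every key field is present and within its assumed bounds."""
    for field, (low, high) in BOUNDS.items():
        value = row.get(field)
        if value is None or not low <= value <= high:
            return False
    return True


cleaned = [row for row in rows if is_clean(row)]
print(len(cleaned))
```

Keeping the bounds in one dictionary makes the filtering rule easy to audit and adjust once the real cutoffs are known.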
 ---
@@ -35,27 +40,27 @@ After cleaning, outliers were removed and missing values handled.
 ### Average Deadlift by Body Weight
 ![img11](img11.png)
-Heavier weight categories generally show higher deadlift performance.
 ### Average Deadlift by Height
 ![img12](img12.png)
-Taller athletes tend to lift more, with increasing variance at higher height ranges.
 ### Average Deadlift by Age
 ![img13](img13.png)
-Performance peaks around ages 25–34 and gradually decreases afterward.
 ### Body Ratio and Deadlift
 ![img14](img14.png)
-Higher strength-to-body weight ratios correlate with higher deadlift results.
 ### Strength Metric Correlations
 ![img15](img15.png)
-Deadlift and back squat show a strong positive correlation, while snatch is weakly correlated.
 ---
@@ -66,24 +71,24 @@ A baseline linear regression model was trained to predict deadlift performance.
 ### Actual vs Predicted Deadlift
 ![img16](img16.png)
-The model follows the general trend but shows noise due to variability between athletes.
 ---
 ## Clustering
-K-Means clustering was applied to identify athlete groups based on performance metrics.
 ### Cluster Visualization (PCA)
 ![img17](img17.png)
-Three clear performance clusters were identified, separating athletes by overall strength level.
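A minimal sketch of how this kind of grouping works, assuming plain Lloyd's-algorithm K-Means on a handful of made-up (deadlift, back squat) pairs. The `kmeans` function, the sample values, and the seed are hypothetical; the project's actual features, scaling, and library are not shown in this README, and the PCA projection used for the plot is omitted here.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical (deadlift, back squat) values in kg, not taken from the dataset.
athletes = [
    (100, 80), (105, 85), (110, 90),
    (160, 130), (165, 135), (170, 140),
    (220, 180), (225, 185), (230, 190),
]
centroids, labels = kmeans(athletes, k=3)
print(sorted(centroids))
```

K-Means can settle into different groupings depending on the initial centroids, so in practice it is common to run several initializations and keep the result with the lowest within-cluster distance.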
 ---
 ## Classification Modeling
-Athletes were categorized into three balanced deadlift performance classes:
 - Low
 - Medium
@@ -110,27 +115,25 @@ Gradient Boosting:
 ## Model Evaluation
-All models achieved high accuracy, precision, recall, and F1-score.
-However:
-- Random Forest made fewer critical misclassifications
-- It showed better separation between High and Low classes
-- It achieved the highest F1-score
-Therefore, the Random Forest model was selected as the final classification model.
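To make these comparison criteria concrete, here is a small sketch that computes per-class precision, recall, and F1, a macro-averaged F1, and a count of Low/High confusions. The label lists and the `per_class_scores` helper are hypothetical and do not reproduce the project's results. The README does not say how F1 was averaged or what counts as a critical misclassification, so macro averaging and the Low/High definition are assumptions.

```python
def per_class_scores(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} from two label lists."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


classes = ["Low", "Medium", "High"]
# Made-up labels for illustration only.
y_true = ["Low", "Low", "Medium", "Medium", "High", "High"]
y_pred = ["Low", "Medium", "Medium", "Medium", "High", "Low"]

scores = per_class_scores(y_true, y_pred, classes)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(classes)

# Assumed definition: a "critical" error confuses Low with High in either direction.
critical = sum({t, p} == {"Low", "High"} for t, p in zip(y_true, y_pred))
print(scores, round(macro_f1, 3), critical)
```

Two models with similar accuracy can differ noticeably in the number of Low/High confusions, which is why that count is tracked separately from the aggregate scores.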
 ---
 ## Final Model
-The winning model was:
-Random Forest Classifier
-It was trained fully and exported as:
-`classification_winner.pkl`
 ---
@@ -139,28 +142,30 @@ It was trained fully and exported as:
 ```python
 import pickle
-with open("classification_winner.pkl", "rb") as f:
     model = pickle.load(f)
 prediction = model.predict(X_sample)
 ```
 ## Conclusion
 This project provided several key insights:
 - Weight, height, and body ratio strongly influence deadlift performance
-- Age shows a performance peak followed by decline
 - Deadlift and back squat are closely related
-- Classification models performed extremely well due to clear class separation
 - Random Forest proved to be the most reliable model
-This project demonstrates a full machine learning workflow, including:
 - Data exploration
 - Feature engineering
 - Model training
 - Evaluation
 - Model selection
-- Export and deployment
-The final Random Forest model offers strong predictive performance and can be used to classify athletes into performance categories based on their physical and strength metrics.

 ## Overview
+This project explores a dataset of athlete strength metrics to understand patterns in deadlift performance and to build models that can predict and classify athletes based on strength.
+The workflow includes:
 - Exploratory Data Analysis (EDA)
 - Feature engineering
+- Regression models
+- Classification models
 - Clustering
 - Model selection and export
+The final objective was to classify athletes into performance categories and evaluate which model performs best.
 ---
 ## Dataset
+The dataset contains:
 - Body weight
 - Height
 - Age
 - Strength metrics: deadlift, back squat, snatch
+After cleaning:
+- Duplicate rows were removed
+- Placeholder values were replaced
+- Unrealistic values were filtered
+- Missing key fields were dropped
 ---
 ### Average Deadlift by Body Weight
 ![img11](img11.png)
+Heavier weight groups generally show higher deadlift performance.
 ### Average Deadlift by Height
 ![img12](img12.png)
+Taller athletes tend to lift more, with higher variability at the upper height ranges.
 ### Average Deadlift by Age
 ![img13](img13.png)
+Performance peaks around ages 25–34 and gradually declines afterward.
 ### Body Ratio and Deadlift
 ![img14](img14.png)
+Higher weight-to-height ratios are associated with stronger lifts.
 ### Strength Metric Correlations
 ![img15](img15.png)
+Deadlift and back squat show a strong positive correlation, while snatch is only weakly related.
 ---
 ### Actual vs Predicted Deadlift
 ![img16](img16.png)
+The model follows the general trend but shows noise due to differences between athletes.
 ---
 ## Clustering
+K-Means clustering was used to group athletes based on strength metrics.
 ### Cluster Visualization (PCA)
 ![img17](img17.png)
+Three performance clusters were identified, separating athletes by overall strength level.
 ---
 ## Classification Modeling
+Athletes were grouped into three balanced performance classes:
 - Low
 - Medium
 ## Model Evaluation
+All models performed well across accuracy, precision, recall, and F1-score.
+Random Forest stood out because it:
+- Made fewer major misclassifications
+- Separated high and low performers better
+- Achieved the highest F1-score
 ---
 ## Final Model
+The final selected model:
+**Random Forest Classifier**
+It was trained on the full dataset and exported as:
+`best_classifier.pkl`
 ---
 ```python
 import pickle
+with open("best_classifier.pkl", "rb") as f:
     model = pickle.load(f)
 prediction = model.predict(X_sample)
 ```
 ## Conclusion
 This project provided several key insights:
 - Weight, height, and body ratio strongly influence deadlift performance
+- Performance peaks in the late 20s and declines afterward
 - Deadlift and back squat are closely related
+- Classification models performed very well due to clear class separation
 - Random Forest proved to be the most reliable model
+The work demonstrates a full machine learning workflow, including:
 - Data exploration
 - Feature engineering
 - Model training
 - Evaluation
 - Model selection
+- Export
+The final Random Forest model delivers strong performance and can be used to classify athletes into strength categories based on their physical and strength metrics.