🏀 Basketball Player Performance: Prediction & Classification Pipeline

Course: Introduction to Data Science | Reichman University

🎥 Project Presentation

▶️ Click here to watch the full project presentation on Loom

📌 Project Overview

This project applies machine learning to basketball player performance. The pipeline covers Exploratory Data Analysis (EDA), unsupervised clustering, and supervised learning for both regression and classification tasks.

🚀 Links & Resources

Google Colab Notebook: Click here to view the full code
Model Repository: Hugging Face Project Page

📊 Exploratory Data Analysis (EDA)

The EDA examined how player attributes relate to scoring output, focusing on scoring trends by age and on playmaking versus scoring.

Figures: Scoring Trends by Age | Playmaking vs. Scoring

🛠️ Feature Engineering & Clustering

We used K-Means Clustering to identify player archetypes based on physical metrics. To visualize these high-dimensional clusters, we applied PCA (Principal Component Analysis).

Figure: Cluster Visualization (PCA)

The plot shows how players are grouped into distinct physical profiles, which were later used as features in our predictive models.
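The clustering step described above can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the notebook's actual code: the synthetic height, weight, and wingspan columns, the choice of three clusters, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for physical metrics (height cm, weight kg, wingspan cm).
# Three loose groups so the clustering has something to find.
centers = np.array([[185.0, 82.0, 190.0],
                    [198.0, 95.0, 205.0],
                    [211.0, 112.0, 220.0]])
physical = np.vstack([c + rng.normal(scale=3.0, size=(50, 3)) for c in centers])

# Scale first: K-Means relies on Euclidean distance, so columns measured
# in larger units would otherwise dominate the result.
scaled = StandardScaler().fit_transform(physical)

# Assumed k=3; the real project may have chosen a different number of clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(scaled)

# PCA is used for plotting only; the clustering ran in the full feature space.
pca = PCA(n_components=2)
coords_2d = pca.fit_transform(scaled)

# Append the cluster label as an extra column for downstream supervised models.
features_with_cluster = np.column_stack([physical, cluster_labels])

print(coords_2d.shape, features_with_cluster.shape)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```

Plotting `coords_2d` colored by `cluster_labels` gives a 2D view comparable to the figure, and `features_with_cluster` shows one simple way a cluster assignment can be passed on as a model feature.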

🤖 Modeling & Evaluation

1. Regression (Predicting Points Per Game)

We evaluated several models for predicting points per game, a continuous target. Gradient Boosting, the selected model, clearly outperformed the Linear Regression baseline.

Feature Importance Comparison:

Figures: Random Forest | Gradient Boosting
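A comparison of this kind might look like the sketch below. It uses synthetic data with a deliberately non-linear target, so the scores say nothing about the project's real results; the feature names, hyperparameters, and the choice of R^2 as the metric are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical feature names; the target is non-linear in the first two
# features and ignores the third.
feature_names = ["feature_a", "feature_b", "feature_c"]
X = rng.uniform(-2.0, 2.0, size=(600, 3))
y = 3.0 * X[:, 0] ** 2 + 2.0 * np.abs(X[:, 1]) + rng.normal(scale=0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "linear_baseline": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and score it on the held-out split.
r2_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    r2_scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R^2 = {r2_scores[name]:.3f}")

# Impurity-based importances for the two tree ensembles, highest first.
for name in ("random_forest", "gradient_boosting"):
    importances = models[name].feature_importances_
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    print(name, [(feat, round(float(imp), 3)) for feat, imp in ranked])
```

Impurity-based importances are quick to read but can favor high-cardinality features, so comparing the rankings of two different ensembles, as the figures here do, is a reasonable sanity check.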

2. Classification (Performance Tiers)

We reframed the problem as classifying players into three tiers (Low, Medium, High). Because a "Draft Bust" (a player predicted to be a top performer who turns out not to be) is costly in scouting, we optimized for Precision, which penalizes False Positives.

Winning Model Confusion Matrix: The matrix shows that the model reliably identifies 'Star' players (Class 2, the High tier) while keeping False Positives low.
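To make the precision framing concrete, here is a small sketch of a three-tier classifier evaluated with a confusion matrix and per-class precision. The synthetic data, the tertile-based tier boundaries, and the use of `GradientBoostingClassifier` are assumptions; the source does not state how the tiers were cut or which classifier won.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic features and a synthetic "points" target driven by two of them.
X = rng.normal(size=(900, 4))
points = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=900)

# Assumed tiering: tertiles of the target, 0 = Low, 1 = Medium, 2 = High.
cuts = np.quantile(points, [1 / 3, 2 / 3])
tiers = np.digitize(points, cuts)

X_train, X_test, y_train, y_test = train_test_split(
    X, tiers, test_size=0.25, stratify=tiers, random_state=7
)

clf = GradientBoostingClassifier(random_state=7).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true tiers, columns are predicted tiers.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
per_class_precision = precision_score(
    y_test, y_pred, labels=[0, 1, 2], average=None, zero_division=0
)

# Precision for the High tier: of all players predicted High,
# the fraction that truly are High (column 2 of the matrix).
high_tier_precision = cm[2, 2] / cm[:, 2].sum()

print(cm)
print("per-class precision:", per_class_precision.round(3))
print("High-tier precision from matrix:", round(float(high_tier_precision), 3))
```

Reading precision off the third column of the matrix makes the scouting interpretation explicit: every off-diagonal entry in that column is a player flagged as High who belongs to a lower tier.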

💡 Key Conclusions

Feature Engineering is Crucial: Unsupervised learning (K-Means) successfully uncovered hidden physical archetypes, which proved to be strong predictors in our supervised models.
Non-Linearity Matters: Tree-based models (Gradient Boosting and Random Forest) significantly outperformed the baseline Linear Regression, highlighting the complex, non-linear relationships between a player's physical attributes, playstyle, and scoring output.
Business-Driven Metrics: In sports analytics, framing the problem around real-world business needs (e.g., prioritizing Precision to avoid costly draft busts) is just as important as the model's overall accuracy.

📂 Repository Contents

winning_model_gb.pkl: Final regression model.
winning_classifier_model.pkl: Final classification model.
*.png: All project visualizations.
README.md: Project documentation and presentation.
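Loading one of the `.pkl` artifacts might look like the sketch below. It assumes the files were written with Python's standard `pickle` module, which the source does not confirm; if they were saved with `joblib`, `joblib.load` would be the matching call. So that the example runs on its own, a stand-in model is pickled to a temporary file under the same filename instead of using the real artifact.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in model so the sketch runs without downloading the real file.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=100)
stand_in = GradientBoostingRegressor(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # Same filename as the repository artifact, but written locally here.
    model_path = Path(tmp) / "winning_model_gb.pkl"
    model_path.write_bytes(pickle.dumps(stand_in))

    # Loading step: this is the part that would apply to the real file.
    with model_path.open("rb") as fh:
        loaded = pickle.load(fh)

# The input must have the same columns, in the same order, as at training time.
predictions = loaded.predict(X[:5])
print(predictions)
```

Pickle files can execute arbitrary code when loaded, so only unpickle artifacts from a source you trust, and use a scikit-learn version compatible with the one that produced the file.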

Developed by Nir Missri
