
Final Project - ATP Tennis Prediction

Video Presentation Link: https://www.youtube.com/watch?v=oQHF4wT8Y-o

ATP Tennis Match Prediction: Full ML Pipeline

This project builds a complete machine learning workflow to predict competitive outcomes on the ATP Tennis Tour. It includes data cleaning, exploratory data analysis (EDA), advanced feature engineering, clustering, visualization, regression models, classification models, and full performance evaluation.

Part 0: Initial Research Questions (EDA)

Before any modeling, I asked a few basic questions about the dataset:

1) What is the relationship between official player rank/points and the expected competitive gap? A clear positive trend exists: the difference in player ranking is the dominant factor in predicting the outcome.

2) Is there a strong relationship between the match surface and the outcome? No single strong pattern emerged, but volatility varied across surfaces (Hard, Clay, Grass), confirming the need to include Surface as a feature.

These EDA steps helped build intuition before moving into modeling.

Main ML Research Questions

1) Can we accurately predict the continuous competitive gap (point difference) using pre-match data? We test multiple regression models (Random Forest, Gradient Boosting) and evaluate how well different features explain the outcome.

2) Which features have the strongest impact on the match outcome? We explore the importance of official rankings, match environment, and our new Cluster ID.

3) Can we reliably classify matches into 'Easy Win' vs. 'Hard Fought' categories? We convert the outcome into three balanced categorical groups and apply classification models.

4) Do clustering and unsupervised learning reveal meaningful structure in the dataset? We use K-Means to explore hidden groups and natural segmentation of matches.

Part 1: Dataset & Basic Cleaning

  • Dataset: ATP Tennis Match Data.
  • Target variable: Target_Pts_Diff (Continuous).

Basic Cleaning:

  • Applied Standardization (StandardScaler) to numeric features like player ranks.
  • Converted categorical fields (Surface, Round) using One-Hot Encoding.
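A minimal sketch of these two cleaning steps, using a toy DataFrame (the column names follow the text above; the values are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the ATP dataset.
df = pd.DataFrame({
    "Rank_1": [1, 50, 120, 8],
    "Rank_2": [30, 2, 95, 200],
    "Surface": ["Hard", "Clay", "Grass", "Hard"],
    "Round": ["Final", "1st Round", "Semifinal", "Quarterfinal"],
})

preprocess = ColumnTransformer([
    # Standardize numeric ranks to mean 0 / std 1.
    ("num", StandardScaler(), ["Rank_1", "Rank_2"]),
    # One-hot encode the categorical match context.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Surface", "Round"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 9): 2 scaled columns + 3 Surface + 4 Round one-hot columns
```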

Critical Leakage Handling: A key step was the identification and correction of a severe data leakage issue. The raw rank points (Pts_1 and Pts_2) were highly correlated with the target because the target is derived from them. These variables were permanently removed from the predictor set, producing a smaller but more reliable feature set.
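The leakage fix amounts to dropping the leaky columns before building the feature matrix; a tiny sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({
    "Rank_1": [1, 50], "Rank_2": [30, 2],
    "Pts_1": [11000, 900], "Pts_2": [2000, 10500],
    "Target_Pts_Diff": [9000, -9600],
})

# Pts_1/Pts_2 essentially encode the target, so they are excluded from the predictors.
leaky_cols = ["Pts_1", "Pts_2"]
X = df.drop(columns=leaky_cols + ["Target_Pts_Diff"])
y = df["Target_Pts_Diff"]
print(list(X.columns))  # ['Rank_1', 'Rank_2']
```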

Part 2: Initial EDA (Before Any Model)

As part of the data cleaning phase, we performed several operations to prepare the dataset. First, we defined the target variable for our regression as the difference in rating points (Target_Pts_Diff). Second, we handled the date column by converting its format and creating a Year attribute. Finally, we removed duplicate rows to ensure clean data for the model, and a check for missing values (nulls) confirmed the current state of the data.

Question: What is the target variable's distribution, and are there outliers?

[image]

Answer: The distribution shows most matches are close (median near 0), but many outliers exist in both directions, indicating extreme rank differences (16-516).

Question: Does the rank points structure confirm the need for outlier handling?

[image]

Answer: The plot confirms the pyramid structure of tennis rankings. Numerous upward outliers (up to 16,950 points) show that elite players are the outliers, which reinforces the decision not to remove them.

Question: What are the strongest linear predictors of the target variable?

[image]

Answer: The heatmap shows that rank points are key: Pts_1 (+0.66) and Pts_2 (-0.65) are the strongest predictors. This symmetrical, opposite relationship justifies starting with a linear model.

Part 3: Baseline Regression (Before Feature Engineering)

Goal : Build a simple baseline model that predicts the match's point difference (Target_Pts_Diff) using only a few basic features:

  • Normalized Rank_1
  • Normalized Rank_2
  • Encoded Surface

Model: Linear Regression on the basic features. Train/test split: 80% train / 20% test.
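A minimal sketch of this baseline setup on synthetic data (the feature matrix and target here are invented stand-ins, not the real ATP columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: target loosely driven by a "rank gap" plus noise.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))  # toy columns standing in for Rank_1, Rank_2, Surface
y = 800 * (X[:, 1] - X[:, 0]) + rng.normal(scale=400, size=n)

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}")
```

RMSE penalizes large errors more heavily than MAE, which is why the gap between the two is read below as a sign of trouble with extreme values.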

Baseline Regression Results, using only the basic features:

  • MAE ≈ 1008.65
  • RMSE ≈ 1740.09
  • R² ≈ 0.494

The linear regression model was used to create a reliable baseline, using only the initial features. The model performed reasonably well, explaining approximately 49.4% of the variance in the target variable. However, the key metric is the prediction error: an RMSE of 1740.09 against an MAE of 1008.65 shows that the model struggles with extreme values and the nonlinear complexity of the tennis data. This motivates extensive feature engineering and a move to advanced models such as Random Forest to significantly improve predictive power.

Question: Does the current linear model correctly identify the magnitude of influence of the rank features?

[image]

Answer: The bar chart shows that Best of (number of sets) has the highest coefficient. This is misleading because the numerical features have not been scaled: the large coefficient on Best of compensates for the small range of its values (3 or 5 sets), while the truly important predictors, the ranks, have their coefficients shrunk by their huge range.

Question: Does the model succeed in predicting difficult matches (extreme point differences)?

[image]

Answer: The scatter plot (actual vs. predicted) shows that the model fails in the extreme ranges. The points fall far from the ideal prediction line (red line), which reflects the low R² score (0.49) and the model's weakness at high values.
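The scaling effect described above can be demonstrated on synthetic data: a wide-range "rank" feature and a narrow-range "best-of" feature get incomparable raw coefficients, and only after standardization does the rank feature's dominance show. All names and effect sizes here are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
rank = rng.uniform(1, 1000, size=n)   # wide range, like ATP ranks
best_of = rng.choice([3, 5], size=n)  # tiny range (3 or 5 sets)
# Rank is the dominant driver; best_of has a much weaker effect.
y = -2.0 * rank + 100.0 * best_of + rng.normal(scale=50, size=n)

X = np.column_stack([rank, best_of])

raw = LinearRegression().fit(X, y)
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

# Raw coefficients are per original unit, so best_of's small range inflates it;
# scaled coefficients are per standard deviation and reveal rank's dominance.
print("raw:   ", raw.coef_)
print("scaled:", scaled.coef_)
```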

Part 4: Feature Engineering (Upgrading the Dataset)

To improve model performance, several new features were engineered:

Clustering-Based Features

  • K-Means Clustering was applied to match characteristics (surface, round, etc.) to segment the dataset into distinct match types.
  • Cluster ID was created as a new feature assigned to each match, indicating its type.

[image]

This visualization shows the clusters produced by the clustering algorithm, after reducing dimensionality with principal component analysis (PCA). The two axes (principal components 1 and 2) together capture a large share of the variance in the data. The colored clusters show five distinct match groups; each group's label was attached as a new Cluster_ID feature to every row of data, which noticeably enhanced the predictive power of our models.
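A minimal sketch of this pipeline (standardize, cluster with K-Means into five groups, project to 2-D with PCA for the plot), on a random stand-in matrix rather than the real encoded match features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 6))  # stand-in for the encoded match features

X_std = StandardScaler().fit_transform(X)

# Five clusters, matching the five match groups described above.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
cluster_id = kmeans.fit_predict(X_std)  # becomes the new Cluster_ID feature

# 2-D projection used for the cluster scatter plot.
coords = PCA(n_components=2).fit_transform(X_std)
print(coords.shape, sorted(set(cluster_id)))
```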

Scaling Numerical Features

  • Used StandardScaler to standardize numeric columns (normalized player ranks).
  • Why Scaling? Scaling prevents high-magnitude features (even normalized ranks) from dominating the model training process, especially in the distance calculations required for K-Means clustering.
  • Each feature was transformed to have:
    • mean ≈ 0
    • standard deviation ≈ 1

After the data was processed and improved, we split it into a training set (80% of the data) and a test set (20% of the data). This step is critical in machine learning: we train the model only on the training set, then use the test set to evaluate how well the model generalizes to new, unseen data.
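A quick sketch verifying the scaling property and performing the 80/20 split (toy rank-like data, not the real feature matrix):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.uniform(1, 2000, size=(1000, 2))  # raw rank-like columns
y = X[:, 0] - X[:, 1]

# After StandardScaler, each column has mean ~0 and standard deviation ~1.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 800 200
```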

Part 5: Train and Evaluate Improved Models

Regression Models Performance Summary:

| Model | R² | RMSE | MAE | Time (s) |
|---|---|---|---|---|
| Random Forest Regressor | 0.981294 | 334.655182 | 150.477854 | 10.48331 |
| Gradient Boosting Regressor | 0.972423 | 406.332487 | 199.588539 | 2.174265 |
| Linear Regression (Improved) | 0.494262 | 1740.086199 | 1008.653804 | 0.010618 |

In this step, we trained and evaluated three different regression models: improved linear regression, random forest, and gradient boosting. All models were trained on the improved dataset (which includes the new clustering feature) and evaluated on the test data. We used R² (explained variance) together with RMSE and MAE (error measures) to compare their performance. The goal of this step is to select the model with the highest R² score and the lowest error values as the winning model for predicting the point difference. The Random Forest Regressor is the winning model (R² = 0.9813).
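The comparison loop can be sketched as follows, on a synthetic nonlinear target (invented data; the point is that tree ensembles beat a linear model when the relationship is nonlinear):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(800, 5))
# Nonlinear target: a linear model cannot capture the squared term.
y = 300 * X[:, 0] ** 2 + 300 * X[:, 1] + rng.normal(scale=50, size=800)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te)) for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```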

[image]

Top 5 Most Important Features:

| Feature | Importance |
|---|---|
| Rank_1 | 0.435930 |
| Rank_2 | 0.411957 |
| Cluster_ID | 0.102482 |
| Year | 0.017833 |
| Odd_2 | 0.017577 |

The model chosen as the winner for the regression problem is the Random Forest Regressor, which achieved the highest R² score (approximately 0.98) and the lowest errors. Ensemble models significantly outperform linear regression because they can capture nonlinear relationships in the data. Feature importance analysis showed that the player rankings are the decisive features; more importantly, the new feature we created, Cluster_ID, was ranked third, confirming the success of our feature engineering.
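Extracting such an importance ranking from a fitted forest is one attribute lookup; a sketch on invented data whose column names mirror the table above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
cols = ["Rank_1", "Rank_2", "Cluster_ID", "Year", "Odd_2"]  # names from the table above
X = pd.DataFrame(rng.normal(size=(500, 5)), columns=cols)
# Invented relationship: the two rank columns dominate, Cluster_ID helps a little.
y = 3 * X["Rank_1"] - 3 * X["Rank_2"] + 0.5 * X["Cluster_ID"] + rng.normal(scale=0.5, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1 across all features.
importance = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importance)
```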

Part 6: Winning Model

The trained winning model is published on the Hugging Face Hub: https://huggingface.co/galcomis/ATP_tennis_regressor/blob/main/random_forest_regressor_atp.pkl

Part 7: Regression-to-Classification

Before classification modeling could begin, the continuous target variable (Target_Pts_Diff) had to be converted into discrete categories. We implemented a quantile binning strategy to create three balanced classes: 'Hard' (close match), 'Medium', and 'Easy' (dominant win). The logic was carefully checked to ensure that the largest point differences are assigned to the 'Easy' class and the smallest to 'Hard'. The binning was based on the 33% and 66% quantiles of the data, giving boundaries of scores less than or equal to -382.87 for 'Hard' and scores greater than 356.00 for 'Easy'. Creating this new Target_Category feature finalized the data preparation for the supervised classification models.
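Quantile binning of this kind maps directly onto pandas' `qcut`; a sketch on synthetic point differences (the real cut points -382.87 / 356.00 come from the actual data, not from this toy sample):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
diff = pd.Series(rng.normal(scale=1000, size=900))  # stand-in for Target_Pts_Diff

# Three near-equal-frequency bins cut at the 33% / 66% quantiles.
# Smallest differences -> 'Hard' (close match), largest -> 'Easy' (dominant win).
target_cat = pd.qcut(diff, q=[0, 0.33, 0.66, 1.0], labels=["Hard", "Medium", "Easy"])
print(target_cat.value_counts(normalize=True).round(2))
```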

New Target Distribution (Train Set):

| Target_Category | Proportion |
|---|---|
| Easy | 0.339889 |
| Medium | 0.330080 |
| Hard | 0.330031 |

After creating the new classes, we checked their balance. The output confirms that using percentiles together with stratified splitting achieved a near-perfect balance between the three categories ('Easy', 'Medium', 'Hard'): each class represents about a third of the data. This balance is critical: because the classes are balanced, we can rely on accuracy as our main evaluation metric when comparing classification models, without having to depend solely on metrics like the F1 score.

Part 8: Train & Evaluate Classification Models

In the task of predicting the intensity of victory (Easy, Medium, Hard), the main objective is a reliable, balanced prediction across all categories. Therefore, the most important metric for evaluating our models is the F1 score, the harmonic mean of precision (when the model makes a prediction, it is correct) and recall (the model does not miss relevant cases). If forced to choose, precision is generally more important when predicting dominance ('Easy'), as we want to be certain that a prediction of dominance is indeed accurate.

We focused on analyzing false positives and false negatives as the most critical errors. From a business perspective, a false negative on the 'Hard' category is the most serious error, as it leads to underestimating the quality of the match and losing potential revenue from pricing tickets too low. Therefore, the tendency is to maximize recall on the 'Hard' category and, in general, to use the balanced F1 score as the primary metric, since it keeps all metrics and categories in balance.

Classification Models Performance Summary:

| Model | Accuracy | F1 Score | Time (s) |
|---|---|---|---|
| Logistic Regression | 0.94658 | 0.946378 | 1.901037 |
| Random Forest Classifier | 0.944806 | 0.944928 | 1.591301 |
| K-Nearest Neighbors (KNN) | 0.888823 | 0.888301 | 0.020602 |

After training and evaluating three different classification models on the engineered data, the Logistic Regression model was declared the winner. This model achieved the highest accuracy and F1 score, approximately 95%. The high score indicates the model's excellent ability to correctly classify the strength of the win (Easy, Medium, Hard) from the pre-match features. This is the final model selected for the classification task.
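A minimal sketch of training and scoring such a logistic-regression classifier, on a synthetic three-class problem standing in for Easy / Medium / Hard (not the real ATP features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score
from sklearn.model_selection import train_test_split

# Toy 3-class problem standing in for the Easy / Medium / Hard labels.
X, y = make_classification(n_samples=900, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Stratified 80/20 split keeps the class balance in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
f1 = f1_score(y_te, pred, average="weighted")
print(classification_report(y_te, pred))
```

`classification_report` produces the per-class precision/recall/F1 tables shown below for each model.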

Model: Logistic Regression

Classification Report:

                  precision    recall  f1-score   support

            Easy       0.95      0.97      0.96      1724
            Hard       0.95      0.97      0.96      1674
          Medium       0.93      0.91      0.92      1675

        accuracy                           0.95      5073
       macro avg       0.95      0.95      0.95      5073
    weighted avg       0.95      0.95      0.95      5073

[image]

The classification phase culminated in excellent performance: the winning model achieved an F1 score of 0.95 and accurately discriminated between the three difficulty categories (Easy, Medium, Hard). This success confirms the predictive power of our engineered features. The ability to classify matches provides significant strategic value by allowing us to predict the nature of the contest (easy vs. hard) before the match begins.

Model: K-Nearest Neighbors (KNN)

Classification Report:

                  precision    recall  f1-score   support

            Easy       0.90      0.92      0.91      1724
            Hard       0.91      0.92      0.92      1674
          Medium       0.85      0.82      0.84      1675

        accuracy                           0.89      5073
       macro avg       0.89      0.89      0.89      5073
    weighted avg       0.89      0.89      0.89      5073

[image]

The K-Nearest Neighbors (KNN) model was tested as an alternative classifier. It performed reasonably well, with an average F1 score of 0.89. However, confusion matrix analysis showed that the model had significant difficulty classifying matches of medium difficulty: many matches in the Medium category were incorrectly classified as Easy or Hard. This weakness in the intermediate range is one reason Logistic Regression was chosen as the final classification model.

Model: Random Forest Classifier

Classification Report:

                  precision    recall  f1-score   support

            Easy       0.97      0.95      0.96      1724
            Hard       0.96      0.96      0.96      1674
          Medium       0.91      0.93      0.92      1675

        accuracy                           0.94      5073
       macro avg       0.94      0.94      0.94      5073
    weighted avg       0.95      0.94      0.94      5073

[image]

The Random Forest Classifier was tested as another ensemble-based model for classification. It performed very well, with an average F1 score of 0.94 and an accuracy of 0.94. Confusion matrix analysis confirms the model's strong performance, with the vast majority of predictions falling on the diagonal. Despite this, the Logistic Regression model was chosen as the final winner due to its slightly higher F1 score (0.95) and its simplicity as a linear model.
