📊 Project Notebook

You can find the full analysis, code, and visualizations in the repository: 📄 View Analysis Notebook

🎾 How long will this match last?

A pre-match duration predictor for ATP men's singles


🎯 The question

If you're a TV producer or a tournament organiser, you need to know how long a match will last, and you need to know it before a single point is played. Block too short a TV slot and the match overruns; block too long and you waste airtime.

This project trains models to answer that question using only information available before the toss:

| Framing | Target |
|---|---|
| Regression | match duration in minutes |
| Classification | Quick (<90 min) / Standard (90–130 min) / Long (>130 min) |

🏟 The data

| Field | Value |
|---|---|
| Source | gmadevs/atp-matches-dataset on Kaggle |
| Origin | Mirror of Jeff Sackmann's tennis_atp repository |
| Coverage | ATP main-tour singles, 2000 to mid-2015 (the gmadevs mirror was last updated in mid-2015) |
| Rows after cleaning | 41,820 matches |
| Columns | 49 raw + 12 engineered |

Cleaning, in 5 bullet points

  • Dropped retirements, walkovers, and rows with a missing target.
  • Removed exact duplicates.
  • Parsed tourney_date to a real datetime; extracted year and month.
  • Replaced impossible heights (<150 or >230 cm) and ages (<14 or >45) with NaN; the pipeline imputes them later.
  • Flagged 18 leakage columns (per-match aces, double faults, break points faced, etc.). They are not used as model features but are kept in the dataframe for EDA.
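
The first four cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's code: `clean_matches` is a hypothetical helper, and the column names `score`, `winner_ht`, `loser_ht`, `winner_age` and `loser_age`, the `RET` / `W/O` score markers, and the `YYYYMMDD` date format are assumptions based on the Sackmann schema.

```python
import numpy as np
import pandas as pd


def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a raw match table (column names are assumed)."""
    out = df.copy()

    # 1. Drop retirements, walkovers, and rows with a missing target.
    unfinished = out["score"].fillna("").str.contains(r"RET|W/O", regex=True)
    out = out[~unfinished].dropna(subset=["minutes"])

    # 2. Remove exact duplicates.
    out = out.drop_duplicates()

    # 3. Parse tourney_date (assumed YYYYMMDD integers) and extract year and month.
    out["tourney_date"] = pd.to_datetime(out["tourney_date"].astype(str), format="%Y%m%d")
    out["year"] = out["tourney_date"].dt.year
    out["month"] = out["tourney_date"].dt.month

    # 4. Replace impossible heights and ages with NaN; imputation happens later.
    for col in ("winner_ht", "loser_ht"):
        out[col] = out[col].where(out[col].between(150, 230))
    for col in ("winner_age", "loser_age"):
        out[col] = out[col].where(out[col].between(14, 45))

    return out.reset_index(drop=True)
```

Setting out-of-range values to NaN rather than dropping the rows keeps the match in the training set and leaves the decision about a fill value to the imputer.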

⚠️ The leakage trap (the most important design decision)

The 18 per-match stats correlate ~0.9 with minutes. Throwing them at the model would yield an unrealistically high R² and a useless predictor, because those features simply do not exist before the match. They're "the answer hiding in plain sight." Every modelling choice in this project respects that boundary.
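
One way to enforce that boundary in code is to build the feature matrix through a single function that strips the leakage columns. The sketch below is an assumption about how this could look: `pre_match_features` is a hypothetical helper, and the 18 names are the nine per-match serve statistics of the Sackmann schema for winner and loser, which may not match the notebook's list exactly.

```python
import pandas as pd

# Assumed names: nine per-match serve stats, once for the winner (w_) and
# once for the loser (l_), giving 18 columns in total.
_STATS = ["ace", "df", "svpt", "1stIn", "1stWon", "2ndWon", "SvGms", "bpSaved", "bpFaced"]
LEAKAGE_COLS = [f"{side}_{stat}" for side in ("w", "l") for stat in _STATS]


def pre_match_features(df: pd.DataFrame, target: str = "minutes") -> pd.DataFrame:
    """Return only the columns that are known before the match starts."""
    blocked = set(LEAKAGE_COLS) | {target}
    return df[[c for c in df.columns if c not in blocked]]
```

The full dataframe stays available for EDA, while anything passed to a model goes through `pre_match_features` first.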


🔍 First set: Six questions, six plots

1️⃣ Two distinct populations of match length

Best-of-3 matches centre near 90 min, best-of-5 near 140 min. Both are right-skewed.

Match Duration by Format

2️⃣ The "Grass is longest" illusion

In raw stats, Grass had the highest mean. But that was a confound: most grass matches are played at Wimbledon, which is best-of-5. Once split by format, the surface order matches tennis intuition: Clay > Hard > Grass > Carpet.

Surface × Format
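
The confound can be checked by computing the mean duration per surface twice, once pooled and once within each format. A minimal sketch, assuming columns named `surface`, `best_of` and `minutes`; the helper name is hypothetical.

```python
import pandas as pd


def mean_minutes_by_surface(df: pd.DataFrame, by_format: bool) -> pd.Series:
    """Mean duration per surface, optionally split by best_of first."""
    keys = ["best_of", "surface"] if by_format else ["surface"]
    return df.groupby(keys)["minutes"].mean().sort_values(ascending=False)
```

Calling it with `by_format=False` gives the pooled ranking; `by_format=True` returns a series indexed by `(best_of, surface)`, so the surfaces can be compared inside each format.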

3️⃣ Round effect

R128 and R64 dominate the upper tail because those are the early Slam rounds (best-of-5).

Round Effect

4️⃣ Era trend

A slow upward drift in median duration over 2000–2015. Plausible contributors: polyester strings, fitter players, slower courts.

Era Trend

5️⃣ Tall players hit more aces

Clear monotonic increase across height quartiles. (Note: aces are a leakage column, used here for the EDA story, not as a model feature.)

Height vs Aces

6️⃣ All top correlations are leakage

When you rank features by correlation with minutes, the top 15 are all per-match stats (w_svpt, l_svpt, w_SvGms, etc.) with correlations of 0.85–0.92. This confirms they cannot be used as features. The strongest legitimate signal is best_of.
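
A ranking like this takes only a few lines of pandas. The sketch below assumes Pearson correlation on the numeric columns; the source does not state which coefficient was used, and `rank_by_correlation` is a hypothetical name.

```python
import pandas as pd


def rank_by_correlation(df: pd.DataFrame, target: str = "minutes", top: int = 15) -> pd.Series:
    """Absolute Pearson correlation of each numeric column with the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(top)
```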


🛠 Second set: Building better inputs

Twelve new features were added on top of the raw schema. Together they cover the four major angles: competitive imbalance, player history, geographic context, and playing style.

| Group | Features | What they capture |
|---|---|---|
| Date components | year, month | Era and seasonality, extracted from tourney_date. |
| Gaps & ratios | rank_gap, age_gap, height_gap, rank_points_ratio | The competitive imbalance of the match in a single number per dimension. |
| Player history | winner_recent_avg_min, loser_recent_avg_min | Each player's average match duration over their previous 20 matches. Computed with .shift(1) to guarantee no leakage. |
| Country | winner_country_freq, loser_country_freq | Frequency encoding of nationality (avoids a 240-column one-hot explosion). |
| K-Means clusters | winner_cluster, loser_cluster | Each player's playing-style archetype (see below). |
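
The player-history features are the ones where leakage is easiest to introduce by accident, so here is a sketch of how a shifted rolling mean could be computed. It is an illustration under assumptions, not the notebook's code: `add_recent_avg_minutes` is a hypothetical helper, `winner_id` and `loser_id` are assumed column names, and rows that share a `tourney_date` are assumed to already be in chronological order.

```python
import pandas as pd


def add_recent_avg_minutes(matches: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Add winner_/loser_recent_avg_min from each player's previous matches only."""
    m = matches.reset_index(drop=True)
    m["match_id"] = m.index

    # One row per (match, player), whether the player won or lost.
    long = pd.concat(
        [
            m[["match_id", "tourney_date", "minutes"]].assign(player=m[f"{role}_id"], role=role)
            for role in ("winner", "loser")
        ],
        ignore_index=True,
    ).sort_values(["player", "tourney_date", "match_id"])

    # shift(1) removes the current match from its own window, so the average
    # only sees matches that were already finished.
    long["recent_avg_min"] = long.groupby("player")["minutes"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )

    for role in ("winner", "loser"):
        hist = long.loc[long["role"] == role].set_index("match_id")["recent_avg_min"]
        m[f"{role}_recent_avg_min"] = m["match_id"].map(hist)
    return m.drop(columns="match_id")
```

A player's first recorded match gets NaN, which the imputer can handle downstream. Without the `shift(1)`, the duration being predicted would be part of its own feature.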

🧩 Player-style clustering

Each of the 444 players with ≥20 matches is profiled across 6 serve-side features: ace rate, double-fault rate, first-serve in %, first-serve points won %, second-serve points won %, and break-point-faced rate.

K-Means (k=4, chosen via elbow plot) splits them into four playing styles:

Player Clusters in 2D (PCA)

The heatmap below shows each cluster's z-score profile (red = above average, blue = below), making the archetypes visible at a glance:

Cluster Heatmap

| Cluster | n | Profile | Tennis archetype |
|---|---|---|---|
| 3 | 95 | high ace, high first-serve dominance, low pressure | Elite Server |
| 2 | 138 | low DF, steady on second serve, no extremes | Steady All-Courter |
| 1 | 95 | high DF, low first-in, weak second serve | Erratic |
| 0 | 116 | weak first serve, constantly under pressure | Vulnerable Server |

The cluster ID becomes a feature; players with <20 matches are flagged as cluster -1 ("unknown").
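
The clustering step could look roughly like the sketch below. It assumes the six serve-side rates are already aggregated into one row per player and that they are standardised before K-Means; the source shows z-scores in the heatmap but does not say whether scaling preceded clustering. `cluster_players` and its arguments are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_players(profiles: pd.DataFrame, n_matches: pd.Series,
                    k: int = 4, min_matches: int = 20) -> pd.Series:
    """Cluster eligible players on standardised serve stats; the rest get -1."""
    eligible = n_matches.reindex(profiles.index).fillna(0) >= min_matches
    labels = pd.Series(-1, index=profiles.index, name="cluster")

    # Standardise so that no single rate dominates the Euclidean distance.
    X = StandardScaler().fit_transform(profiles.loc[eligible])
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels.loc[eligible] = km.fit_predict(X)
    return labels
```

The returned series can be mapped onto the winner and loser of each match to produce `winner_cluster` and `loser_cluster`. Note that these profiles are built from per-match serve statistics, so the cluster label itself is computed from data that includes matches later than the one being predicted unless the profiles are restricted to earlier matches.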


📈 Third set: Three regressors on the engineered features

Same 80/20 split (random_state=42). Same preprocessor recipe. Three model heads.
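
The "one split, one preprocessor, three heads" setup maps naturally onto scikit-learn pipelines. The sketch below is an assumption about the recipe: median and most-frequent imputation, scaling, and one-hot encoding are plausible defaults, not details confirmed by the source, and the function names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_preprocessor(num_cols, cat_cols):
    """Impute and scale numeric columns; impute and one-hot categorical ones."""
    numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                        ("scale", StandardScaler())])
    categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))])
    return ColumnTransformer([("num", numeric, num_cols), ("cat", categorical, cat_cols)])


def compare_regressors(X, y, num_cols, cat_cols):
    """Fit three heads on one shared split and return MAE, RMSE and R2 per model."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    heads = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    }
    rows = {}
    for name, head in heads.items():
        # A fresh preprocessor per head, fitted on the training fold only.
        pipe = Pipeline([("prep", make_preprocessor(num_cols, cat_cols)), ("model", head)])
        pred = pipe.fit(X_tr, y_tr).predict(X_te)
        rows[name] = {
            "MAE": mean_absolute_error(y_te, pred),
            "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
            "R2": r2_score(y_te, pred),
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```

Wrapping the preprocessor inside each pipeline means the imputer and scaler statistics come from the training rows only, so the test fold does not influence them.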

Regression Comparison

| Model | MAE (min) | RMSE (min) | R² |
|---|---|---|---|
| Baseline LR (raw features) | 26.44 | 33.55 | 0.295 |
| Improved LR (engineered) | 26.12 | 33.18 | 0.311 |
| Random Forest | 26.33 | 33.28 | 0.307 |
| 🏆 Gradient Boosting | 26.00 | 32.95 | 0.320 |

🤔 Why the improvements are small but real

best_of alone accounts for ~64% of the model's total feature importance. Once that is captured (the baseline already had it), there is little room left for engineered features to move the score. Match length also has substantial irreducible randomness: under the same conditions, the same two players can produce two completely different durations.

The gain that did come came from winner_recent_avg_min and loser_recent_avg_min. The cluster IDs contributed only ~3% of total importance because the rolling averages capture "playing style" in a more granular way:

Feature Importance


🎯 Fourth set: Reframing as classification

The continuous target was binned into three tennis-meaningful buckets:

| Class | Range | Share |
|---|---|---|
| Quick | <90 min | 42.4% |
| Standard | 90–130 min | 35.3% |
| Long | >130 min | 22.3% |

Mild imbalance: "Long" is the smallest class. The train/test split uses stratify=y_class to preserve the class proportions.
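
The binning and the stratified split can be sketched as follows. The boundary handling (exactly 90 and exactly 130 minutes both count as Standard) follows the ranges in the table; the function names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def bin_duration(minutes: pd.Series) -> pd.Series:
    """Quick: < 90 min, Standard: 90 to 130 min inclusive, Long: > 130 min."""
    # Assumes missing durations were dropped earlier; a NaN would fall through to "Long".
    labels = np.select([minutes < 90, minutes <= 130], ["Quick", "Standard"], default="Long")
    return pd.Series(labels, index=minutes.index, name="duration_class")


def stratified_split(X: pd.DataFrame, y_class: pd.Series):
    """80/20 split that keeps the class shares equal in train and test."""
    return train_test_split(X, y_class, test_size=0.2, random_state=42, stratify=y_class)
```

`np.select` is used instead of `pd.cut` so that both boundaries of the Standard bucket can be inclusive without adjusting the bin edges.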

Precision or recall β€” what should we optimise?

For broadcast scheduling, recall on the "Long" class matters most. Missing a long match (predicting Quick or Standard when it actually runs Long) leads to TV slot overruns and downstream programming chaos. Over-predicting Long just wastes airtime, which is recoverable. False negatives are the costly error. We therefore report macro-F1 (which weights all three classes equally) rather than accuracy alone, so that the small Long class counts as much as the other two.
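
Both quantities discussed above are available in scikit-learn. The sketch below assumes the class labels are the strings "Quick", "Standard" and "Long"; `scheduling_metrics` is a hypothetical helper.

```python
from sklearn.metrics import f1_score, recall_score


def scheduling_metrics(y_true, y_pred) -> dict:
    """Macro-F1 across all classes plus recall on the costly 'Long' class."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # Restricting labels to "Long" makes the macro average that class's recall.
        "long_recall": recall_score(y_true, y_pred, labels=["Long"], average="macro"),
    }
```

Reporting `long_recall` next to macro-F1 makes the scheduling-relevant error visible directly instead of leaving it folded into an average.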

Three classifiers

| Model | Accuracy | Macro F1 |
|---|---|---|
| Logistic Regression | 0.521 | 0.458 |
| Random Forest | 0.516 | 0.462 |
| 🏆 Gradient Boosting | 0.518 | 0.468 |

Confusion Matrix: Gradient Boosting Classifier

The confusion matrix tells the same story for every model: Standard is the hardest class to predict because it's the "middle", and matches near the 90-min and 130-min boundaries are easily confused with their neighbours. The model defaults to predicting Quick (the largest class) and under-predicts Long.
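
A row-normalised confusion matrix makes this pattern easy to read, because each row then shows where the matches of one true class ended up. A minimal sketch, assuming string class labels; `confusion_table` is a hypothetical helper.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

CLASS_ORDER = ["Quick", "Standard", "Long"]


def confusion_table(y_true, y_pred, normalize: bool = True) -> pd.DataFrame:
    """Rows are true classes, columns are predictions; rows sum to 1 if normalised."""
    cm = confusion_matrix(y_true, y_pred, labels=CLASS_ORDER,
                          normalize="true" if normalize else None)
    return pd.DataFrame(cm, index=CLASS_ORDER, columns=CLASS_ORDER)
```

With normalisation on, the diagonal entry of the Long row is the Long recall, and the other two entries in that row show how the missed long matches were split between Quick and Standard.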


🏆 Game, set, match

| Task | Winner | Headline metric |
|---|---|---|
| Regression | Gradient Boosting | R² 0.320 / MAE 26.0 min |
| Classification | Gradient Boosting | Macro F1 0.468 / Accuracy 51.8% |

Both winners are pickled and live in this repo.


💭 Reflections: what I learned

The features I built worked, but only modestly. Match length has a soft ceiling for predictability with pre-match info only. The gain was real (R² rose from 0.295 for the baseline to 0.311 with the engineered features and 0.320 with Gradient Boosting, ≈ +0.025 overall), but a breakthrough would require information that the pre-match data does not contain: weather, injury reports, in-play form.

The clusters were the most fun part of the project but the smallest contribution. They identified four genuine playing-style archetypes, which is a great EDA story. But once we had each player's rolling average match duration, the cluster ID became partly redundant, because the rolling average captured playing-style information in a more granular way.

Country signal turned out to be weaker than I expected. It's mostly a proxy for "which dominant individual player happens to be from this country" rather than national style: Serbia is "long" because Djokovic plays long matches, not because Serbian tennis culture is uniquely grindy.

The leakage trap was the most important methodological choice. Resisting the temptation to use the 18 per-match stats forces the project to be intellectually honest. It's the difference between a model that looks impressive and one that's actually useful.


📁 Files in this repo

| File | What it is |
|---|---|
| atp_gb_regressor.pkl | The pickled winning regression model (Gradient Boosting Pipeline). |
| atp_classifier_winner.pkl | The pickled winning classification model (Gradient Boosting Pipeline). |
| Assignment_2_Data_science_course.ipynb | The full Colab notebook, with every step end-to-end. |
| plots/ | The 10 figures referenced in this README. |

▶️ How to use the saved models

import pickle

import pandas as pd
from huggingface_hub import hf_hub_download

# Download and load the regression model.
# Only unpickle files from a source you trust: pickle.load can run arbitrary code.
path = hf_hub_download(
    repo_id="Kogann/atp-match-duration-regressor",
    filename="atp_gb_regressor.pkl",
)
with open(path, "rb") as f:
    regressor = pickle.load(f)

# X_new is a pandas DataFrame that must contain the same engineered columns
# the model was trained on.
predictions = regressor.predict(X_new)

The classifier loads identically: just swap to atp_classifier_winner.pkl.


🔁 Reproducibility

Every split, model, and shuffle uses random_state = 42. Tooling: pandas, numpy, seaborn, matplotlib, scikit-learn. No SHAP: feature importance comes from feature_importances_ (trees) and .coef_ (linear models) only.


👤 Author

Jonathan Kogan, Intro to Data Science course, Spring 2026. Educational project; not a production system.
