- 🎾 How long will this match last?
- 🎯 The question
- The data
- First set: Six questions, six plots
- Second set: Building better inputs
- Third set: Three regressors on the engineered features
- 🎯 Fourth set: Reframing as classification
- Game, set, match
- Reflections: what I learned
- Files in this repo
- ▶️ How to use the saved models
- Reproducibility
- 👤 Author
Project Notebook
You can find the full analysis, code, and visualizations in the repository: View Analysis Notebook
🎾 How long will this match last?
A pre-match duration predictor for ATP men's singles
🎯 The question
If you're a TV producer or a tournament organiser, you need to know how long a match will last, and you need to know it before a single point is played. Block too short a TV slot and the match overruns; block too long and you waste airtime.
This project trains models to answer that question using only information available before the toss:
| Framing | Target |
|---|---|
| Regression | match duration in minutes |
| Classification | Quick (<90 min) / Standard (90–130 min) / Long (>130 min) |
The data
| Item | Details |
|---|---|
| Source | `gmadevs/atp-matches-dataset` on Kaggle |
| Origin | Mirror of Jeff Sackmann's `tennis_atp` repository |
| Coverage | ATP main-tour singles, 2000 to mid-2015 (the gmadevs mirror was last updated in mid-2015) |
| Rows after cleaning | 41,820 matches |
| Columns | 49 raw + 12 engineered |
Cleaning, in 5 bullet points
- Dropped retirements, walkovers, and rows with a missing target.
- Removed exact duplicates.
- Parsed `tourney_date` to a real datetime; extracted `year` and `month`.
- Replaced impossible heights (<150 or >230 cm) and ages (<14 or >45) with NaN; the pipeline imputes them later.
- Flagged 18 leakage columns: per-match aces, double faults, break points faced, etc. They are not used as model features but are kept in the dataframe for EDA.
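The five cleaning steps above could look roughly like the sketch below. It is not the notebook's code: it assumes the Sackmann-style column names (`score`, `minutes`, `tourney_date`, `winner_ht`, `loser_ht`, `winner_age`, `loser_age`), that retirements and walkovers are marked as `RET` and `W/O` in the score string, and that `tourney_date` is stored as `YYYYMMDD`. The helper name `clean_matches` is hypothetical.

```python
import pandas as pd


def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning bullets; column names are assumptions."""
    out = df.copy()

    # 1. Drop retirements / walkovers (assumed to be flagged in the score string)
    #    and rows with a missing target.
    unfinished = out["score"].str.contains(r"RET|W/O", na=False)
    out = out[~unfinished].dropna(subset=["minutes"])

    # 2. Remove exact duplicates.
    out = out.drop_duplicates()

    # 3. Parse the date (assumed YYYYMMDD) and extract year / month.
    out["tourney_date"] = pd.to_datetime(out["tourney_date"].astype(str), format="%Y%m%d")
    out["year"] = out["tourney_date"].dt.year
    out["month"] = out["tourney_date"].dt.month

    # 4. Replace impossible heights and ages with NaN; imputation happens later.
    for side in ("winner", "loser"):
        ht = out[f"{side}_ht"]
        out[f"{side}_ht"] = ht.where(ht.between(150, 230))
        age = out[f"{side}_age"]
        out[f"{side}_age"] = age.where(age.between(14, 45))

    return out.reset_index(drop=True)


# Toy rows, invented for illustration only.
demo = pd.DataFrame({
    "tourney_date": [20150105, 20150105, 20150112, 20150112],
    "score": ["6-4 6-4", "6-4 6-4", "6-3 2-0 RET", "7-6(5) 6-7(3) 6-4"],
    "minutes": [85.0, 85.0, 40.0, 160.0],
    "winner_ht": [185, 185, 190, 300],
    "loser_ht": [180, 180, 175, 188],
    "winner_age": [25.0, 25.0, 30.0, 27.0],
    "loser_age": [22.0, 22.0, 12.0, 29.0],
})
cleaned = clean_matches(demo)
print(cleaned[["year", "month", "minutes", "winner_ht"]])
```

In the toy frame, the duplicate row and the retirement are dropped, and the 300 cm height becomes NaN.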
⚠️ The leakage trap (the most important design decision)
The 18 per-match stats correlate ~0.9 with `minutes`. Feeding them to the model would yield an unrealistically high R² and a useless predictor, because those features simply do not exist before the match. They're "the answer hiding in plain sight." Every modelling choice in this project respects that boundary.
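One way to enforce that boundary in code is to keep an explicit block list and build the feature matrix only from what is left. The sketch below assumes the 18 leakage columns are the nine per-match serve stats of the Sackmann schema, once for the winner (`w_`) and once for the loser (`l_`); the README does not list them, so treat the list as an assumption. `split_features_target` and its `extra_blocked` parameter are hypothetical names.

```python
import pandas as pd

# Assumed leakage list: 9 per-match stats x 2 players = 18 columns.
STAT_SUFFIXES = ["ace", "df", "svpt", "1stIn", "1stWon", "2ndWon", "SvGms", "bpSaved", "bpFaced"]
LEAKAGE_COLS = [f"{side}_{stat}" for side in ("w", "l") for stat in STAT_SUFFIXES]


def split_features_target(df, target="minutes", extra_blocked=()):
    """Return (X, y) with every blocked column removed from X.

    The leakage columns stay in `df` itself, so they remain available for EDA.
    """
    blocked = set(LEAKAGE_COLS) | set(extra_blocked) | {target}
    feature_cols = [c for c in df.columns if c not in blocked]
    return df[feature_cols], df[target]


# Toy rows, invented for illustration only.
demo = pd.DataFrame({
    "best_of": [3, 5],
    "w_ace": [5, 20],
    "l_bpFaced": [4, 9],
    "score": ["6-4 6-4", "7-6 6-7 6-4 6-3"],
    "minutes": [80, 170],
})
X, y = split_features_target(demo, extra_blocked=("score",))
print(list(X.columns))
```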
First set: Six questions, six plots
1️⃣ Two distinct populations of match length
Best-of-3 matches centre near 90 min, best-of-5 near 140 min. Both are right-skewed.
2️⃣ The "Grass is longest" illusion
In the raw stats, Grass had the highest mean duration. But that was a confound: most grass matches are played at Wimbledon, which is best-of-5. Once split by format, the surface order matches tennis intuition: Clay > Hard > Grass > Carpet.
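The confound is easy to reproduce on toy data. The numbers below are invented purely to illustrate the effect (they are not from the dataset): grouping by surface alone puts Grass on top, while grouping by format and surface puts Clay on top within each format.

```python
import pandas as pd

# Toy data, invented for illustration only.
matches = pd.DataFrame({
    "surface": ["Clay", "Clay", "Clay", "Hard", "Grass", "Grass", "Grass", "Grass"],
    "best_of": [3, 3, 5, 3, 3, 5, 5, 5],
    "minutes": [100, 100, 160, 90, 80, 140, 140, 140],
})

# Naive view: mean duration per surface, ignoring format.
raw_mean = matches.groupby("surface")["minutes"].mean()

# Controlled view: mean duration per surface within each format.
by_format = (
    matches.groupby(["best_of", "surface"])["minutes"]
    .mean()
    .unstack("best_of")
)

print(raw_mean.sort_values(ascending=False))
print(by_format)
```

Grass wins the naive comparison only because three of its four toy matches are best-of-5.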
3️⃣ Round effect
R128 and R64 dominate the upper tail because those are the early Slam rounds (best-of-5).
4️⃣ Era trend
A slow upward drift in median duration over 2000–2015. Plausible contributors: polyester strings, fitter players, slower courts.
5️⃣ Tall players hit more aces
Clear monotonic increase across height quartiles. (Note: aces are a leakage column, used here for the EDA story and not as a model feature.)
6️⃣ All top correlations are leakage
When you rank features by correlation with `minutes`, the top 15 are all per-match stats (`w_svpt`, `l_svpt`, `w_SvGms`, etc.) with correlations of 0.85–0.92. This confirms they cannot be used as features. The strongest legitimate signal is `best_of`.
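A correlation ranking of this kind can be produced in a couple of lines. The sketch below uses synthetic data in which `w_svpt` is built to track `minutes` closely, so it only demonstrates the mechanics; the helper name `rank_by_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(df, target="minutes"):
    """Absolute Pearson correlation of every numeric column with the target, descending."""
    features = df.drop(columns=target)
    return features.corrwith(df[target]).abs().sort_values(ascending=False)


# Synthetic data: w_svpt is constructed to be a near-copy of the target.
rng = np.random.default_rng(0)
n = 200
w_svpt = rng.normal(80, 20, size=n)
df = pd.DataFrame({
    "w_svpt": w_svpt,
    "rank_gap": rng.normal(0, 50, size=n),
    "minutes": 1.2 * w_svpt + rng.normal(0, 5, size=n),
})

ranking = rank_by_correlation(df)
print(ranking)
```

A column that sits at the top of such a ranking with a correlation near 0.9 is worth checking for leakage before it is trusted as a feature.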
Second set: Building better inputs
Twelve new features were added on top of the raw schema. Together they cover the four major angles: competitive imbalance, player history, geographic context, and playing style.
| Group | Features | What they capture |
|---|---|---|
| Date components | `year`, `month` | Era and seasonality, extracted from `tourney_date`. |
| Gaps & ratios | `rank_gap`, `age_gap`, `height_gap`, `rank_points_ratio` | The competitive imbalance of the match in a single number per dimension. |
| Player history | `winner_recent_avg_min`, `loser_recent_avg_min` | Each player's average match duration over their previous 20 matches. Computed with `.shift(1)` to guarantee no leakage. |
| Country | `winner_country_freq`, `loser_country_freq` | Frequency encoding of nationality (avoids a 240-column one-hot explosion). |
| K-Means clusters | `winner_cluster`, `loser_cluster` | Each player's playing-style archetype (see below). |
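The `.shift(1)` detail in the player-history row is what keeps the current match out of its own feature. A minimal sketch of one way to compute it follows. It assumes `winner_id` and `loser_id` columns, and that rows sharing a `tourney_date` are already in playing order; the actual notebook may differ. `add_recent_avg_minutes` is a hypothetical helper name.

```python
import pandas as pd


def add_recent_avg_minutes(df, window=20):
    """Add winner/loser rolling mean duration over each player's previous `window` matches."""
    df = df.reset_index(drop=True).copy()
    df["match_idx"] = df.index
    base = df[["match_idx", "tourney_date", "minutes"]]

    # One row per (match, player) so wins and losses share one timeline per player.
    long = pd.concat(
        [
            base.assign(player=df["winner_id"], role="winner"),
            base.assign(player=df["loser_id"], role="loser"),
        ],
        ignore_index=True,
    ).sort_values(["tourney_date", "match_idx"])

    # shift(1) drops the current match from its own window: no leakage.
    long["recent_avg_min"] = long.groupby("player")["minutes"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )

    wide = long.pivot(index="match_idx", columns="role", values="recent_avg_min")
    df["winner_recent_avg_min"] = wide["winner"]
    df["loser_recent_avg_min"] = wide["loser"]
    return df.drop(columns="match_idx")


# Toy rows, invented for illustration only.
demo = pd.DataFrame({
    "tourney_date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "winner_id": ["A", "A", "B"],
    "loser_id": ["B", "C", "A"],
    "minutes": [100, 80, 120],
})
out = add_recent_avg_minutes(demo)
print(out[["winner_id", "loser_id", "winner_recent_avg_min", "loser_recent_avg_min"]])
```

A player's first match has no history, so the feature is NaN there and is left for the imputer.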
🧩 Player-style clustering
Each of the 444 players with ≥20 matches is profiled across 6 serve-side features: ace rate, double-fault rate, first-serve in %, first-serve points won %, second-serve points won %, and break-point-faced rate.
K-Means (k=4, chosen via elbow plot) splits them into four playing styles:
The heatmap below shows each cluster's z-score profile (red = above average, blue = below), making the archetypes visible at a glance:
| Cluster | n | Profile | Tennis archetype |
|---|---|---|---|
| 3 | 95 | high ace, high first-serve dominance, low pressure | Elite Server |
| 2 | 138 | low DF, steady on second serve, no extremes | Steady All-Courter |
| 1 | 95 | high DF, low first-in, weak second serve | Erratic |
| 0 | 116 | weak first serve, constantly under pressure | Vulnerable Server |
The cluster ID becomes a feature; players with <20 matches are flagged as cluster -1 ("unknown").
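The clustering step can be sketched as below, under several assumptions: the six profile column names are made up, the features are standardised before K-Means (suggested by the z-score heatmap but not stated outright), and the player profiles are random numbers rather than real serve stats. `cluster_players` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical names for the six serve-side profile features.
STYLE_COLS = [
    "ace_rate", "df_rate", "first_in_pct",
    "first_won_pct", "second_won_pct", "bp_faced_rate",
]


def cluster_players(profiles, min_matches=20, k=4, seed=42):
    """Cluster eligible players; players below `min_matches` get cluster -1."""
    eligible = profiles[profiles["n_matches"] >= min_matches]
    z = StandardScaler().fit_transform(eligible[STYLE_COLS])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(z)

    clusters = pd.Series(-1, index=profiles.index, name="cluster")
    clusters.loc[eligible.index] = labels
    return clusters


# Random stand-in profiles: 50 eligible players, 10 with too few matches.
rng = np.random.default_rng(0)
profiles = pd.DataFrame(rng.random((60, len(STYLE_COLS))), columns=STYLE_COLS)
profiles["n_matches"] = [25] * 50 + [5] * 10

clusters = cluster_players(profiles)
print(clusters.value_counts().sort_index())
```

The integer labels K-Means returns are arbitrary, so archetype names such as "Elite Server" have to be attached after inspecting each cluster's profile.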
Third set: Three regressors on the engineered features
Same 80/20 split (random_state=42). Same preprocessor recipe. Three model heads.
| Model | MAE (min) | RMSE (min) | R² |
|---|---|---|---|
| Baseline LR (raw features) | 26.44 | 33.55 | 0.295 |
| Improved LR (engineered) | 26.12 | 33.18 | 0.311 |
| Random Forest | 26.33 | 33.28 | 0.307 |
| Gradient Boosting (best) | 26.00 | 32.95 | 0.320 |
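A comparison like the one in the table can be set up by sharing one preprocessor across several scikit-learn pipelines. The sketch below is an illustration on synthetic data, not the notebook's code: the preprocessing recipe (median or most-frequent imputation, scaling, one-hot encoding) and the three toy columns are assumptions, and the printed metrics will not match the table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: duration driven mostly by best_of, plus noise.
rng = np.random.default_rng(42)
n = 600
X = pd.DataFrame({
    "best_of": rng.choice([3, 5], size=n, p=[0.8, 0.2]),
    "surface": rng.choice(["Hard", "Clay", "Grass"], size=n),
    "rank_gap": rng.normal(0, 50, size=n),
})
y = 90 + 50 * (X["best_of"] == 5) + rng.normal(0, 30, size=n)

# Assumed preprocessing recipe, shared by all three models.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric, ["best_of", "rank_gap"]),
    ("cat", categorical, ["surface"]),
])

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for name, model in models.items():
    pipe = Pipeline([("prep", preprocessor), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }

print(pd.DataFrame(results).T.round(3))
```

Keeping the split and the preprocessor identical across models means any difference in the metrics comes from the model alone.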
🤔 Why the improvements are small but real
`best_of` alone accounts for ~64% of the model's total feature importance. Once that is captured (the baseline already had it), there's little room left for engineered features to push the score significantly. Match length also has substantial irreducible randomness: same conditions, same two players, two completely different durations.
The gain that did come, came from `winner_recent_avg_min` and `loser_recent_avg_min`. The cluster IDs contributed only ~3% of total importance because rolling averages capture "playing style" in a more granular way.
🎯 Fourth set: Reframing as classification
The continuous target was binned into three tennis-meaningful buckets:
| Class | Range | Share |
|---|---|---|
| Quick | <90 min | 42.4% |
| Standard | 90–130 min | 35.3% |
| Long | >130 min | 22.3% |
Mild imbalance: "Long" is the smallest class. The train/test split uses `stratify=y_class` to preserve proportions.
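The binning and the stratified split can be sketched as follows. The boundaries follow the table (90 and 130 minutes both fall in Standard); the toy durations and the helper name `to_duration_class` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def to_duration_class(minutes):
    """Quick: <90, Standard: 90-130 inclusive, Long: >130. Expects no missing values."""
    minutes = np.asarray(minutes, dtype=float)
    return np.select(
        [minutes < 90, minutes <= 130],
        ["Quick", "Standard"],
        default="Long",
    )


# Toy durations: 40 Quick, 40 Standard, 20 Long.
minutes = np.array([60] * 40 + [110] * 40 + [170] * 20)
y_class = to_duration_class(minutes)
X = pd.DataFrame({"row_id": np.arange(len(minutes))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y_class, test_size=0.2, random_state=42, stratify=y_class
)

print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_test).value_counts(normalize=True))
```

With `stratify`, the 40/40/20 class mix is reproduced in both the train and test parts.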
Precision or recall β what should we optimise?
For broadcast scheduling, recall on the "Long" class is the most important. Missing a long match (predicting Quick or Standard when it actually runs Long) leads to TV slot overruns and downstream programming chaos. Over-predicting Long just wastes airtime, which is recoverable. False negatives are the costly error. We report macro-F1 (which weights all three classes equally) instead of accuracy alone.
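To make the two metrics concrete, here is a small worked example with scikit-learn. The ten labels are invented; they are not model output from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Invented labels for illustration: Q = Quick, S = Standard, L = Long.
y_true = ["Q", "Q", "Q", "Q", "S", "S", "S", "L", "L", "L"]
y_pred = ["Q", "Q", "Q", "S", "Q", "S", "S", "S", "L", "Q"]

# Recall on Long: of the 3 truly long matches, how many were flagged as Long?
long_recall = recall_score(y_true, y_pred, labels=["L"], average=None)[0]

# Macro-F1: unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

accuracy = accuracy_score(y_true, y_pred)

print(f"Long recall: {long_recall:.3f}")
print(f"Macro F1:    {macro_f1:.3f}")
print(f"Accuracy:    {accuracy:.3f}")
```

Here accuracy is 0.6, yet only one of the three long matches is caught (recall 1/3), which is exactly the kind of miss that causes a slot overrun.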
Three classifiers
| Model | Accuracy | Macro F1 |
|---|---|---|
| Logistic Regression | 0.521 | 0.458 |
| Random Forest | 0.516 | 0.462 |
| Gradient Boosting (best) | 0.518 | 0.468 |
The confusion matrix tells the same story for every model: Standard is the hardest class to predict because it's the "middle", and matches near the 90-min and 130-min boundaries are easily confused with their neighbours. The model defaults to predicting Quick (the largest class) and under-predicts Long.
Game, set, match
| Task | Winner | Headline metric |
|---|---|---|
| Regression | Gradient Boosting | R² 0.320 / MAE 26.0 min |
| Classification | Gradient Boosting | Macro F1 0.468 / Accuracy 51.8% |
Both winners are pickled and live in this repo.
Reflections: what I learned
The features I built worked, but only modestly. Match length has a soft ceiling for predictability with pre-match info only. The gain from feature engineering was real (≈ +0.025 R²), but a real breakthrough would require info that isn't visible before the match: weather, injury reports, in-play form.
The clusters were the most fun part of the project but the smallest contribution. They identified four genuine playing-style archetypes, which is a great EDA story. But once we had each player's rolling average match duration, the cluster ID became partly redundant, because the rolling average captured playing-style information in a more granular way.
Country signal turned out to be weaker than I expected. It's mostly a proxy for "which dominant individual player happens to be from this country" rather than national style: Serbia is "long" because Djokovic plays long matches, not because Serbian tennis culture is uniquely grindy.
The leakage trap was the most important methodological choice. Resisting the temptation to use the 18 per-match stats forces the project to be intellectually honest. It's the difference between a model that looks impressive and one that's actually useful.
Files in this repo
| File | What it is |
|---|---|
| `atp_gb_regressor.pkl` | The pickled winning regression model (Gradient Boosting Pipeline). |
| `atp_classifier_winner.pkl` | The pickled winning classification model (Gradient Boosting Pipeline). |
| `Assignment_2_Data_science_course.ipynb` | The full Colab notebook: every step end-to-end. |
| `plots/` | The 10 figures referenced in this README. |
▶️ How to use the saved models
```python
import pickle

import pandas as pd
from huggingface_hub import hf_hub_download

# Load the regression model
path = hf_hub_download(
    repo_id="Kogann/atp-match-duration-regressor",
    filename="atp_gb_regressor.pkl",
)
with open(path, "rb") as f:
    regressor = pickle.load(f)

# X_new is a pandas DataFrame you supply; it must contain the same
# engineered columns the model was trained on
predictions = regressor.predict(X_new)
```
The classifier loads identically: just swap the filename to `atp_classifier_winner.pkl`.
Reproducibility
Every split, model, and shuffle uses `random_state = 42`. Tooling: pandas, numpy, seaborn, matplotlib, scikit-learn. No SHAP: feature importance comes from `feature_importances_` (trees) and `.coef_` (linear models) only.
👤 Author
Jonathan Kogan, Intro to Data Science course, Spring 2026. Educational project; not a production system.
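Reading importances from those two attributes can be sketched as follows, on synthetic data with made-up columns. With a fitted `Pipeline`, the same attributes live on the final estimator step, and the names must come from the preprocessor's output columns; that part is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: duration driven by best_of; the other columns are noise.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "best_of": rng.choice([3, 5], size=n),
    "rank_gap": rng.normal(0, 50, size=n),
    "noise": rng.normal(size=n),
})
y = 90 + 25 * (X["best_of"] - 3) + rng.normal(0, 10, size=n)

# Tree ensembles: impurity-based importances, normalised to sum to 1.
gb = GradientBoostingRegressor(random_state=42).fit(X, y)
importances = pd.Series(gb.feature_importances_, index=X.columns).sort_values(ascending=False)

# Linear models: one signed coefficient per feature (scale-dependent).
lr = LinearRegression().fit(X, y)
coefs = pd.Series(lr.coef_, index=X.columns)

print(importances.round(3))
print(coefs.round(3))
```

Coefficients are only comparable across features when the inputs share a scale, which is one reason to standardise numeric columns before a linear model.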









