📊 Project Notebook

You can find the full analysis, code, and visualizations in the repository: 📄 View Analysis Notebook

🎾 How long will this match last?

A pre-match duration predictor for ATP men's singles

🎯 The question

If you're a TV producer or a tournament organiser, you need to know how long a match will last — and you need to know it before a single point is played. Block too short a TV slot and the match overruns; block too long and you waste airtime.

This project trains models to answer that question using only information available before the toss:

Framing	Target
Regression	match duration in minutes
Classification	Quick (<90 min) / Standard (90–130 min) / Long (>130 min)

🏟 The data


Source	`gmadevs/atp-matches-dataset` on Kaggle
Origin	Mirror of Jeff Sackmann's `tennis_atp` repository
Coverage	ATP main-tour singles, 2000 → mid-2015 (the gmadevs mirror was last updated in mid-2015)
Rows after cleaning	41,820 matches
Columns	49 raw + 12 engineered

Cleaning, in 5 bullet points

- Dropped retirements, walkovers, and rows with a missing target.
- Removed exact duplicates.
- Parsed tourney_date to a real datetime; extracted year and month.
- Replaced impossible heights (<150 or >230 cm) and ages (<14 or >45) with NaN; the pipeline imputes them later.
- Flagged 18 leakage columns (per-match aces, double faults, break points faced, etc.). They are not used as model features but are kept in the dataframe for EDA.
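
The five steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual code: the column names (`score`, `minutes`, `winner_ht`, `winner_age`, `w_ace`, ...) follow the Sackmann schema, while the `RET` / `W/O` markers in `score`, the `YYYYMMDD` date format, and the helper names `clean_matches`, `model_feature_columns` and `LEAKAGE_COLS` are assumptions made for illustration.

```python
import pandas as pd

# Assumption: the 18 leakage columns are the w_*/l_* per-match stats of the Sackmann schema.
STATS = ["ace", "df", "svpt", "1stIn", "1stWon", "2ndWon", "SvGms", "bpSaved", "bpFaced"]
LEAKAGE_COLS = [f"{side}_{stat}" for side in ("w", "l") for stat in STATS]


def clean_matches(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps to a raw matches dataframe (illustrative helper)."""
    df = df.copy()
    # 1. Drop retirements and walkovers (assumed to be marked in `score`) and missing targets.
    unfinished = df["score"].str.contains("RET|W/O", na=False)
    df = df[~unfinished].dropna(subset=["minutes"])
    # 2. Remove exact duplicates.
    df = df.drop_duplicates()
    # 3. Parse the date (assumed to be stored as YYYYMMDD) and extract year and month.
    df["tourney_date"] = pd.to_datetime(df["tourney_date"].astype(str), format="%Y%m%d")
    df["year"] = df["tourney_date"].dt.year
    df["month"] = df["tourney_date"].dt.month
    # 4. Out-of-range heights and ages become NaN; imputation happens later in the pipeline.
    for col in ("winner_ht", "loser_ht"):
        df[col] = df[col].where(df[col].between(150, 230))
    for col in ("winner_age", "loser_age"):
        df[col] = df[col].where(df[col].between(14, 45))
    return df.reset_index(drop=True)


def model_feature_columns(df: pd.DataFrame) -> list:
    """5. Leakage columns stay in the frame for EDA but are excluded from the features."""
    return [c for c in df.columns if c not in LEAKAGE_COLS and c != "minutes"]
```

Keeping the leakage list in one place means the EDA can still read those columns while every model is built only from `model_feature_columns(df)`.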

⚠️ The leakage trap (the most important design decision)

The strongest of the 18 per-match stats correlate 0.85–0.92 with minutes. Feeding them to the model would yield an unrealistically high R² and a useless predictor, because those features simply do not exist before the match. They're "the answer hiding in plain sight." Every modelling choice in this project respects that boundary.

🔍 First set — Six questions, six plots

1️⃣ Two distinct populations of match length

Best-of-3 matches centre near 90 min, best-of-5 near 140 min. Both are right-skewed.

2️⃣ The "Grass is longest" illusion

In raw stats, Grass had the highest mean. But it was a confound — most grass = Wimbledon = best-of-5. Once split by format, the surface order matches tennis intuition: Clay > Hard > Grass > Carpet.
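
The confound is easy to reproduce on a toy table. The numbers below are invented purely to show the mechanism (they are not taken from the dataset): Grass wins the raw comparison only because most of its rows are best-of-5, and the order flips once the mean is computed within each format.

```python
import pandas as pd

# Invented toy numbers, chosen only to reproduce the effect.
toy = pd.DataFrame({
    "surface": ["Grass", "Grass", "Grass", "Clay", "Clay", "Clay", "Clay"],
    "best_of": [5, 5, 3, 3, 3, 5, 3],
    "minutes": [140, 150, 80, 100, 95, 160, 105],
})

# Raw comparison: Grass looks longest because two of its three rows are best-of-5.
raw_mean = toy.groupby("surface")["minutes"].mean()

# Condition on format first: Clay is longer than Grass within each format.
by_format = toy.groupby(["best_of", "surface"])["minutes"].mean().unstack("surface")

print(raw_mean.round(1))
print(by_format)
```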

3️⃣ Round effect

R128 and R64 dominate the upper tail because those are the early Slam rounds (best-of-5).

4️⃣ Era trend

A slow upward drift in median duration over 2000–2015. Plausible drivers, none of them tested here, are polyester strings, fitter players, and slower courts.

5️⃣ Tall players hit more aces

Clear monotonic increase across height quartiles. (Note: aces are a leakage column — used here for the EDA story, not as a model feature.)

6️⃣ All top correlations are leakage

When you rank features by correlation with minutes, the top 15 are all per-match stats (w_svpt, l_svpt, w_SvGms, etc.) with correlations of 0.85–0.92, which confirms they cannot be used. The strongest legitimate signal is best_of.
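
A ranking like this takes a few lines of pandas. The sketch below uses synthetic data, with one column built to track duration the way an in-match stat does and one unrelated pre-match column; the construction and the coefficients are assumptions for illustration, not values from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
minutes = rng.normal(100, 30, n)

# Synthetic stand-in data (not the real matches).
toy = pd.DataFrame({
    "minutes": minutes,
    "w_svpt": 0.7 * minutes + rng.normal(0, 5, n),  # in-match stat: tracks duration closely
    "rank_gap": rng.normal(0, 50, n),               # pre-match feature: unrelated by construction
})

# Rank every candidate feature by absolute Pearson correlation with the target.
ranked = (
    toy.drop(columns="minutes")
    .corrwith(toy["minutes"])
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

Any column that lands at the top of such a ranking is worth checking against the "does this exist before the toss?" question before it goes anywhere near a model.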

🛠 Second set — Building better inputs

Twelve new features were added on top of the raw schema. Together they cover five angles: era and seasonality, competitive imbalance, player history, geographic context, and playing style.

Group	Features	What they capture
Date components	`year`, `month`	Era and seasonality, extracted from `tourney_date`.
Gaps & ratios	`rank_gap`, `age_gap`, `height_gap`, `rank_points_ratio`	The competitive imbalance of the match in a single number per dimension.
Player history	`winner_recent_avg_min`, `loser_recent_avg_min`	Each player's average match duration over their previous 20 matches. Computed with `.shift(1)` to guarantee no leakage.
Country	`winner_country_freq`, `loser_country_freq`	Frequency encoding of nationality (avoids 240-column one-hot explosion).
K-Means clusters	`winner_cluster`, `loser_cluster`	Each player's playing-style archetype (see below).
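
Two of these features are easy to get subtly wrong, so here is a hedged sketch of how they might be computed. It assumes the matches have been reshaped to a hypothetical long format (one row per player per match, with `player_id`, `match_order` and `minutes` columns), and that the country frequencies are learned from the training rows only; neither detail is confirmed by the notebook, and the function names are illustrative.

```python
import pandas as pd


def add_recent_avg(long_df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Average duration of each player's previous `window` matches."""
    out = long_df.sort_values(["player_id", "match_order"]).copy()
    out["recent_avg_min"] = out.groupby("player_id")["minutes"].transform(
        # shift(1) moves every value down one row, so the current match
        # never contributes to its own rolling window.
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return out


def frequency_encode(train_col: pd.Series, col: pd.Series) -> pd.Series:
    """Replace each category by its share of the training rows (unseen -> 0)."""
    freq = train_col.value_counts(normalize=True)
    return col.map(freq).fillna(0.0)


# Tiny demo with a window of 2 instead of 20.
toy = pd.DataFrame({
    "player_id": ["A", "A", "A", "B", "B"],
    "match_order": [1, 2, 3, 1, 2],
    "minutes": [60, 80, 100, 120, 90],
})
history = add_recent_avg(toy, window=2)
encoded = frequency_encode(pd.Series(["SRB", "SRB", "ESP", "SUI"]), pd.Series(["SRB", "ESP", "XXX"]))
```

A player's first match has no history, so its `recent_avg_min` is NaN and is left for the imputer. Note that matches within one tournament may share the same `tourney_date`, so a real sort key would need a within-tournament order as well.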

🧩 Player-style clustering

Each of the 444 players with ≥20 matches is profiled across 6 serve-side features: ace rate, double-fault rate, first-serve in %, first-serve points won %, second-serve points won %, and break-point-faced rate.

K-Means (k=4, chosen via elbow plot) splits them into four playing styles.

The heatmap below shows each cluster's z-score profile (red = above average, blue = below) — making the archetypes visible at a glance:

Cluster	n	Profile	Tennis archetype
3	95	high ace, high first-serve dominance, low pressure	Elite Server
2	138	low DF, steady on second serve, no extremes	Steady All-Courter
1	95	high DF, low first-in, weak second serve	Erratic
0	116	weak first serve, constantly under pressure	Vulnerable Server

The cluster ID becomes a feature; players with <20 matches are flagged as cluster -1 ("unknown").
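
A minimal sketch of that clustering step, under stated assumptions: the six profile column names are hypothetical, the features are standardised before K-Means (suggested by the z-score heatmap but not confirmed), and the player profiles here are random numbers rather than real serve stats.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical column names for the six serve-side features.
FEATURES = ["ace_rate", "df_rate", "first_in_pct", "first_won_pct", "second_won_pct", "bp_faced_rate"]


def assign_style_clusters(profiles, match_counts, k=4, min_matches=20, seed=42):
    """Cluster players with enough matches; everyone else gets -1 ("unknown")."""
    eligible = match_counts[match_counts >= min_matches].index
    scaled = StandardScaler().fit_transform(profiles.loc[eligible, FEATURES])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
    clusters = pd.Series(-1, index=profiles.index, name="cluster")
    clusters.loc[eligible] = labels
    return clusters


# Synthetic profiles: 25 players with 30 matches each, 5 players with only 5.
rng = np.random.default_rng(42)
players = [f"p{i}" for i in range(30)]
profiles = pd.DataFrame(rng.random((30, 6)), index=players, columns=FEATURES)
match_counts = pd.Series([30] * 25 + [5] * 5, index=players)
clusters = assign_style_clusters(profiles, match_counts)
```

The numeric cluster IDs carry no order, so the archetype names in the table come from reading each cluster's profile after the fit, not from the IDs themselves.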

📈 Third set — Three regressors on the engineered features

Same 80/20 split (random_state=42) and the same preprocessor recipe for every model. Three model heads on the engineered features, plus a linear baseline on the raw features.

Model	MAE (min)	RMSE (min)	R²
Baseline LR (raw features)	26.44	33.55	0.295
Improved LR (engineered)	26.12	33.18	0.311
Random Forest	26.33	33.28	0.307
🏆 Gradient Boosting	26.00	32.95	0.320
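
For readers who want to see the shape of such a setup, here is a hedged sketch of one preprocessor plus one model head, evaluated with the three metrics in the table. The exact preprocessor recipe is not spelled out in this README, so the median/most-frequent imputation, scaling and one-hot encoding below are assumptions, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: duration depends on best_of plus noise.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "best_of": rng.choice([3, 5], size=n, p=[0.7, 0.3]),
    "rank_gap": rng.normal(0, 50, n),
    "surface": rng.choice(["Hard", "Clay", "Grass"], size=n),
})
X.loc[X.sample(frac=0.1, random_state=42).index, "rank_gap"] = np.nan  # gaps to impute
y = 90 + 50 * (X["best_of"] == 5) + rng.normal(0, 15, n)

# Assumed recipe: impute + scale numerics, impute + one-hot categoricals.
numeric, categorical = ["best_of", "rank_gap"], ["surface"]
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])
model = Pipeline([("prep", preprocessor),
                  ("gb", GradientBoostingRegressor(random_state=42))])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE {mae:.2f} min | RMSE {rmse:.2f} min | R² {r2:.3f}")
```

Swapping the final step for a linear model or a random forest keeps the split and the preprocessor identical, which is what makes the rows of the table comparable.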

🤔 Why the improvements are small but real

best_of alone accounts for ~64% of the model's total feature importance. Once that is captured (the baseline already had it), there's little room left for engineered features to push the score significantly. Match length also has substantial irreducible randomness: the same conditions and the same two players can produce two completely different durations.

The gain that did come came from winner_recent_avg_min and loser_recent_avg_min. The cluster IDs contributed only ~3% of total importance because rolling averages capture "playing style" in a more granular way.

🎯 Fourth set — Reframing as classification

The continuous target was binned into three tennis-meaningful buckets:

Class	Range	Share
Quick	<90 min	42.4%
Standard	90–130 min	35.3%
Long	>130 min	22.3%

Mild imbalance — "Long" is the smallest class. Train/test split uses stratify=y_class to preserve proportions.
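
The binning and the stratified split might look like the sketch below. How matches of exactly 90 or 130 minutes are assigned is not stated above, so treating both ends of the Standard range as inclusive is an assumption; `to_duration_class` is an illustrative name and the durations are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def to_duration_class(minutes: pd.Series) -> pd.Series:
    """Quick (<90), Standard (90 to 130, both ends assumed inclusive), Long (>130)."""
    # Conditions are checked in order, so the second one only sees values >= 90.
    labels = np.select([minutes < 90, minutes <= 130], ["Quick", "Standard"], default="Long")
    return pd.Series(labels, index=minutes.index, name="duration_class")


# Synthetic durations: 40 quick, 35 standard, 25 long matches.
minutes = pd.Series([60] * 40 + [100] * 35 + [150] * 25)
y_class = to_duration_class(minutes)
X = pd.DataFrame({"feature": range(len(minutes))})  # placeholder feature matrix

# stratify keeps the class shares (almost) identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_class, test_size=0.2, stratify=y_class, random_state=42
)
print(y_test.value_counts(normalize=True))
```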

Precision or recall — what should we optimise?

For broadcast scheduling, recall on the "Long" class is the most important. Missing a long match (predicting Quick or Standard when it actually runs Long) leads to TV slot overruns and downstream programming chaos. Over-predicting Long just wastes airtime, which is recoverable. False negatives are the costly error. We report macro-F1 (which weights all three classes equally) instead of accuracy alone.
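
To make the difference between these metrics concrete, here is a small sketch with invented labels (ten toy matches, not model output). It computes accuracy, macro-F1, and recall on the "Long" class with scikit-learn.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Invented toy labels, only to show how the three numbers differ.
y_true = ["Quick", "Quick", "Quick", "Standard", "Standard", "Standard",
          "Long", "Long", "Long", "Long"]
y_pred = ["Quick", "Quick", "Standard", "Quick", "Standard", "Standard",
          "Standard", "Long", "Long", "Quick"]

accuracy = accuracy_score(y_true, y_pred)
# Macro-F1: the unweighted mean of the three per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Recall on "Long": the share of truly long matches that were predicted Long.
long_recall = recall_score(y_true, y_pred, labels=["Long"], average=None)[0]

print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.3f}  long_recall={long_recall:.3f}")
```

In this toy case half of the long matches are missed even though accuracy and macro-F1 both sit around 0.6, which is why a per-class view matters for scheduling.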

Three classifiers

Model	Accuracy	Macro F1
Logistic Regression	0.521	0.458
Random Forest	0.516	0.462
🏆 Gradient Boosting	0.518	0.468

The confusion matrix tells the same story for every model: Standard is the hardest class to predict because it's the "middle", and matches near the 90-min and 130-min boundaries are easily confused with their neighbours. Every model defaults to predicting Quick (the largest class) and under-predicts Long.

🏆 Game, set, match

Task	Winner	Headline metric
Regression	Gradient Boosting	R² 0.320 / MAE 26.0 min
Classification	Gradient Boosting	Macro F1 0.468 / Accuracy 51.8%

Both winners are pickled and live in this repo.

💭 Reflections — what I learned

The features I built worked, but only modestly. Match length has a soft ceiling for predictability with pre-match info only. The gain was real but small: ≈ +0.016 R² from feature engineering alone (baseline LR to improved LR) and ≈ +0.025 R² from the baseline to the best model. A real breakthrough would require information beyond what this pre-match dataset contains, such as weather, injury reports, or in-play form.

The clusters were the most fun part of the project but the smallest contribution. They identified four genuine playing-style archetypes, which is a great EDA story. But once we had each player's rolling average match duration, the cluster ID became partly redundant — the rolling avg captured playing-style information in a more granular way.

Country signal turned out to be weaker than I expected. It's mostly a proxy for "which dominant individual player happens to be from this country" rather than national style — Serbia is "long" because Djokovic plays long matches, not because Serbian tennis culture is uniquely grindy.

The leakage trap was the most important methodological choice. Resisting the temptation to use the 18 per-match stats forces the project to be intellectually honest. It's the difference between a model that looks impressive and one that's actually useful.

📁 Files in this repo

File	What it is
`atp_gb_regressor.pkl`	The pickled winning regression model (Gradient Boosting Pipeline).
`atp_classifier_winner.pkl`	The pickled winning classification model (Gradient Boosting Pipeline).
`Assignment_2_Data_science_course.ipynb`	The full Colab notebook — every step end-to-end.
`plots/`	The 10 figures referenced in this README.

▶️ How to use the saved models

import pickle

import pandas as pd
from huggingface_hub import hf_hub_download

# Download the pickled regression pipeline from the Hub and load it
# (unpickling requires scikit-learn to be installed)
path = hf_hub_download(
    repo_id="Kogann/atp-match-duration-regressor",
    filename="atp_gb_regressor.pkl",
)
with open(path, "rb") as f:
    regressor = pickle.load(f)

# X_new: a pandas DataFrame with the same engineered columns the model was trained on
predictions = regressor.predict(X_new)

The classifier loads identically — just swap to atp_classifier_winner.pkl.

🔁 Reproducibility

Every split, model, and shuffle uses random_state = 42. Tooling: pandas, numpy, seaborn, matplotlib, scikit-learn. No SHAP — feature importance comes from feature_importances_ (trees) and .coef_ (linear models) only.

👤 Author

Jonathan Kogan — Intro to Data Science course, Spring 2026. Educational project; not a production system.
