Hyrox Race Time Prediction

At a glance: the whole project in one list

  1. Loaded ~92,000 Hyrox race results from Kaggle (jgug05/hyrox-results) via kagglehub.
  2. Cleaned the data: parsed every time column from H:MM:SS strings to seconds, filled missing nationality with Unknown, dropped the small fraction of rows missing age_group, and removed outliers using a domain bound of 45–180 minutes.
  3. Explored the dataset with seven visualisations: total-time distribution, gender comparison, performance by age band, variability across the eight workout stations, fatigue across the eight runs (animated), the full correlation heatmap, and per-station coefficient of variation.
  4. Recognised the leakage trap: total_time ≈ run_time + work_time + roxzone_time, so feeding raw splits into the regressor would push R² to ~1.0 and produce a meaningless model. Splits are used only for clustering.
  5. Built a baseline regressor: a plain Linear Regression on demographics only (gender + age band + division), which scored R² = 0.16 and an MAE of 11.6 minutes.
  6. Engineered five new features: numeric age, is_male, year, region (continent grouping of nationality), and event_size.
  7. Clustered athletes into four archetypes with KMeans (k = 4) on the 16 split features. The cluster ID and cluster_distance become two new features.
  8. Trained three improved regressors (Linear Regression, Random Forest, Gradient Boosting) on the engineered feature set. Gradient Boosting won with R² = 0.834 and MAE of 5.35 minutes, about half the baseline error.
  9. Reframed the task as classification: split total_time at the field median (84.9 minutes) to produce two perfectly balanced classes. Trained three classifiers; Gradient Boosting again won, with F1 = 0.892 on the slow class and 88.7% overall accuracy.
  10. Trained two demo models without cluster features (a Gradient Boosting regressor and a Logistic Regression classifier targeting total_time < 90 min) so a first-time competitor can use the live predictor without race history.
  11. Pickled every model, saved every plot, and shipped a Gradio app to a Hugging Face Space for the live "will I break 90 minutes?" demo.

Project Overview

Hyrox is a hybrid fitness race that has exploded in popularity in recent years. Every event runs the same fixed format: eight 1-kilometre runs alternating with eight strength-and-conditioning stations (SkiErg, Sled Push, Sled Pull, Burpee Broad Jumps, Rowing, Farmer's Carry, Sandbag Lunges, Wall Balls). Athletes compete across age groups and divisions, with results timed to the second for every individual segment.

That fixed structure makes it an unusually clean dataset for predictive modelling. Every race produces the same 28 time columns (1 total + 3 aggregates + 24 splits across runs / stations / transitions), plus a small handful of demographic columns. The modelling question is twofold:

  1. Regression β€” given an athlete's demographics and the event they're racing in, how long will they take to finish?
  2. Classification β€” will the athlete finish faster or slower than the field median? And in the live demo, will they break the 90-minute mark?

The project is also a case study in avoiding data leakage. The split times in the dataset literally sum to the total time, so a naive regressor that uses splits as features would get a perfect R² and learn nothing. The whole methodology is designed around that constraint.

Dataset

  • Source: jgug05/hyrox-results on Kaggle
  • Size after cleaning: 91,508 rows × 34 columns
  • Time columns: total_time + 3 aggregates + 24 splits
  • Demographic columns: gender, age_group, division, nationality
  • Event metadata: event_name (which encodes city + year)

Cleaning steps applied

  • Time parsing. Every time column comes in as a string like "0:59:07". A small parser converts H:MM:SS and MM:SS formats into seconds and stashes the result in new _sec columns (see the sketch after this list).
  • Missing values. nationality is missing for roughly a third of rows (too many to drop), so those become a new Unknown category. age_group is missing for less than 1%, so those rows are dropped. Any row whose total_time couldn't be parsed is dropped too, since the regression target wouldn't be defined for it.
  • Outlier removal. Pure IQR would clip the long tail of slower recreational athletes, who are real and valid. Instead, the bound is domain-informed: anything outside 45–180 minutes is almost certainly a timing error or a DNF.
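
A minimal sketch of the parsing and filtering steps above, assuming the raw CSV is loaded into a pandas DataFrame and that the target column is called total_time; the notebook's helper differs in detail.

import pandas as pd

def time_to_seconds(value):
    # Parse "H:MM:SS" or "MM:SS" strings into seconds; return None if unparseable.
    if not isinstance(value, str):
        return None
    try:
        parts = [int(p) for p in value.strip().split(":")]
    except ValueError:
        return None
    if len(parts) == 3:
        h, m, s = parts
    elif len(parts) == 2:
        h, m, s = 0, parts[0], parts[1]
    else:
        return None
    return h * 3600 + m * 60 + s

df = pd.read_csv("hyrox_results.csv")                          # illustrative path
df["total_time_sec"] = df["total_time"].map(time_to_seconds)
df = df.dropna(subset=["total_time_sec"])                      # unparseable target -> drop row

# Domain-informed outlier bound: keep finishes between 45 and 180 minutes.
df = df[df["total_time_sec"].between(45 * 60, 180 * 60)]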

Exploratory Findings

Q1: How is total finish time distributed?

The distribution is right-skewed: most athletes finish between roughly 60 and 120 minutes, with a thin tail of slower racers stretching out toward the 3-hour cutoff. The mean sits noticeably to the right of the median (84.9 minutes), as expected for a distribution with a long upper tail.

Total time distribution

Q2: Do men and women finish in different times?

Hyrox is a strength-and-endurance hybrid, and the gender effect is one of the strongest single signals in the dataset: men are faster on average across every age band, division, and event size. This shows up later as a large negative coefficient on is_male in every regression model and as a top-ranked feature in the winning classifier.

Total time by gender

Q3: How does performance change with age?

Age groups are sorted in ascending age order rather than lexicographically, so the bands appear in their natural sequence on the axis. Peak athletic performance sits in the late 20s / early 30s, with a gentle but clear decline through the older age groups.

Total time by age group

Q4: Which station separates strong athletes from weaker ones?

Each box represents the distribution of times athletes spent at one station. Absolute values aren't directly comparable (sandbag lunges take longer than burpees in seconds), so the more informative metric is the coefficient of variation (standard deviation divided by mean). Wall balls and sandbag lunges have the highest CV, meaning that's where individual fitness differences matter most.
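
A short sketch of the CV calculation, assuming the parsed station splits live in _sec columns that share a work_ prefix (an illustrative naming convention, not necessarily the notebook's).

# Coefficient of variation (std / mean) per station, highest first.
station_cols = [c for c in df.columns if c.startswith("work_") and c.endswith("_sec")]
cv_per_station = (df[station_cols].std() / df[station_cols].mean()).sort_values(ascending=False)
print(cv_per_station)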

Time per workout station

Q5: Does fatigue accumulate across the eight runs? (animated)

Each frame adds the next run's average time and standard deviation. You can watch run times creep up across the eight segments: the cumulative fatigue effect made visible. By run 8, the average athlete is roughly 30–60 seconds slower than they were on run 1.

Fatigue animation

Q6: How are the time components correlated?

The deep red diagonal block at the top-left isn't a surprise: total_time is the sum of run_time, work_time, and roxzone_time by construction. The interesting signal lives between the run splits and between the station splits. The 8 run columns are very strongly correlated with each other, and the 8 station columns are strongly correlated with each other, but the cross-block correlations (run vs. station) are weaker. That's evidence of stable "runner type" and "lifter type" personas, exactly the signal KMeans extracts in Step 3.

Correlation matrix

The Leakage Trap (the central design decision)

total_time is mathematically the sum of run_time, work_time, and roxzone_time. Equivalently, it's the sum of all 24 split times. Throwing those splits into the regression as features would push R² to ~1.0: the model would just learn a near-trivial linear combination of them. The metric would look fantastic, but the model would be useless for any practical prediction.

The fix: use the splits only for clustering, then feed the resulting cluster ID back as a single, summary feature. That way the supervised models still benefit from the rich information embedded in the split structure, but they never see anything that would let them reverse-engineer the target.

Step 1 - Baseline Regression

Before adding any clever features, the project establishes a floor: a plain Linear Regression on demographics only (gender + age band + division, all one-hot encoded). The point of the baseline isn't to be good; it's to be the bar every later model has to clear.
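
A sketch of the baseline, assuming the cleaned DataFrame df and the parsed total_time_sec target from the cleaning step (column names illustrative).

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Demographics only, one-hot encoded.
X_base = pd.get_dummies(df[["gender", "age_group", "division"]], drop_first=True)
y = df["total_time_sec"]

X_train, X_test, y_train, y_test = train_test_split(X_base, y, test_size=0.2, random_state=42)
baseline = LinearRegression().fit(X_train, y_train)

pred = baseline.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, pred) / 60:.2f} min, R^2: {r2_score(y_test, pred):.3f}")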

Metric | Value
MAE | 11.58 minutes
RMSE | 15.14 minutes
R² | 0.160

Demographics alone explain only 16% of finish-time variance, and predictions are off by 11.6 minutes on average. Plenty of room to improve.

Step 2 - Feature Engineering

Five engineered features are added on top of the demographic baseline:

Feature | What it captures | How it's built
age_numeric | Continuous age signal | Midpoint of each age band: 25-29 becomes 27, 40-44 becomes 42
is_male | Binary gender flag | 1 if gender == 'male', else 0
year | Temporal trend | Regex extracts the 4-digit year from event_name
region | Geographic effect, low cardinality | Continental grouping: Europe / North America / Asia / Oceania / South America / Africa / Other
event_size | Event-scale proxy | Number of athletes at each event

The region feature is the most useful design choice. Raw nationality has 100+ unique values, almost all appearing in only a handful of rows. Grouping into seven continental buckets keeps the geographic signal without the dimensionality cost.
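
A sketch of how the five features could be built; the age-band midpoint and continent lookups are abbreviated here, and the full dictionaries in the notebook cover every band and nationality.

# age_numeric: midpoint of each age band (abbreviated lookup).
age_midpoints = {"25-29": 27, "30-34": 32, "40-44": 42}                          # etc.
df["age_numeric"] = df["age_group"].map(age_midpoints)

# is_male: binary gender flag.
df["is_male"] = (df["gender"].str.lower() == "male").astype(int)

# year: 4-digit year pulled out of the event name with a regex.
df["year"] = df["event_name"].str.extract(r"(\d{4})", expand=False).astype(float)

# region: continental grouping of nationality (abbreviated, illustrative lookup).
nation_to_region = {"GER": "Europe", "USA": "North America", "Unknown": "Other"}  # etc.
df["region"] = df["nationality"].map(nation_to_region).fillna("Other")

# event_size: number of athletes at each event.
df["event_size"] = df.groupby("event_name")["event_name"].transform("size")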

Step 3 - Clustering Athletes Into Archetypes

KMeans with k = 4 is run on the 16 split-time features (8 runs + 8 stations). Standardisation matters here β€” KMeans is distance-based and we don't want runs (which take longer in absolute seconds) to dominate stations purely because of scale.
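
A sketch of the clustering step: standardise the 16 splits, fit KMeans with k = 4, and keep both the label and the distance to the assigned centroid. The split-column selection below is an assumed naming convention.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

split_cols = [c for c in df.columns
              if c.endswith("_sec") and (c.startswith("run_") or c.startswith("work_"))]
X_splits = StandardScaler().fit_transform(df[split_cols])       # 8 runs + 8 stations, scaled

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(X_splits)

# Distance to the assigned centroid: a "how typical of this archetype are you" score.
all_distances = kmeans.transform(X_splits)                      # shape (n_rows, 4)
df["cluster_distance"] = all_distances[np.arange(len(df)), df["cluster"].to_numpy()]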

Visualising the clusters in 2D

PCA reduces the 16-dimensional split space to two components. PC1 (the horizontal axis) captures most of the variance and reads as an "overall speed" axis. PC2 captures additional variance and reads as a "run-strong vs. lift-strong" trade-off axis.
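
A sketch of the 2-D projection, reusing X_splits and the cluster labels from the clustering sketch above.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2, random_state=42).fit_transform(X_splits)
plt.scatter(coords[:, 0], coords[:, 1], c=df["cluster"], s=2, cmap="viridis")
plt.xlabel("PC1 (overall speed)")
plt.ylabel("PC2 (run-strong vs. lift-strong)")
plt.title("Athlete archetypes in PCA space")
plt.show()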

PCA cluster visualisation

What the four clusters mean

After fitting, each cluster's mean total / run / work time tells you exactly which archetype it represents:

Cluster | Athletes | Mean total | Mean run | Mean work | Archetype
1 | 35,217 | 73.0 min | 38.6 min | 28.9 min | Elite: uniformly fast across runs and stations
2 | 20,005 | 88.4 min | 50.8 min | 30.3 min | Lift-strong, run-weak: relatively fast at stations, slower runs
3 | 26,579 | 93.8 min | 45.4 min | 41.0 min | Run-strong, lift-weak: relatively fast runs, slower at stations
0 | 9,707 | 120.3 min | 62.7 min | 47.8 min | Recreational: uniformly slow across runs and stations

The cluster ID becomes a single feature in the supervised models, while cluster_distance (Euclidean distance to the assigned centroid) is a continuous "how typical of this archetype are you" score.

Cluster archetypes heatmap

Step 4 - Three Regression Models on the Engineered Features

With the engineered feature set in hand (demographics + the five new features + cluster + cluster_distance), three regressors are trained on the same 80% training split and evaluated on the same held-out 20% test set.

Model A - Linear Regression

The same algorithm as the baseline, but now on the richer feature set. It assumes the target is a linear combination of the features and fits the coefficients by ordinary least squares.

Model B - Random Forest

An ensemble of 100 decision trees. Each tree sees a random subset of rows and a random subset of features, so the trees are decorrelated; the final prediction is the average of all 100 trees. Trees automatically handle non-linearities and feature interactions.

Model C - Gradient Boosting

200 trees trained sequentially, each one fitted to the residual errors of the ensemble built so far. Many shallow trees (max_depth=5) plus a small learning_rate=0.05 let the model fit subtle patterns gradually.
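
A sketch of the head-to-head comparison, assuming X_eng already holds the demographic one-hots plus the engineered, cluster, and cluster_distance columns; hyperparameters mirror those quoted in the model descriptions.

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

y = df["total_time_sec"]
X_train, X_test, y_train, y_test = train_test_split(X_eng, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=200, max_depth=5, learning_rate=0.05, random_state=42),
}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred) / 60:.2f} min, "
          f"R^2={r2_score(y_test, pred):.3f}")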

Comparison

Model | MAE (min) | RMSE (min) | R²
Baseline LR (demographics only) | 11.58 | 15.14 | 0.160
Linear Regression (engineered) | 9.10 | 11.82 | 0.488
Random Forest | 5.30 | 6.77 | 0.832
Gradient Boosting ← winner | 5.35 | 6.74 | 0.834

Winning regressor: Gradient Boosting, with R² = 0.834, about five times the baseline. MAE is more than halved, dropping from 11.58 minutes (baseline) to 5.35 minutes. Note that Random Forest and Gradient Boosting are essentially tied on this dataset; GB edges out by a fraction of a percent on R² and on RMSE.

The biggest single jump in the table is from R² = 0.160 to 0.488: same Linear Regression algorithm, just with the engineered features and the cluster ID. That's the engineering doing its work; the additional gain to 0.83 comes from non-linear interactions the tree models can capture but linear regression can't.

Regression model comparison

Which features matter most in the winning regressor?

The feature importance plot tells the whole story of why the engineering worked:

  1. cluster - by far the most important feature, with importance several times the next-largest.
  2. cluster_distance - second.
  3. division_open, region_Other, event_size, age_numeric, gender_male, is_male, year β€” the next tier.

The two cluster-derived features alone account for the lion's share of the model's predictive power. That's strong evidence the leakage-avoidance design works: by routing the rich split information through a clustering step, we extract the predictive signal without letting the regressor reverse-engineer the target.

Feature importance

Step 5 - Reframing as Classification

The continuous target is reframed as a binary one by splitting total_time at the field median = 84.9 minutes. This produces perfectly balanced classes (50/50): "fast" (below median) and "slow" (at or above median). The balance matters because it means raw accuracy is meaningful, no resampling is needed, and no class-weight tricks are required.
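
A sketch of the reframing: threshold at the median, then stratify the split so the 50/50 balance is preserved in both train and test.

from sklearn.model_selection import train_test_split

median_sec = df["total_time_sec"].median()                      # ~84.9 minutes here
y_class = (df["total_time_sec"] >= median_sec).astype(int)      # 1 = slow, 0 = fast

X_train, X_test, y_train, y_test = train_test_split(
    X_eng, y_class, test_size=0.2, random_state=42, stratify=y_class)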

Precision vs. recall: which matters more?

For this dataset, the costs of the two error types are roughly symmetric, so F1 on the slow class is the primary selection metric.

That said, in a coaching use case (flagging athletes who are likely to struggle) the costs become asymmetric: missing a struggler (false negative) defeats the purpose, while a false alarm (false positive) just means giving extra attention to someone who didn't strictly need it. False negatives are the more critical error in that framing, and the appropriate metric becomes recall on the slow class.

Step 6 - Three Classification Models

Same engineered feature set as the regression. Same 80/20 split, but the target is now is_slow (1 if at or above median, else 0) and train_test_split uses stratify=y_class to preserve the 50/50 balance.

Classifier A - Logistic Regression

Predicts the log-odds of class 1 as a weighted sum of the (scaled) input features and squashes through a sigmoid to produce a probability. Fast, interpretable, well-calibrated probabilities.

Classifier B - Random Forest Classifier

Same architecture as the Random Forest regressor, but each leaf now votes for class 0 or class 1 instead of predicting a continuous value. 200 trees instead of 100.

Classifier C - Gradient Boosting Classifier

Boosting variant for binary classification. Same hyperparameters as the GBM regressor (200 trees, depth 5, lr 0.05).
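
A sketch of the winning classifier's training and per-class evaluation, reusing the stratified split from the sketch in Step 5.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

clf = GradientBoostingClassifier(
    n_estimators=200, max_depth=5, learning_rate=0.05, random_state=42)
clf.fit(X_train, y_train)

# Per-class precision / recall / F1; class 1 is the "slow" (at or above median) class.
print(classification_report(y_test, clf.predict(X_test), target_names=["fast", "slow"]))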

Comparison

Model | Precision (slow) | Recall (slow) | F1 (slow) | Accuracy
Logistic Regression | 0.817 | 0.787 | 0.802 | 0.805
Random Forest | 0.857 | 0.930 | 0.892 | 0.887
Gradient Boosting ← winner | 0.853 | 0.934 | 0.892 | 0.887

Winning classifier: Gradient Boosting, with F1 = 0.892 on the slow class and 88.7% overall accuracy. As with the regression, Random Forest and Gradient Boosting are nearly tied; both crush Logistic Regression on this task by ~9 F1 points, again because the cluster feature interacts non-linearly with division and age in ways linear models can't capture.

The GB classifier catches 93.4% of slow finishers (recall) at 85.3% precision: a strong recall-leaning balance, which is exactly what the coaching use case wants.

Confusion matrix of the winning classifier

Step 7 - The Demo Models

The headline regressor uses two cluster-based features (cluster, cluster_distance) derived from a runner's past race splits. A first-time competitor doesn't have those splits yet, so the live Space couldn't ask for them in a form. Two demo models are trained on a slimmer feature set (demographics and event metadata only, no cluster information):

  • A Gradient Boosting regressor for the predicted finish time. Without cluster info, MAE rises from 5.35 to 11.09 minutes, and R² drops from 0.834 to 0.228; that's the price of usability for the live demo.
  • A Logistic Regression classifier trained directly on the binary target total_time < 90 min. The class balance is roughly 63/37 (most athletes finish under 90 min in this dataset), and the model achieves 67% accuracy with F1 = 0.76 on the under-90 class, giving a reasonably calibrated probability for the live demo (see the sketch after this list).
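
A sketch of the demo classifier under those constraints, assuming X_demo holds only demographic and event-metadata columns a first-timer can type into a form (the exact column selection is illustrative).

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo = pd.get_dummies(
    df[["gender", "age_group", "division", "region", "year", "event_size"]], drop_first=True)
y_sub90 = (df["total_time_sec"] < 90 * 60).astype(int)          # 1 = breaks 90 minutes

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_sub90, test_size=0.2, random_state=42, stratify=y_sub90)

demo_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The live Space surfaces this as the probability of breaking 90 minutes.
p_sub90 = demo_clf.predict_proba(X_test)[:, 1]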

This is exactly the kind of "production vs. analytical model" decision real ML teams make: there's an offline model that uses everything you know about the user, and a real-time model that uses only the inputs you can collect at request time.

Files in This Repository

  • hyrox_assignment.ipynb: the full notebook (EDA, modelling, and evaluation, end to end).
  • hyrox_notebook_walkthrough.pdf: cell-by-cell explanation of the notebook.
  • hyrox_regressor.pkl: the pickled winning regression model (Gradient Boosting, uses cluster features).
  • hyrox_regressor_features.pkl: the list of feature columns the regressor expects.
  • hyrox_classifier.pkl: the pickled winning classifier (Gradient Boosting).
  • hyrox_classifier_features.pkl: the list of feature columns the classifier expects.
  • plots/: all 11 figures (10 PNGs + the fatigue GIF).
  • app.py + requirements.txt: the Gradio Space app.
  • hyrox_demo_*.pkl: five smaller demo artifacts used by the live Space (no cluster features).

How to Use the Saved Models

import pickle
import pandas as pd

with open("hyrox_regressor.pkl", "rb") as f:
    regressor = pickle.load(f)
with open("hyrox_regressor_features.pkl", "rb") as f:
    feature_columns = pickle.load(f)

# Build a single-row DataFrame with the same engineered columns the model was trained on,
# then reindex to guarantee the column order matches.
new_athlete = pd.DataFrame([{
    "age_numeric": 27,
    "is_male": 1,
    "year": 2024,
    "event_size": 1500,
    "cluster": 1,            # 1 = elite, 2 = lift-strong, 3 = run-strong, 0 = recreational
    "cluster_distance": 1.4,
    # plus the one-hot columns from gender / age_group / division / region
}]).reindex(columns=feature_columns, fill_value=0)

predicted_seconds = regressor.predict(new_athlete)[0]
print(f"Predicted finish time: {predicted_seconds / 60:.1f} minutes")

The classifier loads the same way; just swap in hyrox_classifier.pkl and call predict() to get 0 (fast) or 1 (slow).

Reproducibility

All randomness is seeded with SEED = 42 (Python random, NumPy, PYTHONHASHSEED, every scikit-learn estimator). Re-running the notebook on the same data produces identical metrics and identical pickle files. The Hugging Face Space is pinned to Python 3.11 with scikit-learn 1.6.1, numpy 2.0.2, and pandas 2.2.2, the exact same environment Colab used when the pickles were created.
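
A sketch of the seeding block described above; scikit-learn estimators are then constructed with random_state=SEED individually.

import os
import random
import numpy as np

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)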

Live Predictor - Will I break 90 minutes?

A Gradio app running on Hugging Face Spaces lets you input demographics and event details and get back (a) an estimated total finish time and (b) the probability you'll come in under 90 minutes.
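
A heavily simplified sketch of how such a Space wires inputs to the demo models; build_features is a hypothetical placeholder for the real app.py's feature-row construction, and the pickle filename is illustrative.

import pickle
import gradio as gr

with open("hyrox_demo_classifier.pkl", "rb") as f:      # illustrative demo-artifact name
    demo_clf = pickle.load(f)

def predict_sub90(gender, age_group, division):
    row = build_features(gender, age_group, division)   # hypothetical feature-row builder
    prob = demo_clf.predict_proba(row)[0, 1]
    return f"Probability of breaking 90 minutes: {prob:.0%}"

gr.Interface(
    fn=predict_sub90,
    inputs=[gr.Dropdown(["male", "female"]), gr.Dropdown(["25-29", "30-34"]), gr.Textbox()],
    outputs="text",
).launch()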

Open in Spaces

Author

This project was completed by Michael Gelshtein as part of a Data Science course at Reichman University. It is intended for educational purposes only β€” the predictions are a demonstration of supervised-learning techniques on a sport dataset, not professional training advice.
