# Hyrox Race Time Prediction
## At a glance – the whole project in one list

- **Loaded** ~92,000 Hyrox race results from Kaggle (`jgug05/hyrox-results`) via `kagglehub`.
- **Cleaned the data** – parsed every time column from `H:MM:SS` strings to seconds, filled missing `nationality` with `Unknown`, dropped the small fraction of rows missing `age_group`, and removed outliers using a domain bound of 45–180 minutes.
- **Explored the dataset** with seven visualisations: total-time distribution, gender comparison, performance by age band, variability across the eight workout stations, fatigue across the eight runs (animated), the full correlation heatmap, and per-station coefficient of variation.
- **Recognised the leakage trap** – `total_time` = `run_time` + `work_time` + `roxzone_time`, so feeding raw splits into the regressor would push R² to ~1.0 and produce a meaningless model. Splits are used only for clustering.
- **Built a baseline regressor** – a plain Linear Regression on demographics only (gender + age band + division), which scored R² = 0.16 and an MAE of 11.6 minutes.
- **Engineered five new features** – numeric age, `is_male`, `year`, `region` (continent grouping of nationality), and `event_size`.
- **Clustered athletes into four archetypes** with KMeans (k = 4) on the 16 split features. The cluster ID and `cluster_distance` become two new features.
- **Trained three improved regressors** – Linear Regression, Random Forest, Gradient Boosting – on the engineered feature set. Gradient Boosting won with R² = 0.834 and MAE of 5.35 minutes, about half the baseline error.
- **Reframed the task as classification** – split `total_time` at the field median (84.9 minutes) to produce two perfectly balanced classes. Trained three classifiers; Gradient Boosting again won, with F1 = 0.892 on the slow class and 88.7% overall accuracy.
- **Trained two demo models without cluster features** – a Gradient Boosting regressor and a Logistic Regression classifier targeting `total_time < 90 min` – so a first-time competitor can use the live predictor without race history.
- **Pickled** every model, saved every plot, and shipped a Gradio app to a Hugging Face Space for the live "will I break 90 minutes?" demo.
## Project Overview
Hyrox is a hybrid fitness race that has exploded in popularity in recent years. Every event runs the same fixed format: eight 1-kilometre runs alternating with eight strength-and-conditioning stations (SkiErg, Sled Push, Sled Pull, Burpee Broad Jumps, Rowing, Farmer's Carry, Sandbag Lunges, Wall Balls). Athletes compete across age groups and divisions, with results timed to the second for every individual segment.
That fixed structure makes it an unusually clean dataset for predictive modelling. Every race produces the same 28 time columns (1 total + 3 aggregates + 24 splits across runs / stations / transitions), plus a small handful of demographic columns. The modelling question is twofold:
- **Regression** – given an athlete's demographics and the event they're racing in, how long will they take to finish?
- **Classification** – will the athlete finish faster or slower than the field median? And in the live demo, will they break the 90-minute mark?
The project is also a case study in avoiding data leakage. The split times in the dataset literally sum to the total time, so a naive regressor that uses splits as features would get a perfect RΒ² and learn nothing. The whole methodology is designed around that constraint.
## Dataset

| Property | Value |
|---|---|
| Source | `jgug05/hyrox-results` on Kaggle |
| Size after cleaning | 91,508 rows × 34 columns |
| Time columns | `total_time` + 3 aggregates + 24 splits |
| Demographic columns | `gender`, `age_group`, `division`, `nationality` |
| Event metadata | `event_name` (which encodes city + year) |
### Cleaning steps applied

- **Time parsing.** Every time column comes in as a string like `"0:59:07"`. A small parser converts `H:MM:SS` and `MM:SS` formats into seconds and stashes the result in new `_sec` columns.
- **Missing values.** `nationality` is missing for roughly a third of rows – too many to drop, so those become a new `Unknown` category. `age_group` is missing for less than 1%, so those rows are dropped. Any row whose `total_time` couldn't be parsed is dropped too, since the regression target wouldn't be defined for it.
- **Outlier removal.** Pure IQR would clip the long tail of slower recreational athletes, who are real and valid. Instead, the bound is domain-informed: anything outside 45–180 minutes is almost certainly a timing error or a DNF.
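The parsing and outlier-bound steps described above can be sketched in a few lines. This is a minimal illustration; the notebook's actual helper names and edge-case handling may differ:

```python
def parse_time_to_seconds(value):
    """Parse 'H:MM:SS' or 'MM:SS' strings into seconds; return None if unparseable."""
    if not isinstance(value, str):
        return None
    try:
        nums = [int(p) for p in value.strip().split(":")]
    except ValueError:
        return None
    if len(nums) == 3:          # H:MM:SS
        h, m, s = nums
    elif len(nums) == 2:        # MM:SS
        h, (m, s) = 0, nums
    else:
        return None
    return h * 3600 + m * 60 + s

def within_bounds(seconds):
    """Domain-informed outlier bound: keep 45-180 minutes (2700-10800 s)."""
    return seconds is not None and 2700 <= seconds <= 10800
```

Rows whose `total_time` fails `parse_time_to_seconds` or `within_bounds` are the ones dropped during cleaning.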
## Exploratory Findings

### Q1: How is total finish time distributed?

The distribution is right-skewed – most athletes finish between roughly 60 and 120 minutes, with a thin tail of slower racers stretching out toward the 3-hour cutoff. The mean sits noticeably to the right of the median (84.9 minutes), as expected for a distribution with a long upper tail.
### Q2: Do men and women finish in different times?

Hyrox is a strength-and-endurance hybrid, and the gender effect is one of the strongest single signals in the dataset – men are faster on average across every age band, division, and event size. This shows up later as a large negative coefficient on `is_male` in every regression model and as a top-ranked feature in the winning classifier.
### Q3: How does performance change with age?

Age groups are sorted in ascending order rather than alphabetically (which would put 35-39 before 18-24). Peak athletic performance sits in the late 20s / early 30s, with a gentle but clear decline through the older age groups.
### Q4: Which station separates strong athletes from weaker ones?
Each box represents the distribution of times athletes spent at one station. Absolute values aren't directly comparable (sandbag lunges take longer than burpees in seconds), so the more informative metric is the coefficient of variation (standard deviation divided by mean). Wall balls and sandbag lunges have the highest CV, meaning that's where individual fitness differences matter most.
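The coefficient-of-variation comparison can be sketched in a couple of pandas lines. The numbers and column names below are toy illustrations, not the dataset's actual values:

```python
import pandas as pd

# Toy per-station split times in seconds (illustrative column names).
df = pd.DataFrame({
    "wall_balls_sec": [300, 420, 610, 250, 540],   # wide spread across athletes
    "ski_erg_sec":    [230, 245, 260, 238, 252],   # narrow spread
})

# Coefficient of variation = std / mean. Dividing by the mean makes stations
# with different absolute durations directly comparable.
cv = (df.std() / df.mean()).sort_values(ascending=False)
print(cv)
```

Ranking `cv` puts the stations where individual fitness differences matter most at the top.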
### Q5: Does fatigue accumulate across the eight runs? (animated)

Each frame adds the next run's average time and standard deviation. You can watch run times creep up across the eight segments – the cumulative fatigue effect made visible. By run 8, the average athlete is roughly 30–60 seconds slower than they were on run 1.
### Q6: How are the time components correlated?

The deep red diagonal block at the top-left isn't a surprise – `total_time` is the sum of `run_time`, `work_time`, and `roxzone_time` by construction. The interesting signal lives between the run splits and between the station splits. The 8 run columns are very strongly correlated with each other, and the 8 station columns are strongly correlated with each other, but the cross-block correlations (run vs. station) are weaker. That's evidence of stable "runner type" and "lifter type" personas – exactly the signal KMeans extracts in Step 3.
## The Leakage Trap (the central design decision)

`total_time` is mathematically the sum of `run_time`, `work_time`, and `roxzone_time`. Equivalently, it's the sum of all 24 split times. Throwing those splits into the regression as features would push R² to ~1.0 – the model would just learn a near-trivial linear combination of them. The metric would look fantastic, but the model would be useless for any practical prediction.
The fix: use the splits only for clustering, then feed the resulting cluster ID back as a single, summary feature. That way the supervised models still benefit from the rich information embedded in the split structure, but they never see anything that would let them reverse-engineer the target.
## Step 1 – Baseline Regression

Before adding any clever features, the project establishes a floor: a plain Linear Regression on demographics only (gender + age band + division, all one-hot encoded). The point of the baseline isn't to be good – it's to be the bar every later model has to clear.
| Metric | Value |
|---|---|
| MAE | 11.58 minutes |
| RMSE | 15.14 minutes |
| R² | 0.160 |
Demographics alone explain only 16% of finish-time variance. Half the field is within 11.6 minutes of the prediction; the other half is further. Plenty of room to improve.
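A demographics-only baseline of this shape can be sketched as follows. The data here is a synthetic stand-in, and the exact encoder settings are assumptions, not the notebook's code:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the demographic columns and finish times (seconds).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "gender":    rng.choice(["male", "female"], 200),
    "age_group": rng.choice(["18-24", "25-29", "30-34"], 200),
    "division":  rng.choice(["open", "pro"], 200),
})
y = rng.normal(85 * 60, 15 * 60, 200)

# One-hot encode the three categorical columns, then fit ordinary least squares.
baseline = Pipeline([
    ("onehot", ColumnTransformer([
        ("cats", OneHotEncoder(handle_unknown="ignore"),
         ["gender", "age_group", "division"]),
    ])),
    ("lr", LinearRegression()),
])
baseline.fit(df, y)
preds = baseline.predict(df)
```

`handle_unknown="ignore"` keeps the pipeline from crashing if a category unseen at training time shows up at prediction time.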
## Step 2 – Feature Engineering
Five engineered features are added on top of the demographic baseline:
| Feature | What it captures | How it's built |
|---|---|---|
| `age_numeric` | Continuous age signal | Midpoint of each age band – 25-29 becomes 27, 40-44 becomes 42 |
| `is_male` | Binary gender flag | 1 if `gender == 'male'`, else 0 |
| `year` | Temporal trend | Regex extracts the 4-digit year from `event_name` |
| `region` | Geographic effect, low cardinality | Continental grouping – Europe / North America / Asia / Oceania / South America / Africa / Other |
| `event_size` | Event-scale proxy | Number of athletes at each event |
The `region` feature is the most useful design choice. Raw `nationality` has 100+ unique values, almost all appearing in only a handful of rows. Grouping into seven continental buckets keeps the geographic signal without the dimensionality cost.
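A sketch of how these five features might be built in pandas. The mapping dictionaries here are abbreviated illustrations – the notebook's full lookup tables cover every age band and nationality:

```python
import pandas as pd

# Abbreviated illustrative mappings (the real tables are much larger).
AGE_MIDPOINTS = {"25-29": 27, "40-44": 42}
REGION = {"GBR": "Europe", "USA": "North America"}

def engineer(df):
    out = df.copy()
    # Midpoint of each age band -> continuous age signal.
    out["age_numeric"] = out["age_group"].map(AGE_MIDPOINTS)
    # Binary gender flag.
    out["is_male"] = (out["gender"] == "male").astype(int)
    # Regex pulls the 4-digit year out of the event name.
    out["year"] = out["event_name"].str.extract(r"(\d{4})", expand=False).astype(int)
    # Continental grouping; unseen nationalities fall into "Other".
    out["region"] = out["nationality"].map(REGION).fillna("Other")
    # Number of athletes at each event.
    out["event_size"] = out.groupby("event_name")["event_name"].transform("size")
    return out

demo = pd.DataFrame({
    "age_group":   ["25-29", "40-44"],
    "gender":      ["male", "female"],
    "event_name":  ["hyrox-london-2024", "hyrox-london-2024"],
    "nationality": ["GBR", "BRA"],
})
features = engineer(demo)
```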
## Step 3 – Clustering Athletes Into Archetypes

KMeans with k = 4 is run on the 16 split-time features (8 runs + 8 stations). Standardisation matters here – KMeans is distance-based, and we don't want runs (which take longer in absolute seconds) to dominate stations purely because of scale.
### Visualising the clusters in 2D
PCA reduces the 16-dimensional split space to two components. PC1 (the horizontal axis) captures most of the variance and reads as an "overall speed" axis. PC2 captures additional variance and reads as a "run-strong vs. lift-strong" trade-off axis.
### What the four clusters mean
After fitting, each cluster's mean total / run / work time tells you exactly which archetype it represents:
| Cluster | Athletes | Mean total | Mean run | Mean work | Archetype |
|---|---|---|---|---|---|
| 1 | 35,217 | 73.0 min | 38.6 min | 28.9 min | Elite – uniformly fast across runs and stations |
| 2 | 20,005 | 88.4 min | 50.8 min | 30.3 min | Lift-strong, run-weak – relatively fast at stations, slower runs |
| 3 | 26,579 | 93.8 min | 45.4 min | 41.0 min | Run-strong, lift-weak – relatively fast runs, slower at stations |
| 0 | 9,707 | 120.3 min | 62.7 min | 47.8 min | Recreational – uniformly slow across runs and stations |
The cluster ID becomes a single feature in the supervised models, while `cluster_distance` (Euclidean distance to the assigned centroid) is a continuous "how typical of this archetype are you" score.
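The scale-standardise-cluster pipeline and the `cluster_distance` feature can be sketched like this, on a synthetic split matrix (settings beyond k = 4 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic 16-column split matrix (8 runs + 8 stations, in seconds).
rng = np.random.default_rng(42)
splits = rng.normal(300, 60, size=(500, 16))

# Standardise first: KMeans is distance-based, and runs would otherwise
# dominate stations purely through their larger absolute durations.
scaled = StandardScaler().fit_transform(splits)

km = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_id = km.fit_predict(scaled)

# km.transform gives the distance to every centroid; the minimum along
# axis 1 is the distance to the assigned centroid, i.e. cluster_distance.
cluster_distance = km.transform(scaled).min(axis=1)
```

Both arrays then get appended to the supervised feature matrix as the only split-derived information the regressors and classifiers ever see.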
## Step 4 – Three Regression Models on the Engineered Features

With the engineered feature set in hand (demographics + the five new features + `cluster` + `cluster_distance`), three regressors are trained on the same 80% training split and evaluated on the same held-out 20% test set.
### Model A – Linear Regression
The same algorithm as the baseline, but now on the richer feature set. It assumes the target is a linear combination of the features and fits the coefficients by ordinary least squares.
### Model B – Random Forest
An ensemble of 100 decision trees. Each tree sees a random subset of rows and a random subset of features, so the trees are decorrelated; the final prediction is the average of all 100 trees. Trees automatically handle non-linearities and feature interactions.
### Model C – Gradient Boosting

200 trees trained sequentially, each one fitted to the residual errors of the ensemble built so far. Many shallow trees (`max_depth=5`) combined with a small `learning_rate=0.05` let the model fit subtle patterns gradually.
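A runnable sketch of that configuration on synthetic data (the feature matrix here is a stand-in, not the engineered set):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic features and a noisy target standing in for finish time.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=1000)

# Same 80/20 protocol as the write-up.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters quoted above: 200 shallow trees, depth 5, learning rate 0.05.
gbr = GradientBoostingRegressor(
    n_estimators=200, max_depth=5, learning_rate=0.05, random_state=42)
gbr.fit(X_tr, y_tr)
r2 = gbr.score(X_te, y_te)  # held-out R²
```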
### Comparison

| Model | MAE (min) | RMSE (min) | R² |
|---|---|---|---|
| Baseline LR (demographics only) | 11.58 | 15.14 | 0.160 |
| Linear Regression (engineered) | 9.10 | 11.82 | 0.488 |
| Random Forest | 5.30 | 6.77 | 0.832 |
| Gradient Boosting – winner | 5.35 | 6.74 | 0.834 |
Winning regressor: Gradient Boosting, with R² = 0.834 – about five times the baseline. MAE is more than halved, dropping from 11.58 minutes (baseline) to 5.35 minutes. Note that Random Forest and Gradient Boosting are essentially tied on this dataset; GB edges ahead by a fraction of a point on R² and RMSE, while RF actually has a marginally lower MAE.

The biggest single jump in the table is from R² = 0.160 to 0.488 – the same Linear Regression algorithm, just with the engineered features and the cluster ID. That's the engineering doing its work; the additional gain to 0.83 comes from non-linear interactions the tree models can capture but the linear model can't.
### Which features matter most in the winning regressor?
The feature importance plot tells the whole story of why the engineering worked:
- `cluster` – by far the most important feature, with importance several times the next-largest.
- `cluster_distance` – second.
- `division_open`, `region_Other`, `event_size`, `age_numeric`, `gender_male`, `is_male`, `year` – the next tier.
The two cluster-derived features alone account for the lion's share of the model's predictive power. That's strong evidence the leakage-avoidance design works: by routing the rich split information through a clustering step, we extract the predictive signal without letting the regressor reverse-engineer the target.
## Step 5 – Reframing as Classification

The continuous target is reframed as a binary one by splitting `total_time` at the field median (84.9 minutes). This produces perfectly balanced classes (50/50): "fast" (below median) and "slow" (at or above median). The balance matters because it means raw accuracy is meaningful, no resampling is needed, and no class-weight tricks are required.
### Precision vs. recall – which matters more?
For this dataset, the costs of the two error types are roughly symmetric, so F1 on the slow class is the primary selection metric.
That said, in a coaching use case (flagging athletes who are likely to struggle) the costs become asymmetric: missing a struggler (false negative) defeats the purpose, while a false alarm (false positive) just means giving extra attention to someone who didn't strictly need it. False negatives are the more critical error in that framing, and the appropriate metric becomes recall on the slow class.
## Step 6 – Three Classification Models

Same engineered feature set as the regression. Same 80/20 split, but the target is now `is_slow` (1 if at or above median, else 0) and `train_test_split` uses `stratify=y_class` to preserve the 50/50 balance.
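The median split and stratified split can be sketched as follows (synthetic times; the real field median is 84.9 minutes):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic finish times in minutes, standing in for the real column.
rng = np.random.default_rng(42)
total_time = pd.Series(rng.normal(85, 15, 2000))

# Binary target: 1 = slow (at or above the field median), 0 = fast.
median = total_time.median()
y_class = (total_time >= median).astype(int)

X = rng.normal(size=(2000, 5))  # placeholder feature matrix

# stratify=y_class preserves the 50/50 balance in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_class, test_size=0.2, random_state=42, stratify=y_class)
```

Splitting a continuous variable at its own median guarantees the 50/50 balance by construction, which is why no resampling is needed downstream.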
### Classifier A – Logistic Regression
Predicts the log-odds of class 1 as a weighted sum of the (scaled) input features and squashes through a sigmoid to produce a probability. Fast, interpretable, well-calibrated probabilities.
### Classifier B – Random Forest Classifier
Same architecture as the Random Forest regressor, but each leaf now votes for class 0 or class 1 instead of predicting a continuous value. 200 trees instead of 100.
### Classifier C – Gradient Boosting Classifier
Boosting variant for binary classification. Same hyperparameters as the GBM regressor (200 trees, depth 5, lr 0.05).
### Comparison
| Model | Precision (slow) | Recall (slow) | F1 (slow) | Accuracy |
|---|---|---|---|---|
| Logistic Regression | 0.817 | 0.787 | 0.802 | 0.805 |
| Random Forest | 0.857 | 0.930 | 0.892 | 0.887 |
| Gradient Boosting – winner | 0.853 | 0.934 | 0.892 | 0.887 |
Winning classifier: Gradient Boosting, with F1 = 0.892 on the slow class and 88.7% overall accuracy. As with the regression, Random Forest and Gradient Boosting are nearly tied – both crush Logistic Regression on this task by ~9 F1 points, again because the cluster feature interacts non-linearly with division and age in ways linear models can't capture.

The GB classifier catches 93.4% of slow finishers (recall) at 85.3% precision – a strong recall-leaning balance, which is exactly what the coaching use case wants.
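For reference, the slow-class metrics quoted above are the standard scikit-learn ones computed with `pos_label=1`. A toy example with illustrative labels (not the notebook's results):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy ground truth and predictions: class 1 = slow, class 0 = fast.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# Precision: of everyone predicted slow, how many really are? (3 of 4)
precision = precision_score(y_true, y_pred, pos_label=1)
# Recall: of everyone truly slow, how many did we catch? (3 of 4)
recall = recall_score(y_true, y_pred, pos_label=1)
# F1: harmonic mean of the two.
f1 = f1_score(y_true, y_pred, pos_label=1)
```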
## Step 7 – The Demo Models

The headline regressor uses two cluster-based features (`cluster`, `cluster_distance`) derived from a runner's past race splits. A first-time competitor doesn't have those splits yet, so the live Space couldn't ask for them in a form. Two demo models are trained on a slimmer feature set – demographics + event metadata only, no cluster information:
- A Gradient Boosting regressor for the predicted finish time. Without cluster info, MAE rises from 5.35 to 11.09 minutes and R² drops from 0.834 to 0.228 – that's the price of usability for the live demo.
- A Logistic Regression classifier trained directly on the binary target `total_time < 90 min`. The class balance is roughly 63/37 (most athletes finish under 90 min in this dataset), and the model achieves 67% accuracy with F1 = 0.76 on the under-90 class – a reasonable probability source for the live demo.
This is exactly the kind of "production vs. analytical model" decision real ML teams make: there's an offline model that uses everything you know about the user, and a real-time model that uses only the inputs you can collect at request time.
## Files in This Repository

| File | What it is |
|---|---|
| `hyrox_assignment.ipynb` | The full notebook – EDA, modelling, and evaluation, end to end. |
| `hyrox_notebook_walkthrough.pdf` | Cell-by-cell explanation of the notebook. |
| `hyrox_regressor.pkl` | The pickled winning regression model (Gradient Boosting, uses cluster features). |
| `hyrox_regressor_features.pkl` | The list of feature columns the regressor expects. |
| `hyrox_classifier.pkl` | The pickled winning classifier (Gradient Boosting). |
| `hyrox_classifier_features.pkl` | The list of feature columns the classifier expects. |
| `plots/` | All 11 figures: 10 PNGs + the fatigue GIF. |
| `app.py` + `requirements.txt` | The Gradio Space app. |
| `hyrox_demo_*.pkl` | Five smaller demo artifacts used by the live Space (no cluster features). |
## How to Use the Saved Models

```python
import pickle

import pandas as pd

with open("hyrox_regressor.pkl", "rb") as f:
    regressor = pickle.load(f)
with open("hyrox_regressor_features.pkl", "rb") as f:
    feature_columns = pickle.load(f)

# Build a single-row DataFrame with the same engineered columns the model was
# trained on, then reindex to guarantee the column order matches.
new_athlete = pd.DataFrame([{
    "age_numeric": 27,
    "is_male": 1,
    "year": 2024,
    "event_size": 1500,
    "cluster": 1,            # 1 = elite, 2 = lift-strong, 3 = run-strong, 0 = recreational
    "cluster_distance": 1.4,
    # plus the one-hot columns from gender / age_group / division / region
}]).reindex(columns=feature_columns, fill_value=0)

predicted_seconds = regressor.predict(new_athlete)[0]
print(f"Predicted finish time: {predicted_seconds / 60:.1f} minutes")
```
The classifier loads the same way; just swap in `hyrox_classifier.pkl` and call `predict()` to get 0 (fast) or 1 (slow).
## Reproducibility

All randomness is seeded with `SEED = 42` (Python `random`, NumPy, `PYTHONHASHSEED`, every scikit-learn estimator). Re-running the notebook on the same data produces identical metrics and identical pickle files. The Hugging Face Space is pinned to Python 3.11 with scikit-learn 1.6.1, numpy 2.0.2, and pandas 2.2.2 – the exact environment Colab used when the pickles were created.
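The seeding described above amounts to something like the following sketch (the notebook may order these lines differently):

```python
import os
import random

import numpy as np

SEED = 42

# Hash randomisation is fixed by PYTHONHASHSEED; note it only takes effect
# for interpreters started after it is set (e.g. notebook restarts or
# subprocesses), so the notebook sets it up front.
os.environ["PYTHONHASHSEED"] = str(SEED)

random.seed(SEED)     # Python's own RNG
np.random.seed(SEED)  # NumPy's legacy global RNG

# scikit-learn estimators additionally take random_state=SEED at
# construction time, e.g. KMeans(n_clusters=4, random_state=SEED).
```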
## Live Predictor – Will I break 90 minutes?
A Gradio app running on Hugging Face Spaces lets you input demographics and event details and get back (a) an estimated total finish time and (b) the probability you'll come in under 90 minutes.
## Author

This project was completed by Michael Gelshtein as part of a Data Science course at Reichman University. It is intended for educational purposes only – the predictions are a demonstration of supervised-learning techniques on a sport dataset, not professional training advice.