Fantasy Premier League - Player Performance Prediction

End-to-end ML pipeline predicting how a Premier League player will perform in the upcoming gameweek - both as exact FPL points (regression) and as a performance tier (classification).

Try the interactive app: FPL Player Predictor on HuggingFace Spaces

📹 Video walkthrough:


Overview

Question: Can we predict a Premier League player's FPL points for the next gameweek using only pre-match information?

Two framings:

  • Regression - predict the exact points (0, 1, 2, ..., 20+)
  • Classification - predict the performance tier (Blank / Decent / Good / Haul)

Pipeline:

  1. Clean 8 seasons of player-gameweek data
  2. EDA across three research questions
  3. Feature engineering with rolling form, fixture context, and clustering archetypes
  4. Train and compare four regression models
  5. Reframe as 4-class classification, train and compare three classifiers
  6. Export winners as pickled artifacts

Dataset

| Field | Value |
|---|---|
| Source | Kaggle - reevebarreto/fantasy-premier-league-player-data-2016-2024 |
| Rows (raw) | 184,646 player-gameweek records |
| Columns (raw) | 37 |
| Span | 8 Premier League seasons (2016/17 - 2023/24) |
| Players | 1,300+ unique |
| Teams | 33 across all seasons |
| Target | points (FPL points scored in that gameweek) |
| Modeling dataset (after FE) | 178,808 rows × 65 columns |

EDA

Points distribution


58% of all rows are zero points - but only about 20% of those zeros come from players who actually played and scored zero. The other 80% are players who didn't appear at all. Most of the dataset is non-appearances, not bad performances. This shaped the modeling approach: any useful model needs to handle the "will they play" signal first - only then can the form and fixture features do their work.

Q1 - Consistency vs Ceiling

Q1 consistency vs ceiling

Players who score hauls also carry higher variance. Broadly there are two types: ceiling players who score big then disappear, and consistent players who put up reliable numbers without the explosive weeks. The chart shows this as a near-linear positive relationship between mean and standard deviation - the higher the ceiling, the bumpier the ride. For FPL strategy, this means there's no free lunch: you can't have a high-scoring player who is also low-variance. The choice is between ceiling and consistency, not both.

Q2 - Mean Reversion After Hauls

Q2 mean reversion

A player who just hauled might seem like they've run out of steam, and the popular FPL instinct is to "sell high." The data refutes that. Post-haul performance averages 3.89 / 3.81 / 4.03 points across the next 3 gameweeks - all at or slightly above haulers' own career baseline of 3.78, and roughly a full point above the population average of 2.93. Hauls are evidence of strong recent form, not a signal of impending regression. Players in good form tend to stay in good form, which is exactly what justifies including lagged points and rolling averages as features in the predictive models.
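
This check is straightforward to reproduce. A minimal sketch, assuming the cleaned frame `df` has `name`, `season`, `gameweek`, and `points` columns (column names are assumptions; the notebook's exact names may differ):

```python
import pandas as pd

# Look k gameweeks AHEAD of each row, per player-season
df = df.sort_values(['name', 'season', 'gameweek'])
future = df.groupby(['name', 'season'])['points']
for k in (1, 2, 3):
    df[f'points_next_{k}'] = future.shift(-k)

# Average points in the 3 gameweeks following a 10+ haul
hauls = df[df['points'] >= 10]
print(hauls[['points_next_1', 'points_next_2', 'points_next_3']].mean())
```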

Q3 - Hidden Difficulty

Q3 hidden difficulty

A team's overall table position is a solid predictor of how hard they are to play against - the linear fit confirms this, and most teams cluster close to the expected line. There's no widespread pattern of "hidden difficulty," but a few outliers do break the trend. Nottingham Forest is the clearest case of a low-table team punching above its weight defensively - ranked around 17th on average, they limit opposing FPL output far more than that rank would suggest. Manchester City show the same effect at the other end of the table: already an elite opponent, and even tougher than their rank predicts. Norwich City are the opposite extreme, conceding more than even their bottom-of-table rank would predict. The takeaway is that opponent rank alone is a strong proxy for fixture difficulty, but team-specific corrections could capture the few outliers.


Feature engineering

The baseline linear model worked like a robot - it took dry static stats (price, position, gameweek) and hoped for the best. Football doesn't work that way. There's a momentum effect, a home/away effect, the strength of who you're playing, the role you've been given - a lot of things shape how a player will perform on any given gameweek. The engineered features target exactly that gap, in three families: rolling form (how the player has been playing lately), fixture context (who they're up against), and player archetype (what kind of player they are, learned from clustering).

Rolling form (per player, within season)

  • points_rolling_3, points_rolling_5, points_rolling_10 - rolling mean of points
  • minutes_played_rolling_3/5/10 - rolling mean of minutes
  • bps_rolling_3/5/10 - rolling mean of bonus points system score
  • points_rolling_std_10 - volatility (window=10, strict)
  • points_lag_1, minutes_lag_1 - last gameweek's stats
  • had_haul_last_3 - boolean flag

Fixture context

  • team_strength - team's rolling-5 average of player points (filtered to minutes_played > 0)
  • opponent_strength - same metric for opponent
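
A sketch of how such a metric can be built, assuming `team`, `season`, `gameweek`, `minutes_played`, and `points` columns (names are assumptions; the notebook's exact implementation may differ):

```python
# Per-team, per-gameweek average points of players who actually took the pitch
team_gw = (
    df[df['minutes_played'] > 0]
    .groupby(['team', 'season', 'gameweek'])['points']
    .mean()
    .rename('xi_avg_points')
    .reset_index()
)

# Rolling 5-gameweek form, shifted so the current gameweek never sees itself
team_gw['team_strength'] = team_gw.groupby(['team', 'season'])['xi_avg_points'] \
    .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())

df = df.merge(team_gw[['team', 'season', 'gameweek', 'team_strength']],
              on=['team', 'season', 'gameweek'], how='left')
# opponent_strength: the same table joined on the opponent column instead
```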

Player archetype

  • 5 K-Means clusters from player-season aggregates, one-hot encoded

Leakage prevention

All rolling features use .shift(1) before the rolling window, and the result was verified by spot-checking Bruno Fernandes' 2022/23 rows.
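
In pandas terms, the pattern looks like this - a sketch, where `min_periods` and the exact column names are assumptions:

```python
df = df.sort_values(['name', 'season', 'gameweek'])
by_player = df.groupby(['name', 'season'])

# shift(1) first, THEN roll: the window only ever sees gameweeks BEFORE the
# one being predicted, so no target information leaks into the features
for w in (3, 5, 10):
    df[f'points_rolling_{w}'] = by_player['points'].transform(
        lambda s, w=w: s.shift(1).rolling(w, min_periods=1).mean()
    )

# Lags are shifted the same way: gameweek t only sees data from t-1 and earlier
df['points_lag_1'] = by_player['points'].shift(1)
df['minutes_lag_1'] = by_player['minutes_played'].shift(1)
```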


Clustering

Elbow and silhouette

K=5 was chosen from the elbow and silhouette analysis.

| Cluster | Size | Archetype |
|---|---|---|
| 4 | 138 (2.5%) | Elite attackers |
| 0 | 630 (11.4%) | Mid-tier attackers |
| 2 | 1,540 (27.8%) | Regular defensive starters |
| 1 | 2,144 (38.7%) | Bench / rotation |
| 3 | 1,082 (19.6%) | Partial-season squad |

PCA clusters

The 5 archetypes pass the eye test for anyone who knows FPL. Elite attackers (Salah, Haaland tier), mid-tier attackers, defensive regulars, bench/rotation, and partial-season players - these are the categories any FPL manager already thinks in, just formalized. Nothing genuinely surprising came out of the clustering. The one thing worth noting is the size of the bench/rotation cluster: 38.7% of all player-seasons. Not surprising in retrospect - Premier League squads are deep and most names on a teamsheet aren't regular starters - but it's a useful reminder of how much of the dataset is non-core players, which connects back to the "will they play" signal that dominated the regression.
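
A sketch of the clustering step, assuming `player_seasons` holds one row per (player, season) and that `agg_cols` lists the per-season aggregates (the aggregate list here is illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

SEED = 42
agg_cols = ['points_mean', 'minutes_mean', 'bps_mean']  # assumed feature set

X_agg = StandardScaler().fit_transform(player_seasons[agg_cols])
km = KMeans(n_clusters=5, random_state=SEED, n_init=10)
player_seasons['cluster'] = km.fit_predict(X_agg)

# One-hot encode so each archetype becomes its own model feature
ohe = pd.get_dummies(player_seasons['cluster'], prefix='cluster')
player_seasons = player_seasons.join(ohe)
```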


Regression

Baseline

Baseline coefficients

7 static features, default LinearRegression.

| Metric | Value |
|---|---|
| Test R² | 0.112 |
| Test MAE | 1.48 |

The baseline's heavy reliance on value (coefficient 0.88, dominating everything else) shows a naive model that essentially believes "more money = more points." The R² of 0.112 proves that wrong. Price alone captures static identity - which player tier we're talking about - but it knows nothing about form, context, or the relationships between features. The takeaway is that raw stats won't get you there. The model needs engineered signals that combine information across rows (rolling form), across time (lags), and across entities (team and opponent strength) to produce predictions that actually track real performance.

Engineered models

Regression comparison

| Model | Test R² | Test MAE |
|---|---|---|
| Baseline LR | 0.112 | 1.48 |
| Engineered LR | 0.303 | 1.16 |
| Random Forest | 0.317 | 1.11 |
| HistGB (winner) | 0.318 | 1.11 |

Going from baseline LR to engineered LR added +0.19 R². Going from engineered LR all the way through HistGB added only +0.014. The story isn't model architecture - it's features. The engineered features carry mostly linear signal; non-linear interactions add little. All three engineered models converge at R² ~ 0.32, which indicates a feature ceiling rather than a model ceiling. No amount of tuning, ensembling, or model swapping pushed past it. To go higher, the project would need fundamentally new signals - confirmed starting lineups, market odds, weather - rather than more sophisticated algorithms.
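
A sketch of the winning setup, assuming `X_train`/`X_test` hold the engineered features (scikit-learn defaults throughout, matching the comparison above):

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

hgb = HistGradientBoostingRegressor(random_state=42)
hgb.fit(X_train, y_train)

pred = hgb.predict(X_test)
print('Test R²: ', r2_score(y_test, pred))              # ~0.318 in the notebook
print('Test MAE:', mean_absolute_error(y_test, pred))   # ~1.11
```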

Feature importance and winner diagnostics

Feature importance

Feature engineering paid off, with the engineered features dominating the top of the importance chart - 4 of the 5 cluster features appear in the top 15, validating the clustering work as a useful predictive input rather than just descriptive analysis. team_strength and opponent_strength also crack the top 6, confirming that fixture context carries real signal. minutes_lag_1 accounting for ~53% of importance is consistent with what we already saw in the EDA - most rows in the dataset are non-appearances, so "did the player play last week" carries enormous predictive weight. If a player played last week, there's a high probability they'll play this week, and just being on the pitch is what unlocks any chance of scoring points at all.

HistGB diagnostics

The flip side is what the diagnostics chart shows: the model predicts well in the 0-7 point range but never reaches the 10+ haul tail. Hauls are 1.9% of the data and depend on irreducibly random events (whether a shot finds the corner) that no pre-match feature can capture. Feature engineering got the model from "naive" to "useful," but the haul ceiling is a fundamental data limit, not a tuning problem.


Classification

| Class | Range | Train % |
|---|---|---|
| Blank | 0-1 points | 73.4% |
| Decent | 2-4 points | 16.4% |
| Good | 5-9 points | 8.3% |
| Haul | 10+ points | 1.9% |
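
The binning itself is a single pd.cut call - a minimal sketch, with bin edges taken from the table above:

```python
import pandas as pd

# Map raw gameweek points to the four performance tiers
bins = [float('-inf'), 1, 4, 9, float('inf')]
labels = ['Blank', 'Decent', 'Good', 'Haul']
df['tier'] = pd.cut(df['points'], bins=bins, labels=labels)
```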

Class distribution

73% of the dataset is the Blank class, which means a naive model that always predicts "Blank" would score 73% accuracy without making any real prediction. Accuracy is misleading on this kind of imbalanced data. The metrics that actually matter are macro-F1 (which treats all classes equally regardless of size) and Haul recall (what fraction of real hauls did we catch). Recall matters more than precision here because in FPL, missing a real haul is a much bigger loss than getting a false alarm: a False Negative means benching a player who explodes, while a False Positive just means captaining a player who underperforms. Hence the imbalance was handled with class_weight='balanced' (or equivalent sample_weight for HistGB), and Haul recall was chosen as the headline metric.

Models

Three classifiers, all with class weighting:

  • LogisticRegression (class_weight='balanced')
  • RandomForestClassifier (class_weight='balanced')
  • HistGradientBoostingClassifier (equivalent sample_weight)
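
A sketch of the three setups, assuming a scaled training matrix from the same pipeline as the regression (hyperparameters are defaults):

```python
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_sample_weight

logreg = LogisticRegression(class_weight='balanced', max_iter=1000)
rf = RandomForestClassifier(class_weight='balanced', random_state=42)
hgb = HistGradientBoostingClassifier(random_state=42)

logreg.fit(X_train_scaled, y_train)
rf.fit(X_train_scaled, y_train)

# As in the notebook, HistGB is rebalanced with per-row sample weights;
# 'balanced' reproduces the effect of class_weight='balanced'
weights = compute_sample_weight(class_weight='balanced', y=y_train)
hgb.fit(X_train_scaled, y_train, sample_weight=weights)
```
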
Confusion matrices

| Model | Accuracy | Macro-F1 | Haul precision | Haul recall | Haul F1 |
|---|---|---|---|---|---|
| Logistic Regression (winner) | 0.677 | 0.438 | 0.088 | 0.489 | 0.149 |
| Random Forest | 0.684 | 0.449 | 0.108 | 0.283 | 0.157 |
| HistGB | 0.666 | 0.437 | 0.088 | 0.435 | 0.147 |

All three models land on similar accuracy and macro-F1 - they're capturing roughly the same overall signal. But the goal here was Haul recall, and LogReg wins clearly with 0.489 vs RF's 0.283 and HistGB's 0.435. The FPL logic from above applies: a model that's willing to take more shots at flagging hauls is what FPL actually needs, even if more of those shots miss. LogReg flags hauls more often than the tree models do, and that's why it wins.


Files

| File | Purpose |
|---|---|
| fpl_regression_model.pkl | HistGB regressor + scaler + feature names |
| fpl_classification_model.pkl | LogReg classifier + scaler + class labels |
| df_fe.parquet | Engineered dataset (used by the interactive Space) |
| notebook.ipynb | Full notebook |
| images/ | All graphs from this README |

Challenges and lessons

The Andrew Robertson mystery. During EDA, Andrew Robertson appeared twice with different stats - 121 games as a midfielder and 181 games as a defender. At first glance this looked like two different players sharing a name. Investigating it revealed the actual issue: FPL reassigns the element ID each season, and Robertson's listed position changed across seasons, so the dataset was treating the same player as two separate entities. The fix was switching to (name, position) as a stable cross-season identifier. A small bug that, if missed, would have quietly polluted every per-player aggregate in the project.

The team strength bench-warmer confound. The first version of team_strength was an honest average of all team players' points - which dragged the metric down for any squad with a lot of rotation, because bench warmers contributing 0-1 points were counted equally with starters. The fix was filtering to minutes_played > 0 so the metric reflected what an actual matchday XI produces. Easy to miss without sanity-checking on real teams.

The R² ceiling at 0.32. All three engineered models converged at the same R², which is a clear signal that the limit was the features, not the algorithm. Before accepting that, we ran a full GridSearchCV across 27 hyperparameter combinations on the winning HistGB model - 135 model fits in total - and the tuned version landed at R² 0.3176 vs the default's 0.3178. Tuning made the model imperceptibly worse: strong evidence that there was no signal left to extract from these features. Accepting a ~3% data loss from dropping cold-start rows (early-season gameweeks where rolling features couldn't compute) was a deliberate trade-off - honest features over more rows. The instinct to throw more sophisticated models at the ceiling (XGBoost, stacking, ensembles) is strong but would have been a waste of time. The honest read was that to push past this, the project would need fundamentally new signals - confirmed lineups, market odds, weather - not more clever modeling.
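
For reference, the shape of that search - the specific grid values here are illustrative, but 27 combinations × 5 folds = 135 fits matches the run described above:

```python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {                      # 3 x 3 x 3 = 27 combinations
    'learning_rate': [0.05, 0.1, 0.2],
    'max_iter': [100, 200, 400],
    'max_depth': [None, 6, 10],
}
search = GridSearchCV(HistGradientBoostingRegressor(random_state=42),
                      param_grid, cv=5, scoring='r2', n_jobs=-1)
search.fit(X_train, y_train)        # 135 fits total
print(search.best_params_, search.best_score_)
```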

Class imbalance forced rethinking the metric. 73% of the dataset is the Blank class, so a model that always predicts Blank scores 73% accuracy without making any real prediction. That single fact changed the whole framing: accuracy is misleading, macro-F1 is more honest, and Haul recall is the metric that actually matters for FPL value. Picking the right metric was as important as picking the right model.

The haul prediction ceiling. Even the winning regression model never predicts in the 10+ haul range. This isn't a bug - it's a property of the data. Hauls depend on irreducibly random events (whether a shot finds the corner, whether a deflection lands kindly) that no pre-match feature can capture. Recognizing that early made it easier to accept the ceiling and focus on what the model can actually do well.


Extra work

  • Advanced hyperparameter tuning. GridSearchCV across 27 hyperparameter combinations on HistGB (135 model fits with 5-fold CV) to confirm the R² ceiling was a feature limit, not a model limit. Result: tuned R² = 0.3176 vs default 0.3178 - imperceptibly worse, confirming no signal was left to extract.
  • Interactive HuggingFace Space. A deployed Streamlit app that lets users pick any (player, season, gameweek) from the historical dataset and see what the model would have predicted, with a plain-English match preview, a feature contribution panel explaining what's pushing the prediction up or down, and a search link to the fixture's highlights on YouTube. Demonstrates the pickle bundles work end-to-end and lets the grader interact with the model directly. Live at allenborochin/fpl-predictor.

Reproducibility

  • All random operations seeded with SEED = 42
  • Runs end-to-end on Colab free tier (no GPU needed)
  • K-Means deterministic with random_state=SEED, n_init=10

Usage

```python
import pickle

# Each .pkl file is a bundle: the fitted model, the fitted scaler, and metadata
with open('fpl_regression_model.pkl', 'rb') as f:
    bundle = pickle.load(f)

model = bundle['model']
scaler = bundle['scaler']
features = bundle['feature_names']

# X: a DataFrame with the engineered feature columns (e.g. rows from df_fe.parquet)
X_scaled = scaler.transform(X[features])
predictions = model.predict(X_scaled)
```

Same pattern for fpl_classification_model.pkl - swap in the class labels from the bundle to map predictions back to tier names.
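
A sketch of that pattern - the key names inside the classification bundle are assumptions, so inspect bundle.keys() if they differ:

```python
import pickle

with open('fpl_classification_model.pkl', 'rb') as f:
    clf_bundle = pickle.load(f)

clf = clf_bundle['model']
clf_scaler = clf_bundle['scaler']
class_labels = clf_bundle['class_labels']  # assumed key, e.g. ['Blank', 'Decent', 'Good', 'Haul']

X_scaled = clf_scaler.transform(X[clf_bundle['feature_names']])  # assumed key
preds = clf.predict(X_scaled)

# If the classifier was trained on integer classes, map back to tier names;
# if it was trained on the strings directly, preds already holds the names.
tiers = [class_labels[p] for p in preds] if preds.dtype.kind in 'iu' else preds
```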


Author

Allen Borochin
