(Right! Luxury!) Lakehouse
12-page soccer analytics dashboard on Lakebase
Soccer analytics, sports analytics, player embeddings, pitch control, action valuation, expected threat, tracking data, VAEP, Doc2Vec, entity resolution, pgvector, defensive valuation, line-breaking passes, physics-based models
"Luxury! We used to dream of serverless!"
Open-source soccer analytics platform built on Databricks Lakebase — replacing a 6-service traditional AWS pipeline with a unified lakehouse architecture that scales to zero. The Hugging Face Hub serves as the public distribution layer for models, datasets, and interactive demos.
Try it now: Full Dashboard — 12-page Streamlit app with live data from 380+ matches across 5 providers. Or explore the Gradio Demo for a quick look.
The infrastructure uses a Medallion architecture (Bronze → Silver → Gold) provisioned entirely via Terraform IaC, unifying multi-vendor event and tracking data into a single analytical layer.
All public artifacts are hosted entirely within the HF ecosystem.
| Model | Architecture | Scale |
|---|---|---|
| football2vec-statsbomb-wyscout | Doc2Vec (PV-DM) 32-dim behavioral embeddings | 87K per-match vectors across 8,950 players from ~3,000 matches |
| xg-model-statsbomb-wyscout | Calibrated XGBoost + logistic baseline (13 features) | Trained on ~131K shots, ROC-AUC 0.979 on held-out test set |
| vaep-model-statsbomb-wyscout | 2× XGBClassifier (P(scores) + P(concedes)) | Trained on ~2,388 matches from StatsBomb + Wyscout |
| xg-v2-model-set-encoder | Deep Sets (Zaheer et al. 2017) + MC dropout (Gal & Ghahramani 2016) | ROC-AUC 0.915, trained on ~131K shots with 360 freeze frames |
All model serialization uses JSON envelopes — zero pickle files (banned by project security policy).
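As a rough illustration of the pickle-free approach, the sketch below wraps model parameters in a JSON envelope with a checksum, using only the standard library. The field names (`schema_version`, `model_type`, `sha256`, `payload`), the helper functions, and the coefficients are hypothetical and not taken from the project; the real envelope layout may differ.

```python
import hashlib
import json


def _digest(payload):
    # Canonical JSON (sorted keys) so the checksum is stable across runs.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_envelope(model_type, payload, schema_version=1):
    """Serialize parameters into a JSON string; no pickle involved."""
    return json.dumps({
        "schema_version": schema_version,  # hypothetical field
        "model_type": model_type,          # hypothetical field
        "sha256": _digest(payload),        # hypothetical field
        "payload": payload,
    })


def from_envelope(text):
    """Parse an envelope and reject it if the payload was altered."""
    envelope = json.loads(text)
    if _digest(envelope["payload"]) != envelope["sha256"]:
        raise ValueError("payload checksum mismatch")
    return envelope["model_type"], envelope["payload"]


# Made-up logistic-baseline coefficients, for illustration only.
params = {"intercept": -1.2, "coef": {"distance": -0.09, "angle": 1.4}}
text = to_envelope("logistic_xg_baseline", params)
model_type, restored = from_envelope(text)
```

Loading such a file only ever parses data with `json.loads`, whereas unpickling can execute arbitrary code, which is the usual motivation for a no-pickle policy.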
| Dataset | Scale | Description |
|---|---|---|
| spadl-vaep-action-values | ~9.5M actions | Per-action offensive/defensive VAEP valuations |
| line-breaking-passes | ~5M passes | All passes with defensive line-breaking labels via Ward clustering on 360 freeze frames |
| football2vec-player-embeddings | 87K vectors | Pre-computed behavioral (32-d) + statistical (13-d) player vectors |
| pitch-control-tracking | 38M frames | Per-player per-frame Spearman (2017) physics-based pitch control |
| expected-threat-grids | 12×8 grid | Data-driven Expected Threat values computed from 2.2M SPADL actions |
| obso-pausa-inputs | 7 matches | ELASTIC-synced event-tracking inputs for OBSO/PAUSA computation |
| obso-pausa-values | ~3,500 passes | PAUSA pass timing scores with OBSO temporal/spatial decomposition |
| obso-trained-grids | 8 competitions + global | Data-driven ball reachability (100×64) + EPV (50×32) grids for OBSO |
| xg-freeze-frame-data | 137K player rows | StatsBomb 360 freeze-frame player positions for xG v2 set encoder |
| xg-shot-data | 131K shots | Tabular shot features from StatsBomb + Wyscout for xG model training |
| space-creation-values | 875K player-frames | Per-player space creation/destruction via differential OBSO (Fernandez & Bornn 2018) |
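The offensive/defensive split in `spadl-vaep-action-values` presumably follows the VAEP definition from Decroos et al.: an action is valued by the change in scoring probability minus the change in conceding probability. A minimal sketch, assuming the same team is in possession before and after the action; the function name and the probabilities are illustrative and not taken from the dataset.

```python
def vaep_value(p_scores_before, p_concedes_before,
               p_scores_after, p_concedes_after):
    """Return (offensive, defensive, total) value of one action."""
    # Offensive value: how much the action raised the chance of scoring.
    offensive = p_scores_after - p_scores_before
    # Defensive value: how much the action lowered the chance of conceding.
    defensive = -(p_concedes_after - p_concedes_before)
    return offensive, defensive, offensive + defensive


# Made-up probabilities for a progressive pass.
offensive, defensive, total = vaep_value(0.02, 0.010, 0.05, 0.008)
```

In a full pipeline, the four probabilities would come from the two classifiers (P(scores) and P(concedes)) evaluated on the game state before and after each action.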
| Space | What it is |
|---|---|
| Soccer Analytics App | Full 12-page Streamlit dashboard (Docker SDK) querying Lakebase PostgreSQL via OAuth. Live data from 380+ matches. Shot maps, pass networks, player comparison, pitch control, PAUSA pass timing, DEFCON defensive pressure, and more. |
| Soccer Analytics Demo | Lightweight 6-tab Gradio explorer with pre-cached Parquet data. No database dependency — instant load for quick exploration. |
While Databricks handles core data engineering, we use HF Jobs for workloads where a serverless Python environment is the right tool.
Examples:
- vmap: 875K player-frame values across 40K frames in under 6 minutes.

All HF Jobs scripts use PEP 723 inline script metadata for zero-setup reproducibility.
Model weights published to HF Hub are synced back to Databricks UC Volumes for inference in the production Streamlit app. This creates a bidirectional flow: Databricks produces training data → HF Hub hosts artifacts → Databricks consumes model weights for scoring.
Every analytics module is grounded in peer-reviewed research, cited directly in the platform UI:
| Module | Foundation |
|---|---|
| Pitch Control | Spearman, "Beyond Expected Goals" (2018) |
| Expected Threat | Karun Singh (2018), Markov chain value iteration |
| VAEP | Decroos et al., "Actions Speak Louder than Goals" (2019) |
| DEFCON | Kim et al., defensive contribution framework (2025) |
| Player Embeddings | Le & Mikolov, Doc2Vec (2014); Theiner et al., football2vec (2022) |
| Line-Breaking | Ward clustering on StatsBomb 360 freeze frames; adapted from Parma Calcio 1913 |
| xG Model | Rathke, "An examination of expected goals" (2017); XGBoost with isotonic calibration |
| PAUSA | Lee et al., "Valuing La Pausa: Quantifying Optimal Pass Timing Beyond Speed" (2026) |
| Space Creation | Fernandez & Bornn, "Wide Open Spaces" (2018), differential OBSO integration |
| xG v2 Set Encoder | Zaheer et al., "Deep Sets" (NeurIPS 2017); Gal & Ghahramani, "Dropout as Bayesian Approximation" (ICML 2016) |
| Pass Networks | Pena & Touchette, "A network theory analysis of football strategies" (2012) |
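To make the "Markov chain value iteration" behind Expected Threat concrete, here is a minimal sketch of the recursion described by Karun Singh: the value of a zone is the chance of scoring by shooting from it, plus the chance of moving the ball times the value of the destination zone. The three-zone pitch and every probability below are made up for illustration; the platform's own grids are 12×8 and fitted from data.

```python
def expected_threat(shoot_prob, goal_prob, move_prob, transition, n_iter=50):
    """Iterate xT(z) = s_z * g_z + m_z * sum_k T[z][k] * xT(k)."""
    n = len(shoot_prob)
    xt = [0.0] * n  # start from zero threat everywhere
    for _ in range(n_iter):
        xt = [
            shoot_prob[z] * goal_prob[z]
            + move_prob[z] * sum(transition[z][k] * xt[k] for k in range(n))
            for z in range(n)
        ]
    return xt


# Toy pitch: zone 0 = defence, 1 = midfield, 2 = attack.
shoot_prob = [0.0, 0.05, 0.30]  # P(shoot | ball in zone)
goal_prob = [0.0, 0.02, 0.15]   # P(goal | shot from zone)
move_prob = [1.0, 0.95, 0.70]   # P(move | ball in zone)
# P(ball ends in zone k | move from zone z); rows sum to less than 1
# because some moves lose possession.
transition = [
    [0.50, 0.30, 0.00],
    [0.15, 0.40, 0.25],
    [0.00, 0.30, 0.50],
]

xt = expected_threat(shoot_prob, goal_prob, move_prob, transition)
```

With these toy numbers, threat increases from defence to attack, which matches the intuition that possession closer to goal is worth more.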
The platform maintains professional-grade engineering standards:
Named after Monty Python's Four Yorkshiremen sketch, where each comedian one-ups the others about how deprived their childhood was. In data engineering, moving from hand-managed EC2 instances and 5-hop Reverse ETL pipelines to serverless Lakebase truly is... right luxury.