AI & ML interests

Soccer analytics, sports analytics, player embeddings, pitch control, action valuation, expected threat, tracking data, VAEP, Doc2Vec, entity resolution, pgvector, defensive valuation, line-breaking passes, physics-based models

Luxury Lakehouse

(Right! Luxury!) Lakehouse

"Luxury! We used to dream of serverless!"

Open-source soccer analytics platform built on Databricks Lakebase — replacing a 6-service traditional AWS pipeline with a unified lakehouse architecture that scales to zero. The Hugging Face Hub serves as the public distribution layer for models, datasets, and interactive demos.

Try it now: Full Dashboard — 12-page Streamlit app with live data from 380+ matches across 5 providers. Or explore the Gradio Demo for a quick look.


Platform Scale & Data Engineering

The infrastructure uses a Medallion architecture (Bronze → Silver → Gold) provisioned entirely via Terraform IaC, unifying multi-vendor event and tracking data into a single analytical layer.

  • 38M+ tracking frames ingested from three optical tracking providers (25 fps and 10 fps)
  • 5 distinct data sources unified: StatsBomb, Wyscout, Metrica Sports, IDSSE (Bundesliga), and SkillCorner (A-League)
  • 12 Streamlit dashboard pages deployed on Hugging Face Spaces (Docker SDK), querying Lakebase PostgreSQL via OAuth
  • 19 synced tables with Zero-ETL continuous sync from Gold Delta Lake to Lakebase PostgreSQL 17
  • 38 PostgreSQL indexes (34 btree + 4 HNSW vector indexes) for sub-10 ms OLTP queries
  • Pipeline reliability enforced through 807 unit tests (819+ with gensim) and 381 dbt data tests
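
The HNSW vector indexes listed above speed up approximate nearest-neighbour lookups over embedding vectors. As a rough illustration of the ranking such a lookup approximates, the sketch below runs an exact, brute-force cosine-distance search in plain Python. The player labels and 4-dimensional vectors are invented for the example (the published behavioral embeddings are 32-dimensional), and the use of cosine distance is an assumption, since the card does not say which distance the indexes are built on.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def nearest_players(query, embeddings, k=2):
    """Exact k-nearest-neighbour search; an HNSW index approximates this ranking."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_distance(query, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy embeddings, not taken from the published datasets.
EMBEDDINGS = {
    "playmaker_a": [0.9, 0.1, 0.0, 0.2],
    "playmaker_b": [0.8, 0.2, 0.1, 0.1],
    "centre_back": [0.0, 0.9, 0.4, 0.0],
    "goalkeeper": [0.0, 0.1, 0.0, 0.9],
}

print(nearest_players([1.0, 0.1, 0.0, 0.1], EMBEDDINGS, k=2))
```

A brute-force scan costs one distance computation per stored vector, which is the cost an approximate index is meant to avoid at query time.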

The Hugging Face Footprint

All public artifacts are hosted within the Hugging Face ecosystem.

Models

| Model | Architecture | Scale |
|---|---|---|
| football2vec-statsbomb-wyscout | Doc2Vec (PV-DM), 32-dim behavioral embeddings | 87K per-match vectors across 8,950 players from ~3,000 matches |
| xg-model-statsbomb-wyscout | Calibrated XGBoost + logistic baseline (13 features) | Trained on ~131K shots, ROC-AUC 0.979 on held-out test set |
| vaep-model-statsbomb-wyscout | 2× XGBClassifier (P(scores) + P(concedes)) | Trained on ~2,388 matches from StatsBomb + Wyscout |
| xg-v2-model-set-encoder | Deep Sets (Zaheer et al. 2017) + MC dropout (Gal & Ghahramani 2016) | ROC-AUC 0.915, trained on ~131K shots with 360 freeze frames |
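
The xg-v2-model-set-encoder entry combines two ideas: a Deep Sets encoder, whose output does not depend on the order of the players in a freeze frame, and MC dropout, which keeps dropout active at inference to estimate uncertainty. The sketch below shows both in plain Python. The layer sizes, the weights in `PARAMS`, the two-feature player encoding in `FRAME`, and the dropout rate are all hypothetical and are not the published model's configuration.

```python
import math
import random


def linear(weights, bias, vector):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def dropout(vector, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


def shot_logit(players, params, rate=0.0, rng=None):
    # phi: encode every visible player independently.
    encoded = [relu(linear(params["phi_w"], params["phi_b"], p)) for p in players]
    # Sum pooling makes the result independent of player order.
    pooled = [sum(column) for column in zip(*encoded)]
    if rng is not None:
        pooled = dropout(pooled, rate, rng)
    # rho: map the pooled representation to a single logit.
    return linear(params["rho_w"], params["rho_b"], pooled)[0]


def mc_dropout_xg(players, params, rate=0.2, passes=200, seed=0):
    """Mean and standard deviation of xG over stochastic forward passes."""
    rng = random.Random(seed)
    samples = []
    for _ in range(passes):
        logit = shot_logit(players, params, rate, rng)
        samples.append(1.0 / (1.0 + math.exp(-logit)))
    mean = sum(samples) / passes
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / passes)
    return mean, std


# Hypothetical weights and a three-player frame of (dx, dy) offsets to the shooter.
PARAMS = {
    "phi_w": [[0.5, -0.2], [-0.3, 0.8], [0.1, 0.4]],
    "phi_b": [0.0, 0.1, -0.1],
    "rho_w": [[0.6, -0.4, 0.3]],
    "rho_b": [-1.0],
}
FRAME = [[1.0, 0.5], [-0.5, 2.0], [3.0, -1.0]]

mean_xg, std_xg = mc_dropout_xg(FRAME, PARAMS)
print(f"xG = {mean_xg:.3f} +/- {std_xg:.3f}")
```

Because only list arithmetic is involved, the same forward pass could in principle run from exported weights without a deep learning framework, which is the motivation the card gives for pure-NumPy inference.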

All model serialization uses JSON envelopes — zero pickle files (banned by project security policy).
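
As an illustration of what a pickle-free JSON envelope can look like, the sketch below wraps plain-list weights in a JSON document with a checksum and verifies it on load. The field names (`format`, `model`, `version`, `sha256`, `payload`) and both function names are hypothetical; the project's actual envelope schema is not shown on this card.

```python
import hashlib
import json


def _digest(payload):
    """SHA-256 over a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dump_envelope(weights, model_name, version):
    """Serialize weights (nested lists and dicts of numbers) as a JSON envelope."""
    envelope = {
        "format": "json-envelope-sketch",  # hypothetical format tag
        "model": model_name,
        "version": version,
        "sha256": _digest(weights),
        "payload": weights,
    }
    return json.dumps(envelope, sort_keys=True)


def load_envelope(text):
    """Parse an envelope and reject it if the payload checksum does not match."""
    envelope = json.loads(text)
    if _digest(envelope["payload"]) != envelope["sha256"]:
        raise ValueError("payload checksum mismatch")
    return envelope["payload"]


weights = {"coef": [0.5, -1.25, 3.0], "intercept": -0.75}
text = dump_envelope(weights, "toy-logistic", "1")
restored = load_envelope(text)
```

Unlike unpickling, `json.loads` only builds plain data structures and cannot execute code, which is the usual reason for preferring it when loading untrusted artifacts.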

Datasets

| Dataset | Scale | Description |
|---|---|---|
| spadl-vaep-action-values | ~9.5M actions | Per-action offensive/defensive VAEP valuations |
| line-breaking-passes | ~5M passes | All passes with defensive line-breaking labels via Ward clustering on 360 freeze frames |
| football2vec-player-embeddings | 87K vectors | Pre-computed behavioral (32-d) + statistical (13-d) player vectors |
| pitch-control-tracking | 38M frames | Per-player per-frame Spearman (2017) physics-based pitch control |
| expected-threat-grids | 12×8 grid | Data-driven Expected Threat values computed from 2.2M SPADL actions |
| obso-pausa-inputs | 7 matches | ELASTIC-synced event-tracking inputs for OBSO/PAUSA computation |
| obso-pausa-values | ~3,500 passes | PAUSA pass timing scores with OBSO temporal/spatial decomposition |
| obso-trained-grids | 8 competitions + global | Data-driven ball reachability (100×64) + EPV (50×32) grids for OBSO |
| xg-freeze-frame-data | 137K player rows | StatsBomb 360 freeze-frame player positions for xG v2 set encoder |
| xg-shot-data | 131K shots | Tabular shot features from StatsBomb + Wyscout for xG model training |
| space-creation-values | 875K player-frames | Per-player space creation/destruction via differential OBSO (Fernandez & Bornn 2018) |
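
For readers unfamiliar with the offensive/defensive split in spadl-vaep-action-values, the sketch below follows the definition in Decroos et al. (2019): an action's value is the change it causes in the probability of scoring minus the change in the probability of conceding. The `StateProbs` class and `vaep_value` function are hypothetical names, the probabilities are made up, and the sketch assumes the same team is in possession before and after the action (the full definition swaps the two probabilities when possession changes).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateProbs:
    """Model outputs for one game state, from the acting team's perspective."""
    scores: float    # P(team scores in the near future)
    concedes: float  # P(team concedes in the near future)


def vaep_value(before, after):
    """Split an action's value into offensive and defensive parts."""
    offensive = after.scores - before.scores
    # A drop in conceding probability is a positive defensive contribution.
    defensive = -(after.concedes - before.concedes)
    return {
        "offensive": offensive,
        "defensive": defensive,
        "vaep": offensive + defensive,
    }


# Made-up probabilities for a pass that moves the ball into a better position.
value = vaep_value(StateProbs(0.02, 0.010), StateProbs(0.05, 0.008))
print(value)
```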

Interactive Spaces

| Space | What it is |
|---|---|
| Soccer Analytics App | Full 12-page Streamlit dashboard (Docker SDK) querying Lakebase PostgreSQL via OAuth. Live data from 380+ matches. Shot maps, pass networks, player comparison, pitch control, PAUSA pass timing, DEFCON defensive pressure, and more. |
| Soccer Analytics Demo | Lightweight 6-tab Gradio explorer with pre-cached Parquet data. No database dependency, so it loads instantly for quick exploration. |

Compute & Bidirectional Sync

While Databricks handles core data engineering, we use HF Jobs for workloads where a serverless Python environment is the right tool.

Examples:

  • Expected Threat grids are computed by a CPU-based HF Jobs pipeline that downloads SPADL data from an HF Dataset, runs Markov chain value iteration, and publishes the xT grids back to the Hub.
  • The xG v2 neural model trains on an A10G GPU via HF Jobs. It is a Deep Sets architecture with MC dropout that processes 131K shots with 360 freeze-frame context and exports pure-NumPy weights for serverless inference.
  • Space Creation computes per-player counterfactual pitch control surfaces on an A10G via a JAX double-vmap, producing 875K player-frame values across 40K frames in under 6 minutes.
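
The Markov chain value iteration behind the Expected Threat example can be sketched in a few lines. Following Singh (2018), the threat of a zone is the probability of shooting times the probability of scoring from there, plus the probability of moving the ball times the expected threat of the destination zone. The three-zone pitch and all probabilities below are invented for illustration; the published grids use 12×8 zones estimated from SPADL actions.

```python
def expected_threat(shoot_prob, goal_prob, move_prob, transition,
                    max_iterations=500, tol=1e-9):
    """Iterate xT(z) = s_z * g_z + m_z * sum_t T[z][t] * xT(t) to a fixed point."""
    n = len(shoot_prob)
    xt = [0.0] * n
    for _ in range(max_iterations):
        updated = [
            shoot_prob[z] * goal_prob[z]
            + move_prob[z] * sum(transition[z][t] * xt[t] for t in range(n))
            for z in range(n)
        ]
        delta = max(abs(new - old) for new, old in zip(updated, xt))
        xt = updated
        if delta < tol:
            break
    return xt


# Toy pitch with three zones: defensive, midfield, attacking (all values made up).
SHOOT = [0.0, 0.1, 0.4]   # P(shoot | ball in zone)
GOAL = [0.0, 0.05, 0.2]   # P(goal | shot from zone)
MOVE = [1.0, 0.9, 0.6]    # P(move | ball in zone), equal to 1 - SHOOT
# T[z][t]: P(a move from z reaches t); rows sum to less than 1 because moves can fail.
TRANSITION = [
    [0.3, 0.6, 0.0],
    [0.2, 0.3, 0.4],
    [0.0, 0.3, 0.5],
]

xt = expected_threat(SHOOT, GOAL, MOVE, TRANSITION)
print([round(v, 4) for v in xt])
```

Starting from all zeros, each pass adds the threat reachable through one more move, so the iteration converges as long as some probability mass is lost to shots and failed moves at every step.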

All HF Jobs scripts use PEP 723 inline script metadata for zero-setup reproducibility.

Model weights published to HF Hub are synced back to Databricks UC Volumes for inference in the production Streamlit app. This creates a bidirectional flow: Databricks produces training data → HF Hub hosts artifacts → Databricks consumes model weights for scoring.

Academic Foundations

Every analytics module is grounded in peer-reviewed research, cited directly in the platform UI:

| Module | Foundation |
|---|---|
| Pitch Control | Spearman, "Beyond Expected Goals" (2017) |
| Expected Threat | Karun Singh (2018), Markov chain value iteration |
| VAEP | Decroos et al., "Actions Speak Louder than Goals" (2019) |
| DEFCON | Kim et al., defensive contribution framework (2025) |
| Player Embeddings | Le & Mikolov, Doc2Vec (2014); Theiner et al., football2vec (2022) |
| Line-Breaking | Ward clustering on StatsBomb 360 freeze frames; adapted from Parma Calcio 1913 |
| xG Model | Rathke, "An examination of expected goals" (2017); XGBoost with isotonic calibration |
| PAUSA | Lee et al., "Valuing La Pausa: Quantifying Optimal Pass Timing Beyond Speed" (2026) |
| Space Creation | Fernandez & Bornn, "Wide Open Spaces" (2018), differential OBSO integration |
| xG v2 Set Encoder | Zaheer et al., "Deep Sets" (NeurIPS 2017); Gal & Ghahramani, "Dropout as Bayesian Approximation" (ICML 2016) |
| Pass Networks | Pena & Touchette, "A network theory analysis of football strategies" (2012) |

Engineering Quality

The platform maintains professional-grade engineering standards:

  • Security: OAuth M2M everywhere, HTTPS-only, zero secrets in code, input validation on all identifiers, SSL verification enforced, JSON-only model serialization
  • Type safety: Pyright basic mode, Pydantic models for configuration
  • Testing: 807 pytest unit tests (819+ with gensim, including performance benchmarks), 381 dbt data quality tests
  • CI/CD: GitHub Actions with OIDC federation (zero-secret CI), ruff linting, pre-commit hooks
  • UX discipline: 71 of 78 findings resolved across two cognitive interface audits (CHI-AUDIT-180, CHI-AUDIT-190), grounded in 15 HCI frameworks (Norman, Sweller, Gergle, Kahneman, Cleveland & McGill, and others). Every metric has a help tooltip, every page has academic citations, and every analytics term is defined in a context-sensitive glossary (Streamlit and HF Space)

Links

Named after Monty Python's Four Yorkshiremen sketch, where each comedian one-ups the others about how deprived their childhood was. In data engineering, moving from hand-managed EC2 instances and 5-hop Reverse ETL pipelines to serverless Lakebase truly is... right luxury.