
NBA Sage - Technical Explanation

An AI-powered NBA game prediction system with real-time data, machine learning, and a modern web interface.


🎯 What Does This Project Do?

NBA Sage is a full-stack application that:

  1. Predicts NBA game outcomes before they happen
  2. Shows live scores with real-time updates
  3. Tracks prediction accuracy over time
  4. Calculates MVP race standings based on current stats
  5. Estimates championship odds for all 30 teams

๐Ÿ† Key Features

Feature Description
Live Game Dashboard Real-time scores, game status, win probabilities
Win Predictions Probability % for each team to win
Starting 5 Lineups Projected starters with PPG stats from NBA API
MVP Race Top 10 MVP candidates with scores
Championship Odds All 30 teams ranked by title probability
Model Accuracy Track how well predictions perform over time

🛠️ Technology Stack

Backend (Python)

| Technology | Purpose |
|---|---|
| Flask | REST API framework |
| nba_api | Official NBA data (stats.nba.com) |
| XGBoost + LightGBM | Machine learning ensemble model |
| APScheduler | Background job scheduling |
| ChromaDB Cloud | Persistent prediction storage |
| Pandas/NumPy | Data processing |

Frontend (React)

| Technology | Purpose |
|---|---|
| React 18 | UI framework |
| Vite | Build tool & dev server |
| Custom CSS | Modern design system |

Infrastructure

| Technology | Purpose |
|---|---|
| Docker | Container deployment |
| Hugging Face Spaces | Cloud hosting |
| Git LFS | Large file versioning |

🔬 How Predictions Work

The Prediction Algorithm

Predictions are made using a multi-factor formula:

Win Probability = Log5 Formula of:
├── 40% - Current Season Record (Win %)
├── 30% - Recent Form (Last 10 games performance)
├── 20% - ELO Rating (Historical team strength)
└── 10% - Baseline

Adjustments Applied:
├── +3.5% for Home Court Advantage
└── -2% per Injury Impact Point
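The formula above does not spell out how the weighted factors feed into Log5, so the following is only a minimal sketch under stated assumptions: each team's strength is taken as the 40/30/20/10 weighted blend, Elo is mapped to a 0-1 scale as an expected score against a 1500-rated team, and the injury adjustment is applied symmetrically to both sides. The function names and the clamping bounds are hypothetical, not taken from the project.

```python
def team_strength(win_pct, last10_pct, elo, baseline=0.5):
    """Blend the four factors using the 40/30/20/10 weights."""
    # Assumption: Elo is rescaled to 0-1 as the expected score vs. a 1500 team.
    elo_strength = 1.0 / (1.0 + 10 ** ((1500 - elo) / 400))
    return (0.40 * win_pct + 0.30 * last10_pct
            + 0.20 * elo_strength + 0.10 * baseline)


def log5(p_a, p_b):
    """Log5: probability that A beats B, given each side's strength in (0, 1)."""
    return (p_a * (1 - p_b)) / (p_a * (1 - p_b) + p_b * (1 - p_a))


def home_win_probability(home, away, home_injury_pts=0.0, away_injury_pts=0.0):
    """Home win probability with the home-court and injury adjustments applied."""
    p = log5(team_strength(**home), team_strength(**away))
    p += 0.035                    # +3.5% home court advantage
    p -= 0.02 * home_injury_pts   # -2% per injury impact point (home side)
    p += 0.02 * away_injury_pts   # assumption: opponent injuries help the home team
    return min(max(p, 0.01), 0.99)  # hypothetical clamp to keep a valid probability


even = {"win_pct": 0.5, "last10_pct": 0.5, "elo": 1500}
print(round(home_win_probability(even, even), 3))
```

With two identical teams, Log5 yields 0.5 and only the home-court bonus separates them.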

ELO Rating System

ELO is a chess-inspired rating system adapted for NBA:

  • Starting rating: 1500 (average team)
  • K-factor: 20 (how much ratings change per game)
  • Home advantage: +100 ELO points equivalent
  • Season regression: Ratings regress 25% to mean each season

How it works:

  • Win against better team → Big ELO gain
  • Win against weaker team → Small ELO gain
  • Lose against better team → Small ELO loss
  • Lose against weaker team → Big ELO loss
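The rating behavior described above can be sketched with the standard Elo update. The constants come from the list above; the 400-point logistic scale is the conventional Elo choice and is an assumption here, as are the function names.

```python
K_FACTOR = 20            # how much ratings move per game
HOME_ADVANTAGE = 100     # Elo points credited to the home team
MEAN_RATING = 1500       # starting / average rating
SEASON_REGRESSION = 0.25 # fraction pulled back to the mean each season


def expected_home_score(home_elo, away_elo):
    """Expected result for the home team (1 = certain win, 0 = certain loss)."""
    diff = (home_elo + HOME_ADVANTAGE) - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))  # assumption: standard 400 scale


def update_elo(home_elo, away_elo, home_won):
    """Return the new (home, away) ratings after one game; the update is zero-sum."""
    expected = expected_home_score(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(elo):
    """Apply the between-season regression toward 1500."""
    return elo + SEASON_REGRESSION * (MEAN_RATING - elo)


# An upset win (1400 beats 1600) moves ratings more than an expected win.
print(update_elo(1400, 1600, home_won=True))
print(update_elo(1600, 1400, home_won=True))
```

Because the gain is proportional to how surprising the result was, beating a stronger team earns more points than beating a weaker one, which matches the four cases listed above.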

📊 Data Sources

Real-Time Data

  • NBA Live API (nba_api.live)
    • Live scores updated every 30 seconds
    • Game status (scheduled, in progress, final)
    • Box scores and player stats

Historical Data

  • NBA Stats API (nba_api.stats)
    • 23 years of game data (2003-2026)
    • Team statistics (basic, advanced, clutch, hustle)
    • Player statistics
    • Current season stats for predictions

Data Storage

  • Parquet files: Cached API responses (~140 files)
  • ChromaDB Cloud: Prediction history and accuracy tracking
  • Joblib files: Trained ML model and processed datasets

🧠 Machine Learning Components

Trained Model: XGBoost + LightGBM Ensemble

Two gradient boosting models trained on 41,000+ historical games:

Game Features ──┬──► XGBoost (50%) ───┐
                │                     ├──► Ensemble Prediction
                └──► LightGBM (50%) ──┘

Features Used:

  • ELO ratings and differentials
  • Rolling averages (5, 10, 20 game windows)
  • Rest days and back-to-back games
  • Home/away status
  • Season record statistics
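The 50/50 ensemble amounts to averaging the two models' win probabilities. The sketch below assumes each model has already produced a home-win probability (for example from a classifier's `predict_proba`); the function name and the weight parameter are hypothetical.

```python
def ensemble_probability(xgb_prob, lgbm_prob, xgb_weight=0.5):
    """Weighted average of two win probabilities; 0.5 gives the equal blend."""
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be between 0 and 1")
    return xgb_weight * xgb_prob + (1.0 - xgb_weight) * lgbm_prob


# XGBoost says 60%, LightGBM says 70%: the ensemble lands halfway between.
print(ensemble_probability(0.60, 0.70))
```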

Training Pipeline

Data Collection ──► Feature Engineering ──► Model Training ──► Evaluation
       │                     │                    │
       ▼                     ▼                    ▼
 NBA API Data         ELO Calculation     XGBoost+LightGBM
                     Era Normalization
                      Rolling Windows

Auto-Training System

The system automatically retrains itself:

  1. Ingests completed games every hour
  2. Waits for all daily games to complete
  3. Retrains a candidate model and compares its accuracy to the existing model's
  4. Only updates if improved (prevents regression)
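The "only update if improved" rule can be sketched as a small gate around retraining. This is an illustration under assumptions: `train_fn` and `evaluate_fn` are hypothetical callables standing in for the project's training and evaluation code, and games are represented as dicts with a `status` field.

```python
def should_promote(new_accuracy, current_accuracy, min_gain=0.0):
    """Promote the candidate only if it strictly beats the current model."""
    return new_accuracy > current_accuracy + min_gain


def retrain_if_ready(games_today, train_fn, evaluate_fn, current_model):
    """Return the model that should be live after today's retraining check."""
    # Step 2: wait until every game of the day has finished.
    if any(game["status"] != "final" for game in games_today):
        return current_model
    # Step 3: train a candidate and score both models on the same data.
    candidate = train_fn()
    if should_promote(evaluate_fn(candidate), evaluate_fn(current_model)):
        return candidate        # Step 4: the new model improved, so swap it in.
    return current_model        # Otherwise keep the existing model.


scores = {"old": 0.64, "better": 0.66}
finished = [{"status": "final"}, {"status": "final"}]
print(retrain_if_ready(finished, lambda: "better", scores.get, "old"))
```

Keeping the comparison in its own function makes it easy to require a minimum gain later, so that noise in the evaluation set does not trigger a swap.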

🌐 System Architecture

┌──────────────────────────────────────────────────────────────────┐
│                          React Frontend                          │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐         │
│  │ LiveGames │ │Predictions│ │ MVP Race  │ │ Accuracy  │         │
│  └─────▲─────┘ └─────▲─────┘ └─────▲─────┘ └─────▲─────┘         │
└────────┼─────────────┼─────────────┼─────────────┼───────────────┘
         │             │             │             │
         └─────────────┴──────┬──────┴─────────────┘
                              │ REST API
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                           Flask Server                           │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐      │
│  │   Endpoints    │  │    Caching     │  │   Scheduler    │      │
│  │  /api/live     │  │  In-Memory     │  │  APScheduler   │      │
│  │  /api/roster   │  │  1-hour rosters│  │  Auto-retrain  │      │
│  └────────┬───────┘  └────────────────┘  └────────────────┘      │
│           │                                                      │
│           ▼                                                      │
│  ┌────────────────────────────────────────────────────────┐      │
│  │                  Prediction Pipeline                   │      │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐    │      │
│  │  │Live Collector│ │ Feature Gen  │ │  ELO System  │    │      │
│  │  └──────────────┘ └──────────────┘ └──────────────┘    │      │
│  └────────────────────────────────────────────────────────┘      │
└──────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────────┐
│                        External Services                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐               │
│  │   NBA API   │  │  ChromaDB   │  │ Hugging Face│               │
│  │  stats.nba  │  │    Cloud    │  │   Spaces    │               │
│  └─────────────┘  └─────────────┘  └─────────────┘               │
└──────────────────────────────────────────────────────────────────┘

📁 Project Structure

NBA ML/
├── server.py              # Production server (Hugging Face)
├── api/api.py             # Development server
│
├── src/                   # Core logic
│   ├── prediction_pipeline.py   # Main orchestrator
│   ├── feature_engineering.py   # ELO + features
│   ├── data_collector.py        # Historical data
│   ├── live_data_collector.py   # Real-time data
│   ├── prediction_tracker.py    # Accuracy tracking
│   └── models/
│       └── game_predictor.py    # ML model
│
├── web/                   # React frontend
│   └── src/
│       ├── App.jsx
│       ├── pages/         # UI pages
│       └── index.css      # Design system
│
├── data/
│   └── api_data/          # 140+ parquet files
│
└── models/
    └── game_predictor.joblib  # Trained model (9.6KB)

🚀 Deployment

Local Development

# Backend
python api/api.py  # Runs on localhost:8000

# Frontend
cd web && npm run dev  # Runs on localhost:5173

Production (Hugging Face Spaces)

# Docker container
python server.py  # Serves both API + React on port 7860

📈 Performance & Accuracy

Prediction Accuracy

  • Overall: Tracked via ChromaDB Cloud
  • By Confidence: High/Medium/Low confidence splits
  • By Team: Per-team prediction accuracy
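The confidence split can be sketched as bucketing each prediction by how far its probability sits from 50% and then scoring each bucket. The thresholds (70% and 60%) and the record layout are assumptions for illustration; the source does not state the cutoffs the project uses.

```python
def confidence_bucket(home_win_prob, high=0.70, medium=0.60):
    """Label a prediction by the probability assigned to the favored team."""
    confidence = max(home_win_prob, 1.0 - home_win_prob)
    if confidence >= high:
        return "high"
    if confidence >= medium:
        return "medium"
    return "low"


def accuracy_by_confidence(records):
    """records: iterable of (predicted_home_win_prob, home_won) pairs."""
    totals = {}
    for prob, home_won in records:
        bucket = confidence_bucket(prob)
        correct = (prob >= 0.5) == bool(home_won)  # did the favored team win?
        hits, count = totals.get(bucket, (0, 0))
        totals[bucket] = (hits + int(correct), count + 1)
    return {bucket: hits / count for bucket, (hits, count) in totals.items()}


history = [(0.80, True), (0.75, False), (0.55, True), (0.35, False)]
print(accuracy_by_confidence(history))
```

A well-calibrated predictor should show higher accuracy in the high bucket than in the low one, which is what makes this split a useful sanity check.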

Speed Optimizations

  • In-memory caching: Roster data cached for 1 hour
  • Startup warming: All 30 teams pre-loaded on server start
  • Background refresh: Cache updated every 2 hours
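The roster cache can be pictured as a small in-memory store with a time-to-live. This is a minimal sketch, not the project's implementation: `TTLCache` and the `loader` callback are hypothetical names, and the injectable clock exists only to make the behavior easy to test.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}  # key -> (stored_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader(key)
        self._store[key] = (now, value)
        return value

    def warm(self, keys, loader):
        """Pre-load a set of keys, e.g. all 30 teams at server start."""
        for key in keys:
            self.get(key, loader)


roster_cache = TTLCache(ttl_seconds=3600)  # 1-hour roster TTL, as described above
print(roster_cache.get("LAL", lambda team: f"roster for {team}"))
```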

🔮 Future Improvements

  1. Integrate ML model into live predictions (currently formula-based)
  2. Add player-level features (injuries, rest days per player)
  3. Implement spread predictions (margin of victory)
  4. Add playoff predictions with series outcomes

📊 Stats at a Glance

| Metric | Value |
|---|---|
| Historical games | 41,000+ |
| Seasons covered | 23 (2003-2026) |
| Teams tracked | 30 |
| ML model type | XGBoost + LightGBM |
| API endpoints | 10+ |
| Frontend pages | 6 |

📋 Complete ML Feature List (90+ Features)

The model uses approximately 90 features organized into these categories:

1๏ธโƒฃ ELO Rating Features (5 features)

Feature Description
team_elo Team's current ELO rating
opponent_elo Opponent's current ELO rating
elo_diff Difference between team and opponent ELO
elo_win_prob Expected win probability from ELO
home_elo_boost ELO boost for home court (100 points)

2๏ธโƒฃ Basic Stats - Rolling Averages (21 features)

For each of 7 stats ร— 3 windows (5, 10, 20 games):

Base Stat Windows
PTS (Points) PTS_last5, PTS_last10, PTS_last20
AST (Assists) AST_last5, AST_last10, AST_last20
REB (Rebounds) REB_last5, REB_last10, REB_last20
FG_PCT (Field Goal %) FG_PCT_last5, FG_PCT_last10, FG_PCT_last20
FG3_PCT (3-Point %) FG3_PCT_last5, FG3_PCT_last10, FG3_PCT_last20
FT_PCT (Free Throw %) FT_PCT_last5, FT_PCT_last10, FT_PCT_last20
PLUS_MINUS (Point Diff) PLUS_MINUS_last5, PLUS_MINUS_last10, PLUS_MINUS_last20
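A rolling-window feature such as `PTS_last5` can be sketched in plain Python as follows. One assumption is made explicit: each game's value is averaged over the games strictly before it, so the feature never includes the result being predicted. The helper names are hypothetical.

```python
def rolling_mean_before(values, window):
    """For game i, average the up-to-`window` games played before game i."""
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        # The first game of a sequence has no history, so it gets None.
        out.append(sum(prior) / len(prior) if prior else None)
    return out


def rolling_features(stats, windows=(5, 10, 20)):
    """stats: {"PTS": [...], ...} in chronological order -> {"PTS_last5": [...], ...}."""
    return {
        f"{name}_last{w}": rolling_mean_before(values, w)
        for name, values in stats.items()
        for w in windows
    }


points = [100, 110, 120, 130]
print(rolling_mean_before(points, 2))  # [None, 100.0, 105.0, 115.0]
```

With 7 base stats and the three default windows, `rolling_features` produces the 21 columns listed in the table.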

3๏ธโƒฃ Season Statistics (9 features)

Feature Description
PTS_season_avg Season average points
AST_season_avg Season average assists
REB_season_avg Season average rebounds
FG_PCT_season_avg Season field goal %
FG3_PCT_season_avg Season 3-point %
FT_PCT_season_avg Season free throw %
PLUS_MINUS_season_avg Season point differential
win_pct_season Season win percentage
games_played Games played in season

4๏ธโƒฃ Defensive Features (4 features)

Feature Description
STL_last10 Steals per game (last 10)
BLK_last10 Blocks per game (last 10)
DREB_last10 Defensive rebounds (last 10)
pts_allowed_last10 Points allowed (last 10)

5๏ธโƒฃ Momentum Features (6 features)

Feature Description
wins_last5 Wins in last 5 games (0-5)
wins_last10 Wins in last 10 games (0-10)
hot_streak 1 if 4+ wins in last 5
cold_streak 1 if 1 or fewer wins in last 5
plus_minus_last5 Point differential trend
form_trend Comparison of last 3 vs previous 3
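The win-count and streak flags follow directly from the definitions in the table. The sketch below assumes results are stored oldest-first as 1 (win) or 0 (loss); the function name is hypothetical, and the point-differential features are left out because their exact formulas are not given.

```python
def momentum_features(results):
    """results: chronological list of 1 (win) / 0 (loss), most recent last."""
    wins_last5 = sum(results[-5:])
    wins_last10 = sum(results[-10:])
    return {
        "wins_last5": wins_last5,
        "wins_last10": wins_last10,
        "hot_streak": int(wins_last5 >= 4),   # 4+ wins in the last 5
        "cold_streak": int(wins_last5 <= 1),  # 1 or fewer wins in the last 5
    }


recent = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
print(momentum_features(recent))
```

Note that a team with fewer than five games played would be flagged as cold by this sketch, so early-season rows may need separate handling.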

6๏ธโƒฃ Rest & Fatigue Features (4 features)

Feature Description
days_rest Days since last game
back_to_back 1 if playing consecutive days
well_rested 1 if 3+ days rest
games_last_week Games played in last 7 days
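The rest features can be derived from a team's schedule alone. This sketch assumes `days_rest` is the calendar-day gap to the previous game (so a game yesterday gives 1) and that a team with no earlier games gets neutral flags; both choices, and the function name, are assumptions.

```python
from datetime import date


def rest_features(game_date, previous_dates):
    """Compute rest and fatigue features from the dates of earlier games."""
    earlier = [d for d in previous_dates if d < game_date]
    if not earlier:
        # No history (e.g. season opener): assumed neutral values.
        return {"days_rest": None, "back_to_back": 0,
                "well_rested": 0, "games_last_week": 0}
    days_rest = (game_date - max(earlier)).days
    return {
        "days_rest": days_rest,
        "back_to_back": int(days_rest == 1),   # played yesterday
        "well_rested": int(days_rest >= 3),    # 3+ days since last game
        "games_last_week": sum(1 for d in earlier if (game_date - d).days <= 7),
    }


schedule = [date(2025, 1, 1), date(2025, 1, 5), date(2025, 1, 9)]
print(rest_features(date(2025, 1, 10), schedule))
```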

7๏ธโƒฃ Form Index Features (3 features)

Feature Description
form_index Exponentially-weighted recent performance (0-1)
form_trend Trend direction (improving/declining)
form_plus_minus Weighted point differential
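An exponentially weighted form index can be sketched as a weighted win rate in which each older game counts a fixed fraction less than the one after it. The decay value of 0.8 and the neutral 0.5 for a team with no games are assumptions; the source only states that the index is exponentially weighted and lies between 0 and 1.

```python
def form_index(results, decay=0.8):
    """results: chronological 1 (win) / 0 (loss) list; returns a value in [0, 1]."""
    if not results:
        return 0.5  # assumption: no history is treated as neutral form
    recent_first = results[::-1]
    # Most recent game has weight 1, the one before it `decay`, then decay**2, ...
    weights = [decay ** age for age in range(len(recent_first))]
    weighted_wins = sum(w * r for w, r in zip(weights, recent_first))
    return weighted_wins / sum(weights)


# The same 1-2 record scores higher when the win is the most recent game.
print(form_index([0, 0, 1], decay=0.5))
print(form_index([1, 0, 0], decay=0.5))
```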

8๏ธโƒฃ Basic Stat Columns (17 raw features)

BASIC_STATS = [
    "PTS", "AST", "REB", "STL", "BLK", "TOV",
    "FGM", "FGA", "FG_PCT",
    "FG3M", "FG3A", "FG3_PCT",
    "FTM", "FTA", "FT_PCT",
    "OREB", "DREB"
]

9๏ธโƒฃ Advanced Team Stats (11 features)

ADVANCED_STATS = [
    "E_OFF_RATING",    # Offensive Rating
    "E_DEF_RATING",    # Defensive Rating
    "E_NET_RATING",    # Net Rating
    "E_PACE",          # Pace (possessions per game)
    "E_AST_RATIO",     # Assist Ratio
    "E_OREB_PCT",      # Offensive Rebound %
    "E_DREB_PCT",      # Defensive Rebound %
    "E_REB_PCT",       # Total Rebound %
    "E_TM_TOV_PCT",    # Team Turnover %
    "E_EFG_PCT",       # Effective FG%
    "E_TS_PCT"         # True Shooting %
]

🔟 Clutch Stats (4 features)

CLUTCH_STATS = [
    "CLUTCH_PTS",          # Points in clutch time
    "CLUTCH_FG_PCT",       # FG% in clutch
    "CLUTCH_FG3_PCT",      # 3PT% in clutch
    "CLUTCH_PLUS_MINUS"    # +/- in clutch
]

1๏ธโƒฃ1๏ธโƒฃ Hustle Stats (5 features)

HUSTLE_STATS = [
    "DEFLECTIONS",             # Passes deflected
    "LOOSE_BALLS_RECOVERED",   # Loose balls recovered
    "CHARGES_DRAWN",           # Offensive fouls drawn
    "CONTESTED_SHOTS",         # Shots contested
    "SCREEN_ASSISTS"           # Screen assists
]

1๏ธโƒฃ2๏ธโƒฃ Top Player Stats (6 features)

Feature Description
top_players_avg_pts Avg points of top 5 players
top_players_avg_ast Avg assists of top 5 players
top_players_avg_reb Avg rebounds of top 5 players
top_players_avg_stl Avg steals of top 5 players
top_players_avg_blk Avg blocks of top 5 players
star_concentration % of scoring from top player

1๏ธโƒฃ3๏ธโƒฃ Game Context (1 feature)

Feature Description
is_home 1 if home team, 0 if away

📊 Feature Summary

| Category | Feature Count |
|---|---|
| ELO Ratings | 5 |
| Rolling Averages (5/10/20) | 21 |
| Season Statistics | 9 |
| Defensive Stats | 4 |
| Momentum Features | 6 |
| Rest/Fatigue | 4 |
| Form Index | 3 |
| Advanced Team Stats | 11 |
| Clutch Stats | 4 |
| Hustle Stats | 5 |
| Top Player Stats | 6 |
| Game Context | 1 |
| TOTAL | ~79 core features |

Adding Z-score-normalized versions of the stats for era adjustment brings the total to 90+ features.
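Era adjustment via Z-scores can be sketched as standardizing each stat within its own season, so that a 110-point game in a high-scoring era is not treated the same as one in a low-scoring era. Grouping by season and using the population standard deviation are assumptions, as are the row layout and function name.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_by_season(rows, stat):
    """Return each row's `stat` as a Z-score relative to its season's distribution."""
    by_season = defaultdict(list)
    for row in rows:
        by_season[row["season"]].append(row[stat])
    # Per-season mean and (population) standard deviation.
    params = {s: (mean(v), pstdev(v)) for s, v in by_season.items()}
    scores = []
    for row in rows:
        mu, sigma = params[row["season"]]
        scores.append(0.0 if sigma == 0 else (row[stat] - mu) / sigma)
    return scores


games = [
    {"season": 2004, "PTS": 90}, {"season": 2004, "PTS": 100},
    {"season": 2024, "PTS": 110}, {"season": 2024, "PTS": 120},
]
print(zscore_by_season(games, "PTS"))  # [-1.0, 1.0, -1.0, 1.0]
```

In the example, 100 points in 2004 and 120 points in 2024 receive the same normalized value because each is one standard deviation above its season's average.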


Built with Python, React, and a passion for basketball analytics 🏀