
NBA Sage - Complete Codebase Context for AI Assistants

Purpose: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.


πŸ—οΈ Project Architecture Overview

NBA ML/
├── server.py              # Main production server (Flask + React)
├── api/api.py             # Development API server
│
├── src/                   # Core Python modules
│   ├── prediction_pipeline.py   # Main prediction orchestrator
│   ├── feature_engineering.py   # ELO + feature generation
│   ├── data_collector.py        # Historical NBA API data
│   ├── live_data_collector.py   # Real-time game data
│   ├── injury_collector.py      # Player injury tracking
│   ├── prediction_tracker.py    # ChromaDB prediction storage
│   ├── auto_trainer.py          # Automated training scheduler
│   ├── continuous_learner.py    # Incremental model updates
│   ├── preprocessing.py         # Data preprocessing
│   ├── config.py                # Global configuration
│   └── models/
│       ├── game_predictor.py    # XGBoost+LightGBM ensemble
│       ├── mvp_predictor.py     # MVP prediction model
│       └── championship_predictor.py
│
├── web/                   # React Frontend
│   └── src/
│       ├── App.jsx        # Main app with sidebar navigation
│       ├── pages/         # LiveGames, Predictions, MVP, etc.
│       ├── api.js         # API client
│       └── index.css      # Comprehensive CSS design system
│
├── data/
│   ├── api_data/          # Cached NBA API responses (parquet)
│   ├── processed/         # Processed datasets (joblib)
│   └── raw/               # Raw game data
│
└── models/
    └── game_predictor.joblib  # Trained ML model

🔄 Data Flow

NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                       │
                                       ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                       │
                                       ▼
                              Prediction Tracker (ChromaDB)

📦 Core Components Deep Dive

1. server.py - Production Server (39KB, 929 lines)

Critical for Hugging Face deployment. Combines Flask API + React static serving.

Key Sections:

  • Cache Configuration (lines 30-40): In-memory caching for rosters, predictions, live games
  • Startup Cache Warming (lines 140-225): warm_starter_cache() fetches all 30 team rosters on startup
  • Background Scheduler (lines 340-370): APScheduler jobs for ELO updates, retraining, prediction sync
  • API Endpoints (lines 400-860): All REST endpoints for frontend

Important Functions:

warm_starter_cache()      # Fetches real NBA API data for all teams
startup_cache_warming()   # Runs synchronously on server start
auto_retrain_model()      # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings()      # Daily ELO recalculation

Endpoints:

  • GET /api/live-games - Today's games with predictions
  • GET /api/roster/<team> - Team's projected starting 5
  • GET /api/accuracy - Model accuracy statistics
  • GET /api/mvp - MVP race standings
  • GET /api/championship - Championship odds

2. prediction_pipeline.py - Prediction Orchestrator (41KB, 765 lines)

The heart of the system. Orchestrates all predictions.

Key Properties:

self.live_collector      # LiveDataCollector instance
self.injury_collector    # InjuryCollector instance
self.feature_gen         # FeatureGenerator instance
self.tracker             # PredictionTracker (ChromaDB)
self._game_model         # Lazy-loaded GamePredictor

Important Methods:

| Method | Purpose |
| --- | --- |
| predict_game(home, away) | Generate single game prediction |
| get_upcoming_games(days) | Fetch future NBA schedule |
| get_mvp_race() | Calculate MVP standings from live stats |
| get_championship_odds() | Calculate championship probabilities |
| get_team_roster(team) | Fast fallback roster data |

⚠️ CRITICAL: Prediction Algorithm (lines 349-504)

The predict_game() method uses a formula-based approach, NOT the trained ML model:

# Weights in predict_game():
home_talent = (
    0.40 * home_win_pct +      # Current season record
    0.30 * home_form +          # Last 10 games
    0.20 * home_elo_strength +  # Historical ELO
    0.10 * 0.5                  # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability

The trained GamePredictor model exists but is NOT called for live predictions.
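
As a rough illustration of the formula above, the sketch below blends the three inputs with the listed weights and combines the two teams with the standard Log5 formula. It is not the code in prediction_pipeline.py: the function names, the way the home-court and injury adjustments are added to the Log5 result, and the clamping bounds are all assumptions made for this example.

```python
def talent_score(win_pct: float, form: float, elo_strength: float) -> float:
    """Blend the inputs with the documented weights (all inputs in [0, 1])."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate of the probability that team A beats team B."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:  # both scores are 0 or both are 1: no information
        return 0.5
    return (p_a - p_a * p_b) / denom


def home_win_probability(home: dict, away: dict,
                         home_injury: float = 0.0,
                         away_injury: float = 0.0) -> float:
    """Home win probability after home-court and injury adjustments."""
    base = log5(talent_score(**home), talent_score(**away))
    # ASSUMPTION: adjustments are additive on the Log5 output.
    adjusted = base + 0.035 - 0.02 * home_injury + 0.02 * away_injury
    # ASSUMPTION: clamp so the result never reaches 0 or 1.
    return min(max(adjusted, 0.01), 0.99)


even = {"win_pct": 0.5, "form": 0.5, "elo_strength": 0.5}
print(round(home_win_probability(even, even), 3))
```

With two evenly matched teams the Log5 term is 0.5, so the only difference left is the 3.5% home-court bonus.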


3. feature_engineering.py - Feature Generation (29KB, 696 lines)

Contains ELO system and all feature generation logic.

Classes:

| Class | Purpose | Key Methods |
| --- | --- | --- |
| ELOCalculator | ELO rating system | update_ratings(), calculate_game_features() |
| EraNormalizer | Z-score normalization across seasons | fit_season(), transform() |
| StatLoader | Load all stat types | get_team_season_stats(), get_team_top_players_stats() |
| FeatureGenerator | Main feature orchestrator | generate_game_features(), generate_features_for_dataset() |

ELO Configuration:

initial_rating = 1500
k_factor = 20
home_advantage = 100  # ELO points for home court
regression_factor = 0.25  # Season regression to mean
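
To show how these four values typically interact, here is a minimal sketch of a textbook ELO update. It uses the standard logistic expectation on a 400-point scale and is not taken from ELOCalculator; the function names and the exact form of the season regression are assumptions.

```python
INITIAL_RATING = 1500
K_FACTOR = 20
HOME_ADVANTAGE = 100      # ELO points added to the home side
REGRESSION_FACTOR = 0.25  # share of the gap to the mean removed each season


def expected_home_win(home_elo: float, away_elo: float) -> float:
    """Standard ELO expectation with the home bonus applied."""
    diff = (home_elo + HOME_ADVANTAGE) - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))


def update_ratings(home_elo: float, away_elo: float, home_won: bool):
    """Return (new_home_elo, new_away_elo) after one game."""
    expected = expected_home_win(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta  # zero-sum update


def regress_to_mean(rating: float) -> float:
    """ASSUMPTION: move the rating 25% of the way back toward 1500."""
    return rating + REGRESSION_FACTOR * (INITIAL_RATING - rating)
```

For two teams rated 1500, the 100-point home edge gives the home side an expected win probability of about 0.64, and a 1600-rated team starts the next season at 1575 under this regression rule.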

Feature Types Generated:

  • ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
  • Rolling averages (5, 10, 20 game windows)
  • Rest days, back-to-back detection
  • Season record features
  • Head-to-head history
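
Two of the feature types above, rolling averages and rest-day detection, can be sketched in plain Python as follows. This is an illustration only: the helper names are hypothetical, and the choice to average only games played before the current one (to avoid leaking the result being predicted) is an assumption about how FeatureGenerator behaves.

```python
from datetime import date


def rolling_mean(values, window):
    """Mean of the `window` games before each game; None until enough history."""
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # excludes the current game
        out.append(sum(prior) / window if len(prior) == window else None)
    return out


def days_since_last_game(game_dates):
    """Calendar days since the previous game; None for the first game."""
    gaps = []
    for i, d in enumerate(game_dates):
        gaps.append(None if i == 0 else (d - game_dates[i - 1]).days)
    return gaps


def is_back_to_back(gap):
    """A back-to-back is a game played the day after the previous one."""
    return gap == 1


points = [100, 110, 120, 130]
dates = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 5)]
print(rolling_mean(points, 2))
print([is_back_to_back(g) for g in days_since_last_game(dates)])
```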

4. data_collector.py - Historical Data (27KB, 650 lines)

Collects comprehensive NBA data from official API.

Classes:

| Class | Data Collected |
| --- | --- |
| GameDataCollector | Game results per season |
| TeamDataCollector | Team stats (basic, advanced, clutch, hustle, defense) |
| PlayerDataCollector | Player stats |
| CacheManager | Parquet file caching |

Key Features:

  • Exponential backoff retry for rate limiting
  • Per-season parquet caching
  • Checkpoint system for resumable collection
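
The exponential backoff retry mentioned above usually looks something like the sketch below. The function name, default delays, and jitter amount are assumptions for illustration; the collector's real parameters are not documented here.

```python
import random
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` is used.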

5. live_data_collector.py - Real-Time Data (9KB, 236 lines)

Uses nba_api.live endpoints for real-time game data.

Key Methods:

get_live_scoreboard()  # Today's games with live scores
get_game_boxscore(id)  # Detailed box score
get_games_by_status()  # Filter: NOT_STARTED, IN_PROGRESS, FINAL

Data Fields Returned:

  • game_id, game_code
  • home_team, away_team (tricodes)
  • home_score, away_score
  • period, clock
  • status
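
For orientation, the fields above could be modelled as a small record type and grouped by status as sketched here. The LiveGame class, the field types, and the group_by_status helper are hypothetical; the real collector may return plain dicts with different value types.

```python
from dataclasses import dataclass


@dataclass
class LiveGame:
    game_id: str
    game_code: str
    home_team: str   # tricode, e.g. "BOS"
    away_team: str   # tricode
    home_score: int
    away_score: int
    period: int
    clock: str
    status: str      # NOT_STARTED, IN_PROGRESS, or FINAL


def group_by_status(games):
    """Bucket games under the three documented status values."""
    grouped = {"NOT_STARTED": [], "IN_PROGRESS": [], "FINAL": []}
    for game in games:
        # setdefault keeps any unexpected status instead of dropping it
        grouped.setdefault(game.status, []).append(game)
    return grouped
```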

6. prediction_tracker.py - Persistence (20KB, 508 lines)

Stores predictions and tracks accuracy using ChromaDB Cloud.

Features:

  • ChromaDB Cloud integration (with local JSON fallback)
  • Prediction storage before games start
  • Result updating after games complete
  • Comprehensive accuracy statistics

Key Methods:

save_prediction(game_id, prediction)  # Store pre-game prediction
update_result(game_id, winner, scores)  # Update with final result
get_accuracy_stats()  # Overall, by confidence, by team
get_pending_predictions()  # Awaiting results
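
The save, update, and report lifecycle behind these methods can be sketched with an in-memory stand-in. This is not PredictionTracker: it skips ChromaDB and the JSON fallback entirely, and the record shape (a `predicted_winner` key) is an assumption made for the example.

```python
class InMemoryTracker:
    """Toy stand-in for the tracker; keeps records in a dict."""

    def __init__(self):
        self._records = {}

    def save_prediction(self, game_id, prediction):
        # Store the pre-game pick; result fields stay empty until the game ends.
        self._records[game_id] = {**prediction, "actual_winner": None, "scores": None}

    def update_result(self, game_id, winner, scores):
        record = self._records[game_id]  # KeyError if the game was never predicted
        record["actual_winner"] = winner
        record["scores"] = scores

    def get_pending_predictions(self):
        return [gid for gid, r in self._records.items() if r["actual_winner"] is None]

    def get_accuracy_stats(self):
        done = [r for r in self._records.values() if r["actual_winner"] is not None]
        correct = sum(r["predicted_winner"] == r["actual_winner"] for r in done)
        return {
            "total": len(done),
            "correct": correct,
            "accuracy": correct / len(done) if done else None,
        }
```

Only games with a recorded result count toward accuracy, which is why pending predictions are tracked separately.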

7. models/game_predictor.py - ML Model (12KB, 332 lines)

XGBoost + LightGBM ensemble classifier.

Architecture:

Input Features ──┬──► XGBoost ──┐
                 │              ├──► Weighted Average ──► Win Probability
                 └──► LightGBM ─┘
                      (50/50 weight)
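
The weighted-average step in the diagram reduces to a few lines. The sketch below assumes each model has already produced a home-win probability; the function name and the [loss_prob, win_prob] return order mirror the predict_proba() description in this document, but the code itself is illustrative and does not call XGBoost or LightGBM.

```python
def blend_win_probability(xgb_prob: float, lgbm_prob: float,
                          xgb_weight: float = 0.5):
    """Weighted average of two win probabilities as [loss_prob, win_prob]."""
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be between 0 and 1")
    win = xgb_weight * xgb_prob + (1.0 - xgb_weight) * lgbm_prob
    return [1.0 - win, win]


print(blend_win_probability(0.6, 0.8))
```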

Key Methods:

train(X_train, y_train, X_val, y_val)  # Train both models
predict_proba(X)  # Get [loss_prob, win_prob]
predict_with_confidence(X)  # Detailed prediction info
explain_prediction(X)  # Feature importance for prediction
save() / load()  # Persist to models/game_predictor.joblib

⚠️ NOTE: Model exists but predict_game() doesn't use it!


8. auto_trainer.py & continuous_learner.py - Auto Training

AutoTrainer (Singleton scheduler):

  • Runs background loop checking for tasks
  • Ingests completed games every hour
  • Smart retraining: only after ALL daily games complete
  • If new accuracy < old accuracy, reverts model
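
The two guards described above (retrain only when every game is final, keep the old model if accuracy drops) can be expressed as small pure functions. The names and return shape are hypothetical and are not taken from auto_trainer.py.

```python
def all_games_final(statuses):
    """True only when there is at least one game and every one is FINAL."""
    return bool(statuses) and all(s == "FINAL" for s in statuses)


def choose_model(current_model, current_accuracy,
                 candidate_model, candidate_accuracy):
    """Return (model, accuracy, promoted); revert if the candidate is worse."""
    if candidate_accuracy < current_accuracy:
        return current_model, current_accuracy, False
    return candidate_model, candidate_accuracy, True


if all_games_final(["FINAL", "FINAL"]):
    model, accuracy, promoted = choose_model("old", 0.66, "new", 0.64)
    print(model, accuracy, promoted)
```

In this example the retrained candidate scores lower than the current model, so the current model is kept.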

ContinuousLearner (Update workflow):

ingest_completed_games() ──► update_features() ──► retrain_model()

🗄️ Database & Storage

ChromaDB Cloud

  • Purpose: Persistent prediction storage
  • Credentials: Set via environment variables (CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY)
  • Fallback: data/processed/predictions_local.json
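
One way to decide between the cloud store and the local fallback is to check that all three credentials are present, as in the sketch below. The select_storage_backend helper and its return labels are assumptions; the snippet deliberately does not create a ChromaDB client.

```python
import os
from pathlib import Path

REQUIRED_VARS = ("CHROMA_TENANT", "CHROMA_DATABASE", "CHROMA_API_KEY")
LOCAL_FALLBACK = Path("data/processed/predictions_local.json")


def select_storage_backend(env=None):
    """Return (backend_name, missing_vars) based on the environment."""
    env = os.environ if env is None else env
    # Treat unset and empty-string variables the same way.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        return "local_json", missing
    return "chromadb_cloud", []


backend, missing = select_storage_backend()
if backend == "local_json":
    print(f"Using {LOCAL_FALLBACK}; missing: {', '.join(missing)}")
```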

Parquet Files

  • data/api_data/*.parquet - Cached API responses
  • data/api_data/all_games_summary.parquet - Consolidated game history (41K+ games)

Joblib Files

  • models/game_predictor.joblib - Trained ML model
  • data/processed/game_dataset.joblib - Processed training data

🌐 Frontend Architecture

React + Vite with custom CSS design system.

Pages:

| Page | File | Purpose |
| --- | --- | --- |
| Live Games | LiveGames.jsx | Today's games, live scores, predictions |
| Predictions | Predictions.jsx | Upcoming games with predictions |
| Head to Head | HeadToHead.jsx | Compare two teams |
| Accuracy | Accuracy.jsx | Model performance stats |
| MVP Race | MvpRace.jsx | Current MVP standings |
| Championship | Championship.jsx | Championship odds |

Key Frontend Components:

  • TeamLogo.jsx - Official NBA team logos
  • api.js - API client with base URL handling
  • index.css - Complete design system (27KB)

🔧 Configuration (src/config.py)

Critical Settings:

# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}

# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")

# Feature engineering
FEATURE_CONFIG = {
    "rolling_windows": [5, 10, 20],
    "min_games_for_features": 5
}

# ELO system
ELO_CONFIG = {
    "initial_rating": 1500,
    "k_factor": 20,
    "home_advantage": 100
}

⚠️ Known Issues & Technical Debt

  1. ML Model Not Used: predict_game() uses a hand-weighted formula, not the trained GamePredictor
  2. Season Hardcoding: Some code paths hardcode the 2025-26 season
  3. Fallback Data: The pipeline keeps hardcoded rosters as a backup
  4. Function Order: warm_starter_cache() must be defined before the scheduler calls it

🚀 Deployment Notes

Hugging Face Spaces:

  • Uses persistent /data directory for storage
  • Dockerfile copies models/ and data/api_data/
  • Git LFS for large files (.joblib, .parquet)
  • Port 7860 for HF Spaces

Environment Variables:

CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY  # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR  # Override paths
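
A path override of this kind is usually resolved once at import time. The sketch below shows one possible pattern; the resolve_dir helper is hypothetical, and the default of "data" for NBA_ML_DATA_DIR is an assumption ("models" matches MODELS_DIR in the configuration section).

```python
import os
from pathlib import Path


def resolve_dir(env_var: str, default: str, env=None) -> Path:
    """Use the environment override when set and non-empty, else the default."""
    env = os.environ if env is None else env
    return Path(env.get(env_var) or default)


DATA_DIR = resolve_dir("NBA_ML_DATA_DIR", "data")        # ASSUMED default
MODELS_DIR = resolve_dir("NBA_ML_MODELS_DIR", "models")
```

On Hugging Face Spaces, setting NBA_ML_DATA_DIR to a path under the persistent /data directory would redirect storage there without code changes.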

📋 Quick Reference: Common Tasks

Add new API endpoint:

  1. Add route in server.py (production) AND api/api.py (development)
  2. Add frontend call in web/src/api.js
  3. Create/update page component in web/src/pages/

Modify prediction algorithm:

  1. Edit PredictionPipeline.predict_game() in prediction_pipeline.py
  2. Consider blending with GamePredictor model

Update ML model:

  1. Retrain via ContinuousLearner.retrain_model()
  2. Or trigger via POST /api/admin/retrain

Add new feature:

  1. Add to FeatureGenerator in feature_engineering.py
  2. Update preprocessing pipeline
  3. Retrain model

Last updated: January 2026