NBA_PREDICTOR / claude.md

jashdoshi77

Add analytics, confidence meter, enhanced H2H, daily MVP refresh

3e6f1d3 about 1 month ago

NBA Sage - Complete Codebase Context for AI Assistants

Purpose: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.

🏗️ Project Architecture Overview

NBA ML/
├── server.py              # Main production server (Flask + React)
├── api/api.py             # Development API server
│
├── src/                   # Core Python modules
│   ├── prediction_pipeline.py   # Main prediction orchestrator
│   ├── feature_engineering.py   # ELO + feature generation
│   ├── data_collector.py        # Historical NBA API data
│   ├── live_data_collector.py   # Real-time game data
│   ├── injury_collector.py      # Player injury tracking
│   ├── prediction_tracker.py    # ChromaDB prediction storage
│   ├── auto_trainer.py          # Automated training scheduler
│   ├── continuous_learner.py    # Incremental model updates
│   ├── preprocessing.py         # Data preprocessing
│   ├── config.py                # Global configuration
│   └── models/
│       ├── game_predictor.py    # XGBoost+LightGBM ensemble
│       ├── mvp_predictor.py     # MVP prediction model
│       └── championship_predictor.py
│
├── web/                   # React Frontend
│   └── src/
│       ├── App.jsx        # Main app with sidebar navigation
│       ├── pages/         # LiveGames, Predictions, MVP, etc.
│       ├── api.js         # API client
│       └── index.css      # Comprehensive CSS design system
│
├── data/
│   ├── api_data/          # Cached NBA API responses (parquet)
│   ├── processed/         # Processed datasets (joblib)
│   └── raw/               # Raw game data
│
└── models/
    └── game_predictor.joblib  # Trained ML model

🔄 Data Flow

NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                       │
                                       ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                       │
                                       ▼
                              Prediction Tracker (ChromaDB)

📦 Core Components Deep Dive

1. `server.py` - Production Server (39KB, 929 lines)

Required for the Hugging Face deployment. A single Flask process serves both the REST API and the built React static files.

Key Sections:

Cache Configuration (lines 30-40): In-memory caching for rosters, predictions, live games
Startup Cache Warming (lines 140-225): warm_starter_cache() fetches all 30 team rosters on startup
Background Scheduler (lines 340-370): APScheduler jobs for ELO updates, retraining, prediction sync
API Endpoints (lines 400-860): All REST endpoints for frontend
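The in-memory caching listed above can be pictured as a small time-to-live cache. This is a minimal sketch, not the code in server.py: the `TTLCache` class, its method names, the key format, and the injectable clock are all hypothetical and exist only to illustrate the idea.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Remember when the value was stored so get() can judge its age.
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value
```

A caller would check `cache.get("roster:BOS")` first and only hit the NBA API on a miss, then store the fresh result with `cache.set(...)`.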

Important Functions:

warm_starter_cache()      # Fetches real NBA API data for all teams
startup_cache_warming()   # Runs synchronously on server start
auto_retrain_model()      # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings()      # Daily ELO recalculation

Endpoints:

GET /api/live-games - Today's games with predictions
GET /api/roster/<team> - Team's projected starting 5
GET /api/accuracy - Model accuracy statistics
GET /api/mvp - MVP race standings
GET /api/championship - Championship odds

2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)

The heart of the system. Orchestrates all predictions.

Key Properties:

self.live_collector      # LiveDataCollector instance
self.injury_collector    # InjuryCollector instance
self.feature_gen         # FeatureGenerator instance
self.tracker             # PredictionTracker (ChromaDB)
self._game_model         # Lazy-loaded GamePredictor

Important Methods:

| Method | Purpose |
|---|---|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |

⚠️ CRITICAL: Prediction Algorithm (lines 349-504)

The predict_game() method uses a formula-based approach, NOT the trained ML model:

# Weights in predict_game():
home_talent = (
    0.40 * home_win_pct +      # Current season record
    0.30 * home_form +          # Last 10 games
    0.20 * home_elo_strength +  # Historical ELO
    0.10 * 0.5                  # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability

The trained GamePredictor model exists but is NOT called for live predictions.
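To make the formula concrete, here is a minimal sketch of the weighted talent score combined with the Log5 formula. The weights, the +3.5% home-court bonus, and the -2% per injury point come from the comments above. How those adjustments are applied is an assumption: this sketch adds them to the talent score before Log5 and clamps the result, which may differ from the actual `predict_game()` code. All function names here are hypothetical, and inputs are assumed to lie in [0, 1].

```python
def talent_score(win_pct: float, form: float, elo_strength: float) -> float:
    """Weighted blend: 40% season record, 30% recent form, 20% ELO, 10% baseline of 0.5."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate of the probability that team A beats team B."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:
        return 0.5  # both teams rated 0 or both rated 1: no information
    return (p_a - p_a * p_b) / denom


def clamp(p: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep an adjusted strength inside the open interval (0, 1)."""
    return max(lo, min(hi, p))


def home_win_probability(
    home_talent: float,
    away_talent: float,
    home_injury_points: float = 0.0,
    away_injury_points: float = 0.0,
) -> float:
    # ASSUMPTION: adjustments are additive and applied before Log5.
    home = clamp(home_talent + 0.035 - 0.02 * home_injury_points)
    away = clamp(away_talent - 0.02 * away_injury_points)
    return log5(home, away)
```

Under these assumptions, two otherwise identical teams with a talent score of 0.5 give the home side a probability of about 0.535, because Log5 against a 0.5 opponent returns the team's own strength.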

3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)

Contains ELO system and all feature generation logic.

Classes:

| Class | Purpose | Key Methods |
|---|---|---|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |

ELO Configuration:

initial_rating = 1500
k_factor = 20
home_advantage = 100  # ELO points for home court
regression_factor = 0.25  # Season regression to mean
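The following sketch shows how these four settings typically enter a standard ELO calculation. It uses the textbook ELO expectation and update formulas with the values above. The function names and signatures are hypothetical, and the actual `ELOCalculator` may differ, for example by scaling updates with margin of victory.

```python
from typing import Tuple

INITIAL_RATING = 1500.0
K_FACTOR = 20.0
HOME_ADVANTAGE = 100.0      # ELO points added to the home team
REGRESSION_FACTOR = 0.25    # share of the gap to the mean removed each season


def expected_home_win(home_elo: float, away_elo: float) -> float:
    """Standard ELO expectation with the home-court bonus applied to the home side."""
    diff = away_elo - (home_elo + HOME_ADVANTAGE)
    return 1.0 / (1.0 + 10 ** (diff / 400.0))


def update_ratings(home_elo: float, away_elo: float, home_won: bool) -> Tuple[float, float]:
    """Zero-sum update: whatever the home team gains, the away team loses."""
    expected = expected_home_win(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(rating: float) -> float:
    """Between seasons, pull a rating part of the way back toward the initial rating."""
    return rating + REGRESSION_FACTOR * (INITIAL_RATING - rating)
```

With these values, two 1500-rated teams give the home side an expected win probability of roughly 0.64, and a 1600-rated team starts the next season at 1575.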

Feature Types Generated:

ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
Rolling averages (5, 10, 20 game windows)
Rest days, back-to-back detection
Season record features
Head-to-head history
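Two of the feature types above, rolling averages and back-to-back detection, can be sketched with the standard library alone. This is an illustration under assumptions, not the `FeatureGenerator` code: the function names are hypothetical, and the real implementation likely operates on data frames rather than plain lists. The minimum of 5 games mirrors `min_games_for_features` in the configuration.

```python
from datetime import date
from typing import Optional, Sequence


def rolling_mean(values: Sequence[float], window: int, min_games: int = 5) -> Optional[float]:
    """Mean of the most recent `window` values, or None when history is too short.

    `values` should contain only games played before the one being predicted,
    so the feature does not leak the outcome it is meant to predict.
    """
    if len(values) < min_games:
        return None
    recent = values[-window:]
    return sum(recent) / len(recent)


def rest_days(previous_game: date, current_game: date) -> int:
    """Full days off between two games; 0 means the games are on consecutive days."""
    return (current_game - previous_game).days - 1


def is_back_to_back(previous_game: date, current_game: date) -> bool:
    return rest_days(previous_game, current_game) == 0
```

Calling `rolling_mean(points, window)` once for each of the 5, 10, and 20 game windows yields the three rolling features per statistic.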

4. `data_collector.py` - Historical Data (27KB, 650 lines)

Collects historical NBA data (games, team stats, player stats) from the official NBA API.

Classes:

| Class | Data Collected |
|---|---|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |

Key Features:

Exponential backoff retry for rate limiting
Per-season parquet caching
Checkpoint system for resumable collection
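Exponential backoff, the first feature listed above, can be sketched as follows. The helper name, its parameters, and the jitter amount are assumptions made for illustration. The collector's actual retry logic, delays, and exception handling are not shown in this document.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter keeps parallel clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Passing `sleep` as a parameter is a design choice for testability: a test can record the requested delays instead of actually waiting.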

5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)

Uses nba_api.live endpoints for real-time game data.

Key Methods:

get_live_scoreboard()  # Today's games with live scores
get_game_boxscore(id)  # Detailed box score
get_games_by_status()  # Filter: NOT_STARTED, IN_PROGRESS, FINAL

Data Fields Returned:

game_id, game_code
home_team, away_team (tricodes)
home_score, away_score
period, clock
status
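One way to picture the returned record is a small data class with the fields listed above, plus a status filter. The field names and status values come from this document. The `LiveGame` class, the field types, and the `games_by_status` helper are hypothetical, and the real collector may return plain dictionaries instead.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class LiveGame:
    """One scoreboard entry; field types are assumptions."""
    game_id: str
    game_code: str
    home_team: str   # tricode, e.g. "BOS"
    away_team: str   # tricode
    home_score: int
    away_score: int
    period: int
    clock: str
    status: str      # NOT_STARTED, IN_PROGRESS, or FINAL


def games_by_status(games: Iterable[LiveGame], status: str) -> List[LiveGame]:
    """Return only the games whose status matches exactly."""
    return [game for game in games if game.status == status]
```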

6. `prediction_tracker.py` - Persistence (20KB, 508 lines)

Stores predictions and tracks accuracy using ChromaDB Cloud.

Features:

ChromaDB Cloud integration (with local JSON fallback)
Prediction storage before games start
Result updating after games complete
Comprehensive accuracy statistics

Key Methods:

save_prediction(game_id, prediction)  # Store pre-game prediction
update_result(game_id, winner, scores)  # Update with final result
get_accuracy_stats()  # Overall, by confidence, by team
get_pending_predictions()  # Awaiting results
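The accuracy breakdown can be sketched as a pass over stored predictions. This is an illustration only: the record keys (`predicted_winner`, `actual_winner`, `confidence`), the bucket thresholds of 0.60 and 0.70, and both function names are assumptions, not the schema used by `PredictionTracker`.

```python
from typing import Any, Dict, Iterable, List, Mapping


def confidence_bucket(confidence: float) -> str:
    """Map a win probability to a coarse label (thresholds are assumptions)."""
    if confidence >= 0.70:
        return "high"
    if confidence >= 0.60:
        return "medium"
    return "low"


def accuracy_stats(predictions: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Overall and per-bucket accuracy; predictions without a result are skipped."""
    correct = 0
    total = 0
    buckets: Dict[str, List[int]] = {}  # label -> [correct, total]
    for pred in predictions:
        actual = pred.get("actual_winner")
        if actual is None:
            continue  # still pending, nothing to score yet
        hit = int(pred["predicted_winner"] == actual)
        correct += hit
        total += 1
        counts = buckets.setdefault(confidence_bucket(pred["confidence"]), [0, 0])
        counts[0] += hit
        counts[1] += 1
    return {
        "completed": total,
        "overall": correct / total if total else None,
        "by_confidence": {label: c / n for label, (c, n) in buckets.items()},
    }
```

Returning `None` for `overall` when no games have finished avoids reporting a misleading 0% accuracy at the start of a season.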

7. `models/game_predictor.py` - ML Model (12KB, 332 lines)

XGBoost + LightGBM ensemble classifier.

Architecture:

Input Features ──┬──► XGBoost ──┐
                 │              │──► Weighted Average ──► Win Probability
                 └──► LightGBM ─┘
                      (50/50 weight)
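The weighted-average step in the diagram reduces to a few lines. The sketch below assumes each model has already produced a home-win probability per game. The function names are hypothetical, and the real `GamePredictor` presumably works on arrays returned by the two libraries rather than plain lists.

```python
from typing import List, Sequence


def blend_win_probabilities(
    xgb_probs: Sequence[float],
    lgbm_probs: Sequence[float],
    xgb_weight: float = 0.5,
) -> List[float]:
    """Weighted average of two models' win probabilities (50/50 by default)."""
    if len(xgb_probs) != len(lgbm_probs):
        raise ValueError("both models must score the same number of games")
    lgbm_weight = 1.0 - xgb_weight
    return [xgb_weight * x + lgbm_weight * l for x, l in zip(xgb_probs, lgbm_probs)]


def to_proba_pairs(win_probs: Sequence[float]) -> List[List[float]]:
    """Expand each win probability into [loss_prob, win_prob]."""
    return [[1.0 - p, p] for p in win_probs]
```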

Key Methods:

train(X_train, y_train, X_val, y_val)  # Train both models
predict_proba(X)  # Get [loss_prob, win_prob]
predict_with_confidence(X)  # Detailed prediction info
explain_prediction(X)  # Feature importance for prediction
save() / load()  # Persist to models/game_predictor.joblib

⚠️ NOTE: Model exists but predict_game() doesn't use it!

8. `auto_trainer.py` & `continuous_learner.py` - Auto Training

AutoTrainer (Singleton scheduler):

Runs background loop checking for tasks
Ingests completed games every hour
Smart retraining: only after ALL daily games complete
Reverts to the previous model if the retrained model's accuracy is lower
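The two rules at the end of this list (retrain only once every game of the day is final, and keep the old model when accuracy drops) can be sketched as below. Both function names are hypothetical, and the sketch assumes accuracy is measured on the same validation data for both models. The status value `FINAL` is taken from the live collector section.

```python
from typing import Any, Sequence, Tuple


def all_games_final(statuses: Sequence[str]) -> bool:
    """True only when there is at least one game today and every game has finished."""
    return bool(statuses) and all(status == "FINAL" for status in statuses)


def choose_model(
    old_model: Any,
    old_accuracy: float,
    new_model: Any,
    new_accuracy: float,
) -> Tuple[Any, float]:
    """Keep the retrained model unless it scores strictly worse than the old one."""
    if new_accuracy < old_accuracy:
        return old_model, old_accuracy  # revert
    return new_model, new_accuracy
```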

ContinuousLearner (Update workflow):

ingest_completed_games() ──► update_features() ──► retrain_model()

🗄️ Database & Storage

ChromaDB Cloud

Purpose: Persistent prediction storage
Credentials: Set via environment variables (CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY)
Fallback: data/processed/predictions_local.json
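The choice between ChromaDB Cloud and the local JSON fallback can be sketched as a check on the three environment variables. The variable names and fallback path come from this document. The function names and the returned labels are hypothetical, and the sketch deliberately stops short of creating a ChromaDB client.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_VARS = ("CHROMA_TENANT", "CHROMA_DATABASE", "CHROMA_API_KEY")
LOCAL_FALLBACK_PATH = "data/processed/predictions_local.json"


def chroma_settings(env: Optional[Mapping[str, str]] = None) -> Optional[Dict[str, str]]:
    """Return the credentials if all three variables are set and non-empty, else None."""
    env = os.environ if env is None else env
    values = {name: env.get(name, "") for name in REQUIRED_VARS}
    return values if all(values.values()) else None


def storage_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick 'chromadb_cloud' when fully configured, otherwise 'local_json'."""
    return "chromadb_cloud" if chroma_settings(env) is not None else "local_json"
```

Treating a partially configured environment as unconfigured keeps the tracker from attempting a cloud connection that is certain to fail.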

Parquet Files

data/api_data/*.parquet - Cached API responses
data/api_data/all_games_summary.parquet - Consolidated game history (41K+ games)

Joblib Files

models/game_predictor.joblib - Trained ML model
data/processed/game_dataset.joblib - Processed training data

🌐 Frontend Architecture

React + Vite with custom CSS design system.

Pages:

| Page | File | Purpose |
|---|---|---|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |

Key Frontend Components:

TeamLogo.jsx - Official NBA team logos
api.js - API client with base URL handling
index.css - Complete design system (27KB)

🔧 Configuration (`src/config.py`)

Critical Settings:

# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}

# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")

# Feature engineering
FEATURE_CONFIG = {
    "rolling_windows": [5, 10, 20],
    "min_games_for_features": 5
}

# ELO system
ELO_CONFIG = {
    "initial_rating": 1500,
    "k_factor": 20,
    "home_advantage": 100
}

⚠️ Known Issues & Technical Debt

ML Model Not Used: predict_game() uses the formula, not the trained GamePredictor
Season Hardcoding: Some code paths hardcode the 2025-26 season
Fallback Data: The pipeline ships hardcoded rosters as a backup
Function Order: warm_starter_cache() must be defined before the scheduler calls it

🚀 Deployment Notes

Hugging Face Spaces:

Uses persistent /data directory for storage
Dockerfile copies models/ and data/api_data/
Git LFS for large files (.joblib, .parquet)
Port 7860 for HF Spaces

Environment Variables:

CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY  # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR  # Override paths
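A path override of this kind usually amounts to reading the variable and falling back to a default. The sketch below assumes that `NBA_ML_DATA_DIR` replaces the `data` root and `NBA_ML_MODELS_DIR` replaces `models`. Those defaults and the `resolve_dir` helper are assumptions, since the exact behaviour in `src/config.py` is not shown here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dir(env_var: str, default: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Use the environment override when it is set and non-empty, else the default path."""
    env = os.environ if env is None else env
    override = env.get(env_var)
    return Path(override) if override else Path(default)


# ASSUMPTION: these defaults mirror the repository layout described earlier.
DATA_DIR = resolve_dir("NBA_ML_DATA_DIR", "data")
MODELS_DIR = resolve_dir("NBA_ML_MODELS_DIR", "models")
```

On Hugging Face Spaces, setting `NBA_ML_DATA_DIR=/data` would, under this assumption, point storage at the persistent directory.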

📋 Quick Reference: Common Tasks

Add new API endpoint:

Add route in server.py (production) AND api/api.py (development)
Add frontend call in web/src/api.js
Create/update page component in web/src/pages/

Modify prediction algorithm:

Edit PredictionPipeline.predict_game() in prediction_pipeline.py
Consider blending with GamePredictor model

Update ML model:

Retrain via ContinuousLearner.retrain_model()
Or trigger via POST /api/admin/retrain

Add new feature:

Add to FeatureGenerator in feature_engineering.py
Update preprocessing pipeline
Retrain model

Last updated: January 2026
