NBA Sage - Complete Codebase Context for AI Assistants
Purpose: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.
Project Architecture Overview
NBA ML/
├── server.py                      # Main production server (Flask + React)
├── api/api.py                     # Development API server
│
├── src/                           # Core Python modules
│   ├── prediction_pipeline.py     # Main prediction orchestrator
│   ├── feature_engineering.py     # ELO + feature generation
│   ├── data_collector.py          # Historical NBA API data
│   ├── live_data_collector.py     # Real-time game data
│   ├── injury_collector.py        # Player injury tracking
│   ├── prediction_tracker.py      # ChromaDB prediction storage
│   ├── auto_trainer.py            # Automated training scheduler
│   ├── continuous_learner.py      # Incremental model updates
│   ├── preprocessing.py           # Data preprocessing
│   ├── config.py                  # Global configuration
│   └── models/
│       ├── game_predictor.py      # XGBoost+LightGBM ensemble
│       ├── mvp_predictor.py       # MVP prediction model
│       └── championship_predictor.py
│
├── web/                           # React Frontend
│   └── src/
│       ├── App.jsx                # Main app with sidebar navigation
│       ├── pages/                 # LiveGames, Predictions, MVP, etc.
│       ├── api.js                 # API client
│       └── index.css              # Comprehensive CSS design system
│
├── data/
│   ├── api_data/                  # Cached NBA API responses (parquet)
│   ├── processed/                 # Processed datasets (joblib)
│   └── raw/                       # Raw game data
│
└── models/
    └── game_predictor.joblib      # Trained ML model
Data Flow
NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                         │
                                         ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                         │
                                         ▼
                           Prediction Tracker (ChromaDB)
Core Components Deep Dive
1. server.py - Production Server (39KB, 929 lines)
Critical for Hugging Face deployment. Combines Flask API + React static serving.
Key Sections:
- Cache Configuration (lines 30-40): In-memory caching for rosters, predictions, live games
- Startup Cache Warming (lines 140-225): `warm_starter_cache()` fetches all 30 team rosters on startup
- Background Scheduler (lines 340-370): APScheduler jobs for ELO updates, retraining, prediction sync
- API Endpoints (lines 400-860): All REST endpoints for frontend
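The caching and startup warming described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `server.py`: the `TTLCache` class, the expiry behaviour, the `roster:` key prefix, and the `warm_cache` helper are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # stale entry: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock(), value)


def warm_cache(cache: TTLCache, teams: Iterable[str],
               fetch_roster: Callable[[str], Any]) -> int:
    """Pre-populate one roster per team; a failed fetch is skipped, not fatal."""
    warmed = 0
    for team in teams:
        try:
            cache.set(f"roster:{team}", fetch_roster(team))
            warmed += 1
        except Exception:
            continue  # leave this team uncached so a later request can retry
    return warmed
```

Skipping failed fetches keeps one unavailable team from blocking server start, which matters if warming runs synchronously as noted for `startup_cache_warming()`.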
Important Functions:
warm_starter_cache() # Fetches real NBA API data for all teams
startup_cache_warming() # Runs synchronously on server start
auto_retrain_model() # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings() # Daily ELO recalculation
Endpoints:
- `GET /api/live-games` - Today's games with predictions
- `GET /api/roster/<team>` - Team's projected starting 5
- `GET /api/accuracy` - Model accuracy statistics
- `GET /api/mvp` - MVP race standings
- `GET /api/championship` - Championship odds
2. prediction_pipeline.py - Prediction Orchestrator (41KB, 765 lines)
The heart of the system. Orchestrates all predictions.
Key Properties:
self.live_collector # LiveDataCollector instance
self.injury_collector # InjuryCollector instance
self.feature_gen # FeatureGenerator instance
self.tracker # PredictionTracker (ChromaDB)
self._game_model # Lazy-loaded GamePredictor
Important Methods:
| Method | Purpose |
|---|---|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |
CRITICAL: Prediction Algorithm (lines 349-504)
The predict_game() method uses a formula-based approach, NOT the trained ML model:
# Weights in predict_game():
home_talent = (
0.40 * home_win_pct + # Current season record
0.30 * home_form + # Last 10 games
0.20 * home_elo_strength + # Historical ELO
0.10 * 0.5 # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability
The trained GamePredictor model exists but is NOT called for live predictions.
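A minimal sketch of how the weights and the Log5 step could fit together. The weights and the Log5 formula come from the description above; the order of operations is an assumption. In particular, applying the +3.5% home-court and -2%-per-injury-point adjustments additively to the talent score before Log5, and clamping the result, are illustrative choices, and the function names are hypothetical.

```python
from typing import Dict


def talent_score(win_pct: float, form: float, elo_strength: float) -> float:
    """Blend the inputs with the documented weights; all inputs are in [0, 1]."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate that a team of strength p_a beats a team of strength p_b."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:  # both strengths are 0 or both are 1: no information
        return 0.5
    return (p_a - p_a * p_b) / denom


def clamp(x: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a strength strictly inside (0, 1) so Log5 stays well defined."""
    return max(lo, min(hi, x))


def home_win_probability(home: Dict[str, float], away: Dict[str, float],
                         home_injury_points: float = 0.0,
                         away_injury_points: float = 0.0) -> float:
    # ASSUMPTION: adjustments are added to the talent score before Log5.
    home_talent = talent_score(**home) + 0.035 - 0.02 * home_injury_points
    away_talent = talent_score(**away) - 0.02 * away_injury_points
    return log5(clamp(home_talent), clamp(away_talent))
```

With two otherwise identical teams, this sketch favours the home side through the home-court term alone, and enough injury points on the home roster flip the edge to the visitor.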
3. feature_engineering.py - Feature Generation (29KB, 696 lines)
Contains ELO system and all feature generation logic.
Classes:
| Class | Purpose | Key Methods |
|---|---|---|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |
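To illustrate what per-season z-score normalization means, here is a small sketch with the same two method names as `EraNormalizer`. The `SeasonZScore` class, its internals, and the sample values are hypothetical; the real class may handle multiple stats, missing seasons, or sample vs. population variance differently.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


class SeasonZScore:
    """Hypothetical stand-in for a per-season z-score normalizer."""

    def __init__(self) -> None:
        self._params: Dict[str, Tuple[float, float]] = {}

    def fit_season(self, season: str, values: List[float]) -> None:
        """Store the mean and population standard deviation for one season."""
        sigma = pstdev(values)
        # Guard against a constant column, which would divide by zero later.
        self._params[season] = (mean(values), sigma if sigma > 0 else 1.0)

    def transform(self, season: str, value: float) -> float:
        """Express a value as standard deviations from that season's mean."""
        mu, sigma = self._params[season]
        return (value - mu) / sigma


# Made-up scoring levels for two eras: the same relative standing maps to
# the same z-score even though the raw numbers differ.
normalizer = SeasonZScore()
normalizer.fit_season("low_scoring_era", [90.0, 100.0, 110.0])
normalizer.fit_season("high_scoring_era", [105.0, 115.0, 125.0])
```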
ELO Configuration:
initial_rating = 1500
k_factor = 20
home_advantage = 100 # ELO points for home court
regression_factor = 0.25 # Season regression to mean
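These four settings map onto the standard Elo formulas as sketched below. This is the textbook Elo update, not a copy of `ELOCalculator`. Reading `regression_factor` as "pull each rating 25% of the way back to the initial rating between seasons" is an assumption, and the function names are illustrative.

```python
from typing import Tuple

INITIAL_RATING = 1500.0
K_FACTOR = 20.0
HOME_ADVANTAGE = 100.0      # rating points credited to the home team
REGRESSION_FACTOR = 0.25    # ASSUMPTION: fraction pulled back toward the mean


def expected_home_win(home_elo: float, away_elo: float) -> float:
    """Standard Elo expectation, with the home bonus added to the home side."""
    diff = (home_elo + HOME_ADVANTAGE) - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


def update_ratings(home_elo: float, away_elo: float,
                   home_won: bool) -> Tuple[float, float]:
    """Zero-sum update: whatever the home team gains, the away team loses."""
    expected = expected_home_win(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(rating: float) -> float:
    """Between-season regression toward the initial rating."""
    return rating + REGRESSION_FACTOR * (INITIAL_RATING - rating)
```

Under this reading, two equally rated teams give the home side roughly a 64% expectation, and a 1600-rated team would start the next season at 1575.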
Feature Types Generated:
- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
- Rolling averages (5, 10, 20 game windows)
- Rest days, back-to-back detection
- Season record features
- Head-to-head history
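Two of the feature types above, rolling averages and rest-day detection, can be sketched in plain Python. The helper names are hypothetical. Using only games strictly before the current one (to avoid leaking the result being predicted) and returning `None` until five prior games exist are assumptions about how such features are usually built, not a description of `FeatureGenerator`.

```python
from datetime import date
from typing import List, Optional, Sequence


def rolling_mean_before(values: Sequence[float], window: int,
                        min_games: int = 5) -> List[Optional[float]]:
    """For game i, average up to `window` games strictly before i.

    Returns None for games that have fewer than `min_games` prior games.
    """
    out: List[Optional[float]] = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if len(prior) >= min_games else None)
    return out


def rest_days(game_dates: Sequence[date]) -> List[Optional[int]]:
    """Full days off between consecutive games; None for the first game."""
    if not game_dates:
        return []
    out: List[Optional[int]] = [None]
    for prev, cur in zip(game_dates, game_dates[1:]):
        out.append((cur - prev).days - 1)
    return out


def is_back_to_back(rest: Optional[int]) -> bool:
    """A back-to-back is a game played with zero days of rest."""
    return rest == 0
```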
4. data_collector.py - Historical Data (27KB, 650 lines)
Collects comprehensive NBA data from official API.
Classes:
| Class | Data Collected |
|---|---|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |
Key Features:
- Exponential backoff retry for rate limiting
- Per-season parquet caching
- Checkpoint system for resumable collection
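A generic sketch of exponential backoff, the retry pattern named in the list above. The `with_backoff` helper, its default delays, the jitter, and the decision to retry on any exception are assumptions for illustration; the collectors may use different limits or catch narrower errors.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(fetch: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 1.0, max_delay: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), doubling the wait after each failure, up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, for example `with_backoff(fetch_season, sleep=lambda s: None)` in a unit test, where `fetch_season` is whatever zero-argument callable performs the request.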
5. live_data_collector.py - Real-Time Data (9KB, 236 lines)
Uses nba_api.live endpoints for real-time game data.
Key Methods:
get_live_scoreboard() # Today's games with live scores
get_game_boxscore(id) # Detailed box score
get_games_by_status() # Filter: NOT_STARTED, IN_PROGRESS, FINAL
Data Fields Returned:
- game_id, game_code
- home_team, away_team (tricodes)
- home_score, away_score
- period, clock
- status
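The fields listed above could be modelled as a small record type, with status filtering on top. The `LiveGame` dataclass, the field types, and the `group_by_status` helper are hypothetical; the actual collector may return plain dictionaries with differently typed values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class LiveGame:
    """One scoreboard entry; field names follow the list above, types are assumed."""
    game_id: str
    game_code: str
    home_team: str   # tricode, e.g. "BOS"
    away_team: str   # tricode
    home_score: int
    away_score: int
    period: int
    clock: str
    status: str      # NOT_STARTED, IN_PROGRESS, or FINAL


def group_by_status(games: Iterable[LiveGame]) -> Dict[str, List[LiveGame]]:
    """Bucket games by status, always including the three documented keys."""
    grouped: Dict[str, List[LiveGame]] = {
        "NOT_STARTED": [], "IN_PROGRESS": [], "FINAL": [],
    }
    for game in games:
        grouped.setdefault(game.status, []).append(game)
    return grouped
```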
6. prediction_tracker.py - Persistence (20KB, 508 lines)
Stores predictions and tracks accuracy using ChromaDB Cloud.
Features:
- ChromaDB Cloud integration (with local JSON fallback)
- Prediction storage before games start
- Result updating after games complete
- Comprehensive accuracy statistics
Key Methods:
save_prediction(game_id, prediction) # Store pre-game prediction
update_result(game_id, winner, scores) # Update with final result
get_accuracy_stats() # Overall, by confidence, by team
get_pending_predictions() # Awaiting results
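To make the save, update, and accuracy cycle concrete, here is a sketch of a JSON-file store with the same method names. It loosely mirrors the local fallback only. The `LocalPredictionStore` class, the record layout, and the `predicted_winner` key are assumptions, and the ChromaDB path is not shown.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class LocalPredictionStore:
    """Hypothetical JSON-file store using the method names listed above."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def _load(self) -> Dict[str, dict]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def _dump(self, records: Dict[str, dict]) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(records, indent=2), encoding="utf-8")

    def save_prediction(self, game_id: str, prediction: dict) -> None:
        records = self._load()
        records[game_id] = {"prediction": prediction,
                            "actual_winner": None, "scores": None}
        self._dump(records)

    def update_result(self, game_id: str, winner: str,
                      scores: Optional[dict] = None) -> bool:
        """Attach the final result; returns False if no prediction was stored."""
        records = self._load()
        if game_id not in records:
            return False
        records[game_id]["actual_winner"] = winner
        records[game_id]["scores"] = scores
        self._dump(records)
        return True

    def get_pending_predictions(self) -> List[str]:
        return [gid for gid, rec in self._load().items()
                if rec["actual_winner"] is None]

    def get_accuracy_stats(self) -> dict:
        done = [r for r in self._load().values() if r["actual_winner"] is not None]
        correct = sum(1 for r in done
                      if r["prediction"].get("predicted_winner") == r["actual_winner"])
        return {"total": len(done), "correct": correct,
                "accuracy": correct / len(done) if done else None}
```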
7. models/game_predictor.py - ML Model (12KB, 332 lines)
XGBoost + LightGBM ensemble classifier.
Architecture:
Input Features ──┬──► XGBoost ───┐
                 │               ├──► Weighted Average ──► Win Probability
                 └──► LightGBM ──┘
                        (50/50 weight)
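The weighted-average step in the diagram reduces to a few lines. This sketch takes the two models' win probabilities as plain lists so it runs without XGBoost or LightGBM installed; the function names and the `xgb_weight` parameter are illustrative, with 0.5 matching the 50/50 weight shown.

```python
from typing import List, Sequence


def ensemble_win_probability(xgb_probs: Sequence[float],
                             lgbm_probs: Sequence[float],
                             xgb_weight: float = 0.5) -> List[float]:
    """Weighted average of two models' win probabilities, game by game."""
    if len(xgb_probs) != len(lgbm_probs):
        raise ValueError("both models must score the same number of games")
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be between 0 and 1")
    return [xgb_weight * a + (1.0 - xgb_weight) * b
            for a, b in zip(xgb_probs, lgbm_probs)]


def to_proba_rows(win_probs: Sequence[float]) -> List[List[float]]:
    """Arrange results as [loss_prob, win_prob] rows."""
    return [[1.0 - p, p] for p in win_probs]
```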
Key Methods:
train(X_train, y_train, X_val, y_val) # Train both models
predict_proba(X) # Get [loss_prob, win_prob]
predict_with_confidence(X) # Detailed prediction info
explain_prediction(X) # Feature importance for prediction
save() / load() # Persist to models/game_predictor.joblib
NOTE: The model exists, but predict_game() does not use it!
8. auto_trainer.py & continuous_learner.py - Auto Training
AutoTrainer (Singleton scheduler):
- Runs background loop checking for tasks
- Ingests completed games every hour
- Smart retraining: only after ALL daily games complete
- If new accuracy < old accuracy, reverts model
ContinuousLearner (Update workflow):
ingest_completed_games() ──► update_features() ──► retrain_model()
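The two rules above (retrain only once every game of the day is final, and revert when the new model scores lower) can be sketched as follows. The function names and the callable-based wiring are hypothetical; the three workflow steps are passed in so the example stays independent of the real classes.

```python
from typing import Any, Callable, Sequence, Tuple


def all_games_final(statuses: Sequence[str]) -> bool:
    """True only when there is at least one game and every game is FINAL."""
    return bool(statuses) and all(s == "FINAL" for s in statuses)


def choose_model(old_model: Any, old_accuracy: float,
                 new_model: Any, new_accuracy: float) -> Tuple[Any, float]:
    """Keep the new model unless it scores lower than the old one."""
    if new_accuracy < old_accuracy:
        return old_model, old_accuracy  # revert
    return new_model, new_accuracy


def retrain_cycle(statuses: Sequence[str], current_model: Any,
                  current_accuracy: float,
                  ingest: Callable[[], Any],
                  update_features: Callable[[Any], Any],
                  retrain: Callable[[Any], Tuple[Any, float]]) -> Tuple[Any, float]:
    """Run ingest -> update features -> retrain, gated on the day's games."""
    if not all_games_final(statuses):
        return current_model, current_accuracy  # games still pending: do nothing
    games = ingest()
    features = update_features(games)
    new_model, new_accuracy = retrain(features)
    return choose_model(current_model, current_accuracy, new_model, new_accuracy)
```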
Database & Storage
ChromaDB Cloud
- Purpose: Persistent prediction storage
- Credentials: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
- Fallback: `data/processed/predictions_local.json`
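One plausible way to decide between the cloud store and the local fallback is to check whether all three credentials are present. This is an assumption about the trigger: the tracker may instead fall back when a connection attempt fails. The `choose_backend` helper and its return shape are hypothetical.

```python
import os
from typing import Mapping, Optional

REQUIRED_VARS = ("CHROMA_TENANT", "CHROMA_DATABASE", "CHROMA_API_KEY")
FALLBACK_PATH = "data/processed/predictions_local.json"


def choose_backend(env: Optional[Mapping[str, str]] = None) -> dict:
    """Pick ChromaDB Cloud when all credentials are set, else the JSON file."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        return {"backend": "local_json", "path": FALLBACK_PATH, "missing": missing}
    # The API key is deliberately left out of the returned summary.
    return {"backend": "chromadb_cloud",
            "tenant": env["CHROMA_TENANT"],
            "database": env["CHROMA_DATABASE"]}
```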
Parquet Files
- `data/api_data/*.parquet` - Cached API responses
- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)
Joblib Files
- `models/game_predictor.joblib` - Trained ML model
- `data/processed/game_dataset.joblib` - Processed training data
Frontend Architecture
React + Vite with custom CSS design system.
Pages:
| Page | File | Purpose |
|---|---|---|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |
Key Frontend Components:
- `TeamLogo.jsx` - Official NBA team logos
- `api.js` - API client with base URL handling
- `index.css` - Complete design system (27KB)
Configuration (src/config.py)
Critical Settings:
# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}
# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")
# Feature engineering
FEATURE_CONFIG = {
"rolling_windows": [5, 10, 20],
"min_games_for_features": 5
}
# ELO system
ELO_CONFIG = {
"initial_rating": 1500,
"k_factor": 20,
"home_advantage": 100
}
Known Issues & Technical Debt
- ML Model Not Used: `predict_game()` uses a formula, not the trained `GamePredictor`
- Season Hardcoding: Some places use `2025-26` explicitly
- Fallback Data: Pipeline has hardcoded rosters as backup
- Function Order: `warm_starter_cache()` must be defined before the scheduler calls it
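For the season-hardcoding item, one option is to derive the season label from the current date instead of writing `2025-26` inline. The `season_label` helper is hypothetical, and the October rollover is an assumption about when a new season should start counting; it would need adjusting for seasons with an unusual start date.

```python
from datetime import date


def season_label(today: date, rollover_month: int = 10) -> str:
    """Return an NBA-style season string such as '2025-26' for a given date."""
    # Dates before the rollover month still belong to the season that
    # started in the previous calendar year.
    start_year = today.year if today.month >= rollover_month else today.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"
```

Called as `season_label(date.today())`, this replaces the literal wherever the current season is needed, and the modulo keeps the suffix correct across a century boundary.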
Deployment Notes
Hugging Face Spaces:
- Uses persistent `/data` directory for storage
- Dockerfile copies `models/` and `data/api_data/`
- Git LFS for large files (`.joblib`, `.parquet`)
- Port 7860 for HF Spaces
Environment Variables:
CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths
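A minimal sketch of how the two path overrides might be resolved. Only the variable names come from the list above; the `resolve_dir` helper and the idea that an unset or empty variable falls back to the repository-relative default are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dir(env_var: str, default: str,
                env: Optional[Mapping[str, str]] = None) -> Path:
    """Use the environment override when set and non-empty, else the default."""
    env = os.environ if env is None else env
    return Path(env.get(env_var) or default)


# Defaults here are placeholders for the example, not the values in config.py.
DATA_DIR = resolve_dir("NBA_ML_DATA_DIR", "data")
MODELS_DIR = resolve_dir("NBA_ML_MODELS_DIR", "models")
```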
Quick Reference: Common Tasks
Add new API endpoint:
- Add route in `server.py` (production) AND `api/api.py` (development)
- Add frontend call in `web/src/api.js`
- Create/update page component in `web/src/pages/`
Modify prediction algorithm:
- Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py`
- Consider blending with `GamePredictor` model
Update ML model:
- Retrain via `ContinuousLearner.retrain_model()`
- Or trigger via `POST /api/admin/retrain`
Add new feature:
- Add to `FeatureGenerator` in `feature_engineering.py`
- Update preprocessing pipeline
- Retrain model
Last updated: January 2026