# NBA Sage - Complete Codebase Context for AI Assistants

> **Purpose**: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.

---

## 🏗️ Project Architecture Overview

```
NBA ML/
├── server.py                    # Main production server (Flask + React)
├── api/api.py                   # Development API server
│
├── src/                         # Core Python modules
│   ├── prediction_pipeline.py   # Main prediction orchestrator
│   ├── feature_engineering.py   # ELO + feature generation
│   ├── data_collector.py        # Historical NBA API data
│   ├── live_data_collector.py   # Real-time game data
│   ├── injury_collector.py      # Player injury tracking
│   ├── prediction_tracker.py    # ChromaDB prediction storage
│   ├── auto_trainer.py          # Automated training scheduler
│   ├── continuous_learner.py    # Incremental model updates
│   ├── preprocessing.py         # Data preprocessing
│   ├── config.py                # Global configuration
│   └── models/
│       ├── game_predictor.py    # XGBoost + LightGBM ensemble
│       ├── mvp_predictor.py     # MVP prediction model
│       └── championship_predictor.py
│
├── web/                         # React Frontend
│   └── src/
│       ├── App.jsx              # Main app with sidebar navigation
│       ├── pages/               # LiveGames, Predictions, MVP, etc.
│       ├── api.js               # API client
│       └── index.css            # Comprehensive CSS design system
│
├── data/
│   ├── api_data/                # Cached NBA API responses (parquet)
│   ├── processed/               # Processed datasets (joblib)
│   └── raw/                     # Raw game data
│
└── models/
    └── game_predictor.joblib    # Trained ML model
```

---

## 🔄 Data Flow

```
NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                                            │
                                                            ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                        │
                                        ▼
                        Prediction Tracker (ChromaDB)
```

---

## 📦 Core Components Deep Dive

### 1. `server.py` - Production Server (39KB, 929 lines)

**Critical for Hugging Face deployment.**
**Combines Flask API + React static serving.**

**Key Sections:**

- **Cache Configuration (lines 30-40)**: In-memory caching for rosters, predictions, live games
- **Startup Cache Warming (lines 140-225)**: `warm_starter_cache()` fetches all 30 team rosters on startup
- **Background Scheduler (lines 340-370)**: APScheduler jobs for ELO updates, retraining, prediction sync
- **API Endpoints (lines 400-860)**: All REST endpoints for frontend

**Important Functions:**

```python
warm_starter_cache()        # Fetches real NBA API data for all teams
startup_cache_warming()     # Runs synchronously on server start
auto_retrain_model()        # Smart retraining after all daily games complete
sync_prediction_results()   # Updates prediction correctness from final scores
update_elo_ratings()        # Daily ELO recalculation
```

**Endpoints:**

- `GET /api/live-games` - Today's games with predictions
- `GET /api/roster/` - Team's projected starting 5
- `GET /api/accuracy` - Model accuracy statistics
- `GET /api/mvp` - MVP race standings
- `GET /api/championship` - Championship odds

---

### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)

**The heart of the system.**
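One implementation detail worth knowing up front: the pipeline lazy-loads its `GamePredictor` (the `self._game_model` property) rather than paying the model-load cost at import time. A minimal sketch of that pattern, using a hypothetical `PipelineSketch` class rather than the actual implementation:

```python
from functools import cached_property

class PipelineSketch:
    """Hypothetical sketch of lazy model loading; not the real PredictionPipeline."""
    load_count = 0  # tracks how many times the expensive load ran

    @cached_property
    def game_model(self):
        # The real pipeline would joblib-load models/game_predictor.joblib here
        PipelineSketch.load_count += 1
        return {"name": "game_predictor"}

p = PipelineSketch()
p.game_model  # first access triggers the load
p.game_model  # subsequent accesses reuse the cached object
```

`cached_property` keeps the first computed value on the instance, so repeated accesses never re-load the model.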
**Orchestrates all predictions.**

**Key Properties:**

```python
self.live_collector    # LiveDataCollector instance
self.injury_collector  # InjuryCollector instance
self.feature_gen       # FeatureGenerator instance
self.tracker           # PredictionTracker (ChromaDB)
self._game_model       # Lazy-loaded GamePredictor
```

**Important Methods:**

| Method | Purpose |
|--------|---------|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |

**⚠️ CRITICAL: Prediction Algorithm (lines 349-504)**

The `predict_game()` method uses a **formula-based approach**, NOT the trained ML model:

```python
# Weights in predict_game():
home_talent = (
    0.40 * home_win_pct +       # Current season record
    0.30 * home_form +          # Last 10 games
    0.20 * home_elo_strength +  # Historical ELO
    0.10 * 0.5                  # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability
```

The trained `GamePredictor` model exists but is NOT called for live predictions.
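Put end to end, the formula-based approach looks roughly like the following self-contained sketch. The weights and the +3.5% home-court bump mirror the ones documented above; the input numbers are made up for illustration, and the injury adjustment is omitted:

```python
def log5(p_a: float, p_b: float) -> float:
    """Log5: probability that a team with true win rate p_a beats one with p_b."""
    return (p_a * (1 - p_b)) / (p_a * (1 - p_b) + p_b * (1 - p_a))

def talent(win_pct: float, form: float, elo_strength: float) -> float:
    # Same weighting as documented for predict_game()
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5

# Hypothetical team inputs (all on a 0-1 scale)
home = talent(win_pct=0.65, form=0.70, elo_strength=0.60)
away = talent(win_pct=0.50, form=0.40, elo_strength=0.55)

# Head-to-head probability via Log5, plus the home-court bump
home_win_prob = log5(home, away) + 0.035
```

With these inputs the home side lands at roughly a two-thirds win probability; note that adding a flat 3.5% after Log5 means the result is not strictly bounded by 1.0 for extreme inputs, which is one reason to treat this as a heuristic rather than a calibrated model.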
---

### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)

**Contains the ELO system and all feature generation logic.**

**Classes:**

| Class | Purpose | Key Methods |
|-------|---------|-------------|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |

**ELO Configuration:**

```python
initial_rating = 1500
k_factor = 20
home_advantage = 100      # ELO points for home court
regression_factor = 0.25  # Season regression to mean
```

**Feature Types Generated:**

- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
- Rolling averages (5, 10, 20 game windows)
- Rest days, back-to-back detection
- Season record features
- Head-to-head history

---

### 4. `data_collector.py` - Historical Data (27KB, 650 lines)

**Collects comprehensive NBA data from the official API.**

**Classes:**

| Class | Data Collected |
|-------|----------------|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |

**Key Features:**

- Exponential backoff retry for rate limiting
- Per-season parquet caching
- Checkpoint system for resumable collection
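The exponential-backoff retry behavior can be sketched as follows. This is a minimal illustration, not the collector's actual API; `fetch_with_backoff` and its parameters are hypothetical, and the `sleep` hook is injectable so the delay schedule can be tested without waiting:

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `fetch` with exponentially growing delays (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the rate-limit error
            sleep(base_delay * (2 ** attempt))
```

The doubling delay gives the NBA API time to stop rate-limiting before the next attempt, and the final attempt re-raises so callers still see persistent failures.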
---

### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)

**Uses `nba_api.live` endpoints for real-time game data.**

**Key Methods:**

```python
get_live_scoreboard()   # Today's games with live scores
get_game_boxscore(id)   # Detailed box score
get_games_by_status()   # Filter: NOT_STARTED, IN_PROGRESS, FINAL
```

**Data Fields Returned:**

- game_id, game_code
- home_team, away_team (tricodes)
- home_score, away_score
- period, clock
- status

---

### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines)

**Stores predictions and tracks accuracy using ChromaDB Cloud.**

**Features:**

- ChromaDB Cloud integration (with local JSON fallback)
- Prediction storage before games start
- Result updating after games complete
- Comprehensive accuracy statistics

**Key Methods:**

```python
save_prediction(game_id, prediction)    # Store pre-game prediction
update_result(game_id, winner, scores)  # Update with final result
get_accuracy_stats()                    # Overall, by confidence, by team
get_pending_predictions()               # Awaiting results
```

---

### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines)

**XGBoost + LightGBM ensemble classifier.**

**Architecture:**

```
Input Features ──┬──► XGBoost  ──┐
                 │               │──► Weighted Average ──► Win Probability
                 └──► LightGBM ──┘         (50/50 weight)
```

**Key Methods:**

```python
train(X_train, y_train, X_val, y_val)  # Train both models
predict_proba(X)                       # Get [loss_prob, win_prob]
predict_with_confidence(X)             # Detailed prediction info
explain_prediction(X)                  # Feature importance for prediction
save() / load()                        # Persist to models/game_predictor.joblib
```

**⚠️ NOTE: Model exists but `predict_game()` doesn't use it!**
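The 50/50 averaging step is simple enough to sketch directly. Plain floats stand in for the real XGBoost/LightGBM outputs, and `ensemble_win_prob` is a hypothetical name, not the module's actual function:

```python
def ensemble_win_prob(xgb_prob: float, lgbm_prob: float,
                      w_xgb: float = 0.5, w_lgbm: float = 0.5) -> float:
    """Weighted average of the two models' win probabilities (50/50 by default)."""
    return w_xgb * xgb_prob + w_lgbm * lgbm_prob

# e.g. XGBoost says 0.62, LightGBM says 0.58 -> ensemble gives 0.60
blended = ensemble_win_prob(0.62, 0.58)
```

Averaging two differently-biased gradient-boosting models is a common variance-reduction trick; the weights could later be tuned on validation accuracy instead of fixed at 50/50.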
---

### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training

**AutoTrainer** (Singleton scheduler):

- Runs background loop checking for tasks
- Ingests completed games every hour
- Smart retraining: only after ALL daily games complete
- If new accuracy < old accuracy, reverts model

**ContinuousLearner** (Update workflow):

```
ingest_completed_games() ──► update_features() ──► retrain_model()
```

---

## 🗄️ Database & Storage

### ChromaDB Cloud

- **Purpose**: Persistent prediction storage
- **Credentials**: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
- **Fallback**: `data/processed/predictions_local.json`

### Parquet Files

- `data/api_data/*.parquet` - Cached API responses
- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)

### Joblib Files

- `models/game_predictor.joblib` - Trained ML model
- `data/processed/game_dataset.joblib` - Processed training data

---

## 🌐 Frontend Architecture

**React + Vite with a custom CSS design system.**

**Pages:**

| Page | File | Purpose |
|------|------|---------|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |

**Key Frontend Components:**

- `TeamLogo.jsx` - Official NBA team logos
- `api.js` - API client with base URL handling
- `index.css` - Complete design system (27KB)

---

## 🔧 Configuration (`src/config.py`)

**Critical Settings:**

```python
# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}

# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")

# Feature engineering
FEATURE_CONFIG = {
    "rolling_windows": [5, 10, 20],
}
```
"min_games_for_features": 5 } # ELO system ELO_CONFIG = { "initial_rating": 1500, "k_factor": 20, "home_advantage": 100 } ``` --- ## ⚠️ Known Issues & Technical Debt 1. **ML Model Not Used**: `predict_game()` uses formula, not trained `GamePredictor` 2. **Season Hardcoding**: Some places use `2025-26` explicitly 3. **Fallback Data**: Pipeline has hardcoded rosters as backup 4. **Function Order**: `warm_starter_cache()` must be defined before scheduler calls it --- ## 🚀 Deployment Notes **Hugging Face Spaces:** - Uses persistent `/data` directory for storage - Dockerfile copies `models/` and `data/api_data/` - Git LFS for large files (`.joblib`, `.parquet`) - Port 7860 for HF Spaces **Environment Variables:** ``` CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths ``` --- ## 📋 Quick Reference: Common Tasks **Add new API endpoint:** 1. Add route in `server.py` (production) AND `api/api.py` (development) 2. Add frontend call in `web/src/api.js` 3. Create/update page component in `web/src/pages/` **Modify prediction algorithm:** 1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py` 2. Consider blending with `GamePredictor` model **Update ML model:** 1. Retrain via `ContinuousLearner.retrain_model()` 2. Or trigger via `POST /api/admin/retrain` **Add new feature:** 1. Add to `FeatureGenerator` in `feature_engineering.py` 2. Update preprocessing pipeline 3. Retrain model --- *Last updated: January 2026*