| # NBA Sage - Complete Codebase Context for AI Assistants | |
| > **Purpose**: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications. | |
| --- | |
| ## Project Architecture Overview | |
| ``` | |
| NBA ML/ | |
| ├── server.py                      # Main production server (Flask + React) | |
| ├── api/api.py                     # Development API server | |
| │ | |
| ├── src/                           # Core Python modules | |
| │   ├── prediction_pipeline.py     # Main prediction orchestrator | |
| │   ├── feature_engineering.py     # ELO + feature generation | |
| │   ├── data_collector.py          # Historical NBA API data | |
| │   ├── live_data_collector.py     # Real-time game data | |
| │   ├── injury_collector.py        # Player injury tracking | |
| │   ├── prediction_tracker.py      # ChromaDB prediction storage | |
| │   ├── auto_trainer.py            # Automated training scheduler | |
| │   ├── continuous_learner.py      # Incremental model updates | |
| │   ├── preprocessing.py           # Data preprocessing | |
| │   ├── config.py                  # Global configuration | |
| │   └── models/ | |
| │       ├── game_predictor.py      # XGBoost+LightGBM ensemble | |
| │       ├── mvp_predictor.py       # MVP prediction model | |
| │       └── championship_predictor.py | |
| │ | |
| ├── web/                           # React Frontend | |
| │   └── src/ | |
| │       ├── App.jsx                # Main app with sidebar navigation | |
| │       ├── pages/                 # LiveGames, Predictions, MVP, etc. | |
| │       ├── api.js                 # API client | |
| │       └── index.css              # Comprehensive CSS design system | |
| │ | |
| ├── data/ | |
| │   ├── api_data/                  # Cached NBA API responses (parquet) | |
| │   ├── processed/                 # Processed datasets (joblib) | |
| │   └── raw/                       # Raw game data | |
| │ | |
| └── models/ | |
|     └── game_predictor.joblib      # Trained ML model | |
| ``` | |
| --- | |
| ## Data Flow | |
| ``` | |
| NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training | |
|                                          │ | |
|                                          ▼ | |
| Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI | |
|                                          │ | |
|                                          ▼ | |
|                            Prediction Tracker (ChromaDB) | |
| ``` | |
| --- | |
| ## Core Components Deep Dive | |
| ### 1. `server.py` - Production Server (39KB, 929 lines) | |
| **Critical for Hugging Face deployment. Combines Flask API + React static serving.** | |
| **Key Sections:** | |
| - **Cache Configuration (lines 30-40)**: In-memory caching for rosters, predictions, live games | |
| - **Startup Cache Warming (lines 140-225)**: `warm_starter_cache()` fetches all 30 team rosters on startup | |
| - **Background Scheduler (lines 340-370)**: APScheduler jobs for ELO updates, retraining, prediction sync | |
| - **API Endpoints (lines 400-860)**: All REST endpoints for frontend | |
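The in-memory caching mentioned under Cache Configuration could look roughly like the sketch below. This is not the code in `server.py`: the `TTLCache` class, its method names, and the expiry policy are all hypothetical and only show one common way to cache rosters, predictions, and live games with a time-to-live.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds.

    Hypothetical helper for illustration; server.py may cache differently.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._store = {}         # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry
            return default
        return value

    def set(self, key, value):
        self._store[key] = (self._clock() + self.ttl_seconds, value)
```

A roster cache might use a long TTL (rosters change rarely) while a live-games cache would use a TTL of a few seconds; those values are a design choice, not something this document specifies.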
| **Important Functions:** | |
| ```python | |
| warm_starter_cache() # Fetches real NBA API data for all teams | |
| startup_cache_warming() # Runs synchronously on server start | |
| auto_retrain_model() # Smart retraining after all daily games complete | |
| sync_prediction_results() # Updates prediction correctness from final scores | |
| update_elo_ratings() # Daily ELO recalculation | |
| ``` | |
| **Endpoints:** | |
| - `GET /api/live-games` - Today's games with predictions | |
| - `GET /api/roster/<team>` - Team's projected starting 5 | |
| - `GET /api/accuracy` - Model accuracy statistics | |
| - `GET /api/mvp` - MVP race standings | |
| - `GET /api/championship` - Championship odds | |
| --- | |
| ### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines) | |
| **The heart of the system. Orchestrates all predictions.** | |
| **Key Properties:** | |
| ```python | |
| self.live_collector # LiveDataCollector instance | |
| self.injury_collector # InjuryCollector instance | |
| self.feature_gen # FeatureGenerator instance | |
| self.tracker # PredictionTracker (ChromaDB) | |
| self._game_model # Lazy-loaded GamePredictor | |
| ``` | |
| **Important Methods:** | |
| | Method | Purpose | | |
| |--------|---------| | |
| | `predict_game(home, away)` | Generate single game prediction | | |
| | `get_upcoming_games(days)` | Fetch future NBA schedule | | |
| | `get_mvp_race()` | Calculate MVP standings from live stats | | |
| | `get_championship_odds()` | Calculate championship probabilities | | |
| | `get_team_roster(team)` | Fast fallback roster data | | |
| **⚠️ CRITICAL: Prediction Algorithm (lines 349-504)** | |
| The `predict_game()` method uses a **formula-based approach**, NOT the trained ML model: | |
| ```python | |
| # Weights in predict_game(): | |
| home_talent = ( | |
| 0.40 * home_win_pct + # Current season record | |
| 0.30 * home_form + # Last 10 games | |
| 0.20 * home_elo_strength + # Historical ELO | |
| 0.10 * 0.5 # Baseline | |
| ) | |
| # Plus: +3.5% home court, -2% per injury point | |
| # Uses Log5 formula for head-to-head probability | |
| ``` | |
| The trained `GamePredictor` model exists but is NOT called for live predictions. | |
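To make the formula concrete, here is a minimal sketch of the weighted talent score and the Log5 head-to-head step. The weights come from the block above and Log5 is the standard formula; how `predict_game()` actually applies the home-court and injury adjustments is not shown in this document, so applying them to the talent scores before Log5 (and clamping to avoid degenerate probabilities) is an assumption. All function names are hypothetical.

```python
def talent_score(win_pct, form, elo_strength):
    """Blend the inputs using the documented weights (each input in [0, 1])."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a, p_b):
    """Log5 estimate of the probability that team A beats team B."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:  # both strengths are 0 or both are 1: no information
        return 0.5
    return (p_a - p_a * p_b) / denom


def _clamp(x, lo=0.01, hi=0.99):
    return min(max(x, lo), hi)


def home_win_probability(home_talent, away_talent,
                         home_injury_pts=0.0, away_injury_pts=0.0):
    # Assumption: +3.5% home court and -2% per injury point are applied
    # to the talent scores before the Log5 step.
    home = home_talent + 0.035 - 0.02 * home_injury_pts
    away = away_talent - 0.02 * away_injury_pts
    return log5(_clamp(home), _clamp(away))
```

With two evenly matched, healthy teams (`home_talent == away_talent == 0.5`), this sketch gives the home side a probability of 0.535, which is just the home-court bump.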
| --- | |
| ### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines) | |
| **Contains ELO system and all feature generation logic.** | |
| **Classes:** | |
| | Class | Purpose | Key Methods | | |
| |-------|---------|-------------| | |
| | `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` | | |
| | `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` | | |
| | `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` | | |
| | `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` | | |
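As an illustration of what per-season z-score normalization means, the sketch below fits a mean and standard deviation for each season and expresses a stat relative to that season's league distribution. The class name `SeasonZScore` and the method signatures are assumptions; only the method names `fit_season()` and `transform()` come from the table above.

```python
from statistics import mean, pstdev


class SeasonZScore:
    """Hypothetical stand-in for EraNormalizer: z-scores within a season."""

    def __init__(self):
        self._params = {}  # season -> (mean, std)

    def fit_season(self, season, values):
        mu = mean(values)
        sigma = pstdev(values)
        # Guard against a zero spread so transform() never divides by zero.
        self._params[season] = (mu, sigma if sigma > 0 else 1.0)

    def transform(self, season, value):
        mu, sigma = self._params[season]
        return (value - mu) / sigma
```

The point of the technique is that a team scoring 110 points per game in a low-scoring era and one scoring 125 in a high-scoring era can map to the same normalized value.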
| **ELO Configuration:** | |
| ```python | |
| initial_rating = 1500 | |
| k_factor = 20 | |
| home_advantage = 100 # ELO points for home court | |
| regression_factor = 0.25 # Season regression to mean | |
| ``` | |
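A minimal sketch of how these four settings typically interact in an Elo system is shown below. It uses the standard Elo expected-score formula; whether `ELOCalculator` also applies a margin-of-victory multiplier, and whether the regression factor moves ratings 25% of the way toward the mean (as assumed here), is not stated in this document. Function names are hypothetical.

```python
INITIAL_RATING = 1500
K_FACTOR = 20
HOME_ADVANTAGE = 100      # Elo points added to the home team
REGRESSION_FACTOR = 0.25  # assumed: fraction of the gap to the mean removed


def expected_home_win(home_elo, away_elo):
    """Standard Elo expectation with the home-court bonus applied."""
    diff = (home_elo + HOME_ADVANTAGE) - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400))


def update_ratings(home_elo, away_elo, home_won):
    """Return the (home, away) ratings after one game."""
    expected = expected_home_win(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(elo):
    """Pull a rating toward the initial rating between seasons."""
    return elo + REGRESSION_FACTOR * (INITIAL_RATING - elo)
```

With equal ratings, the 100-point home advantage alone yields an expected home win probability of about 0.64, and the update is zero-sum: whatever the home team gains, the away team loses.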
| **Feature Types Generated:** | |
| - ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win) | |
| - Rolling averages (5, 10, 20 game windows) | |
| - Rest days, back-to-back detection | |
| - Season record features | |
| - Head-to-head history | |
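For the rolling-average features, one detail worth illustrating is that each game's feature should only use games played before it, otherwise the target leaks into the input. The helper below is a hypothetical pure-Python sketch of that idea, not the `FeatureGenerator` implementation.

```python
def rolling_pregame_means(values, window, min_games=1):
    """For each game i, average up to `window` games strictly before i.

    Returns None where fewer than `min_games` prior games exist.
    """
    out = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]  # excludes game i itself
        if len(prior) >= min_games:
            out.append(sum(prior) / len(prior))
        else:
            out.append(None)
    return out
```

For example, `rolling_pregame_means([100, 110, 120, 130], window=2)` yields `[None, 100.0, 105.0, 115.0]`. The windows of 5, 10, and 20 games listed above would each be computed this way, one call per window.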
| --- | |
| ### 4. `data_collector.py` - Historical Data (27KB, 650 lines) | |
| **Collects comprehensive NBA data from official API.** | |
| **Classes:** | |
| | Class | Data Collected | | |
| |-------|---------------| | |
| | `GameDataCollector` | Game results per season | | |
| | `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) | | |
| | `PlayerDataCollector` | Player stats | | |
| | `CacheManager` | Parquet file caching | | |
| **Key Features:** | |
| - Exponential backoff retry for rate limiting | |
| - Per-season parquet caching | |
| - Checkpoint system for resumable collection | |
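The exponential backoff retry listed above can be sketched as follows. This is a generic pattern under stated assumptions: the function name, the delay parameters, and the jitter amount are hypothetical, and the real collector may catch narrower exception types.

```python
import random
import time


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fetch(), retrying on failure with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; in production the default `time.sleep` applies.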
| --- | |
| ### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines) | |
| **Uses `nba_api.live` endpoints for real-time game data.** | |
| **Key Methods:** | |
| ```python | |
| get_live_scoreboard() # Today's games with live scores | |
| get_game_boxscore(id) # Detailed box score | |
| get_games_by_status() # Filter: NOT_STARTED, IN_PROGRESS, FINAL | |
| ``` | |
| **Data Fields Returned:** | |
| - game_id, game_code | |
| - home_team, away_team (tricodes) | |
| - home_score, away_score | |
| - period, clock | |
| - status | |
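The sketch below shows how a raw live-scoreboard game might be flattened into the fields listed above and then filtered by status. The raw key names (`gameId`, `gameStatus`, `homeTeam.teamTricode`, and so on) and the 1/2/3 status codes are assumptions about the live payload shape, and both function names are hypothetical rather than taken from `live_data_collector.py`.

```python
# Assumed mapping from the payload's numeric status to the names used here.
STATUS_BY_CODE = {1: "NOT_STARTED", 2: "IN_PROGRESS", 3: "FINAL"}


def normalize_game(raw):
    """Flatten one raw scoreboard game dict into the documented fields."""
    return {
        "game_id": raw["gameId"],
        "game_code": raw["gameCode"],
        "home_team": raw["homeTeam"]["teamTricode"],
        "away_team": raw["awayTeam"]["teamTricode"],
        "home_score": raw["homeTeam"]["score"],
        "away_score": raw["awayTeam"]["score"],
        "period": raw["period"],
        "clock": raw["gameClock"],
        "status": STATUS_BY_CODE.get(raw["gameStatus"], "UNKNOWN"),
    }


def games_by_status(games, status):
    """Keep only normalized games with the given status string."""
    return [g for g in games if g["status"] == status]
```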
| --- | |
| ### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines) | |
| **Stores predictions and tracks accuracy using ChromaDB Cloud.** | |
| **Features:** | |
| - ChromaDB Cloud integration (with local JSON fallback) | |
| - Prediction storage before games start | |
| - Result updating after games complete | |
| - Comprehensive accuracy statistics | |
| **Key Methods:** | |
| ```python | |
| save_prediction(game_id, prediction) # Store pre-game prediction | |
| update_result(game_id, winner, scores) # Update with final result | |
| get_accuracy_stats() # Overall, by confidence, by team | |
| get_pending_predictions() # Awaiting results | |
| ``` | |
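To illustrate what "overall" and "by confidence" accuracy could mean, here is a small sketch that works on plain dictionaries. The record keys (`predicted_winner`, `confidence`, `actual_winner`) and the confidence cut-offs are hypothetical; the stored schema and the buckets used by `get_accuracy_stats()` are not given in this document.

```python
def confidence_bucket(confidence):
    # Hypothetical cut-offs, chosen only for illustration.
    if confidence >= 0.70:
        return "high"
    if confidence >= 0.60:
        return "medium"
    return "low"


def accuracy_stats(predictions):
    """Summarize a list of prediction dicts; pending games are excluded."""
    completed = [p for p in predictions if p.get("actual_winner") is not None]
    by_bucket = {}
    for p in completed:
        stats = by_bucket.setdefault(confidence_bucket(p["confidence"]),
                                     {"total": 0, "correct": 0})
        stats["total"] += 1
        stats["correct"] += int(p["predicted_winner"] == p["actual_winner"])
    correct = sum(s["correct"] for s in by_bucket.values())
    return {
        "total": len(completed),
        "pending": len(predictions) - len(completed),
        "accuracy": correct / len(completed) if completed else None,
        "by_confidence": by_bucket,
    }
```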
| --- | |
| ### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines) | |
| **XGBoost + LightGBM ensemble classifier.** | |
| **Architecture:** | |
| ``` | |
| Input Features ──┬──► XGBoost ───┐ | |
|                  │               ├──► Weighted Average ──► Win Probability | |
|                  └──► LightGBM ──┘ | |
|                                  (50/50 weight) | |
| ``` | |
| **Key Methods:** | |
| ```python | |
| train(X_train, y_train, X_val, y_val) # Train both models | |
| predict_proba(X) # Get [loss_prob, win_prob] | |
| predict_with_confidence(X) # Detailed prediction info | |
| explain_prediction(X) # Feature importance for prediction | |
| save() / load() # Persist to models/game_predictor.joblib | |
| ``` | |
| **⚠️ NOTE: Model exists but `predict_game()` doesn't use it!** | |
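The 50/50 weighted average in the architecture diagram amounts to the following arithmetic. This sketch takes the per-game win probabilities from each model as plain lists and returns `[loss_prob, win_prob]` pairs in the order documented for `predict_proba()`; the function name and signature are hypothetical and it does not call XGBoost or LightGBM.

```python
def blend_probabilities(xgb_win_probs, lgbm_win_probs, xgb_weight=0.5):
    """Weighted average of two models' win probabilities, one row per game."""
    if len(xgb_win_probs) != len(lgbm_win_probs):
        raise ValueError("both models must score the same games")
    rows = []
    for p_xgb, p_lgbm in zip(xgb_win_probs, lgbm_win_probs):
        win = xgb_weight * p_xgb + (1.0 - xgb_weight) * p_lgbm
        rows.append([1.0 - win, win])  # [loss_prob, win_prob]
    return rows
```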
| --- | |
| ### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training | |
| **AutoTrainer** (Singleton scheduler): | |
| - Runs background loop checking for tasks | |
| - Ingests completed games every hour | |
| - Smart retraining: only after ALL daily games complete | |
| - If new accuracy < old accuracy, reverts model | |
| **ContinuousLearner** (Update workflow): | |
| ``` | |
| ingest_completed_games() ──► update_features() ──► retrain_model() | |
| ``` | |
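The two AutoTrainer rules described above (retrain only once every game of the day is final, and revert if accuracy drops) can be expressed as two small checks. Both helpers are hypothetical sketches, including the `status` key and the `(model, accuracy)` pair convention.

```python
def all_games_final(games):
    """True only when there is at least one game and every game is FINAL."""
    return bool(games) and all(g["status"] == "FINAL" for g in games)


def select_model(current, candidate):
    """Each argument is a (model, accuracy) pair; return the pair to keep.

    The candidate is discarded only when its accuracy is strictly lower.
    """
    return current if candidate[1] < current[1] else candidate
```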
| --- | |
| ## Database & Storage | |
| ### ChromaDB Cloud | |
| - **Purpose**: Persistent prediction storage | |
| - **Credentials**: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`) | |
| - **Fallback**: `data/processed/predictions_local.json` | |
| ### Parquet Files | |
| - `data/api_data/*.parquet` - Cached API responses | |
| - `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games) | |
| ### Joblib Files | |
| - `models/game_predictor.joblib` - Trained ML model | |
| - `data/processed/game_dataset.joblib` - Processed training data | |
| --- | |
| ## Frontend Architecture | |
| **React + Vite with custom CSS design system.** | |
| **Pages:** | |
| | Page | File | Purpose | | |
| |------|------|---------| | |
| | Live Games | `LiveGames.jsx` | Today's games, live scores, predictions | | |
| | Predictions | `Predictions.jsx` | Upcoming games with predictions | | |
| | Head to Head | `HeadToHead.jsx` | Compare two teams | | |
| | Accuracy | `Accuracy.jsx` | Model performance stats | | |
| | MVP Race | `MvpRace.jsx` | Current MVP standings | | |
| | Championship | `Championship.jsx` | Championship odds | | |
| **Key Frontend Components:** | |
| - `TeamLogo.jsx` - Official NBA team logos | |
| - `api.js` - API client with base URL handling | |
| - `index.css` - Complete design system (27KB) | |
| --- | |
| ## Configuration (`src/config.py`) | |
| **Critical Settings:** | |
| ```python | |
| # NBA Teams mapping (team_id -> tricode) | |
| NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...} | |
| # Data paths | |
| API_CACHE_DIR = Path("data/api_data") | |
| PROCESSED_DATA_DIR = Path("data/processed") | |
| MODELS_DIR = Path("models") | |
| # Feature engineering | |
| FEATURE_CONFIG = { | |
| "rolling_windows": [5, 10, 20], | |
| "min_games_for_features": 5 | |
| } | |
| # ELO system | |
| ELO_CONFIG = { | |
| "initial_rating": 1500, | |
| "k_factor": 20, | |
| "home_advantage": 100 | |
| } | |
| ``` | |
| --- | |
| ## ⚠️ Known Issues & Technical Debt | |
| 1. **ML Model Not Used**: `predict_game()` uses formula, not trained `GamePredictor` | |
| 2. **Season Hardcoding**: Some places use `2025-26` explicitly | |
| 3. **Fallback Data**: Pipeline has hardcoded rosters as backup | |
| 4. **Function Order**: `warm_starter_cache()` must be defined before scheduler calls it | |
| --- | |
| ## Deployment Notes | |
| **Hugging Face Spaces:** | |
| - Uses persistent `/data` directory for storage | |
| - Dockerfile copies `models/` and `data/api_data/` | |
| - Git LFS for large files (`.joblib`, `.parquet`) | |
| - Port 7860 for HF Spaces | |
| **Environment Variables:** | |
| ``` | |
| CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB | |
| NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths | |
| ``` | |
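A sketch of how these variables might be read is shown below. The variable names come from the list above; the helper names, the fallback behavior, and the idea that missing ChromaDB credentials trigger the local JSON fallback are assumptions for illustration.

```python
import os
from pathlib import Path


def resolve_dir(env_var, default, environ=os.environ):
    """Use the override from the environment if set, else the default path."""
    return Path(environ.get(env_var) or default)


def chroma_settings(environ=os.environ):
    """Return the ChromaDB credentials, or None if any of them is missing."""
    keys = ("CHROMA_TENANT", "CHROMA_DATABASE", "CHROMA_API_KEY")
    values = {key: environ.get(key) for key in keys}
    return values if all(values.values()) else None
```

For example, `resolve_dir("NBA_ML_MODELS_DIR", "models")` would return `Path("models")` locally, and whatever path the variable names when it is set on the deployment host.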
| --- | |
| ## Quick Reference: Common Tasks | |
| **Add new API endpoint:** | |
| 1. Add route in `server.py` (production) AND `api/api.py` (development) | |
| 2. Add frontend call in `web/src/api.js` | |
| 3. Create/update page component in `web/src/pages/` | |
| **Modify prediction algorithm:** | |
| 1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py` | |
| 2. Consider blending with `GamePredictor` model | |
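If blending is pursued, one simple option is a linear mix of the formula's home-win probability with the model's, falling back to the formula alone when the model is unavailable. The function below is a hypothetical sketch of that idea; the weight is a tuning choice and nothing like it exists in the codebase as described here.

```python
def blended_home_win_prob(formula_prob, model_prob=None, model_weight=0.5):
    """Linear blend of formula and model probabilities for the home team."""
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must be between 0 and 1")
    if model_prob is None:  # model not loaded or features unavailable
        return formula_prob
    return model_weight * model_prob + (1.0 - model_weight) * formula_prob
```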
| **Update ML model:** | |
| 1. Retrain via `ContinuousLearner.retrain_model()` | |
| 2. Or trigger via `POST /api/admin/retrain` | |
| **Add new feature:** | |
| 1. Add to `FeatureGenerator` in `feature_engineering.py` | |
| 2. Update preprocessing pipeline | |
| 3. Retrain model | |
| --- | |
| *Last updated: January 2026* | |