# NBA Sage - Complete Codebase Context for AI Assistants
> **Purpose**: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.
---
## Project Architecture Overview
```
NBA ML/
├── server.py                      # Main production server (Flask + React)
├── api/api.py                     # Development API server
│
├── src/                           # Core Python modules
│   ├── prediction_pipeline.py     # Main prediction orchestrator
│   ├── feature_engineering.py     # ELO + feature generation
│   ├── data_collector.py          # Historical NBA API data
│   ├── live_data_collector.py     # Real-time game data
│   ├── injury_collector.py        # Player injury tracking
│   ├── prediction_tracker.py      # ChromaDB prediction storage
│   ├── auto_trainer.py            # Automated training scheduler
│   ├── continuous_learner.py      # Incremental model updates
│   ├── preprocessing.py           # Data preprocessing
│   ├── config.py                  # Global configuration
│   └── models/
│       ├── game_predictor.py      # XGBoost + LightGBM ensemble
│       ├── mvp_predictor.py       # MVP prediction model
│       └── championship_predictor.py
│
├── web/                           # React Frontend
│   └── src/
│       ├── App.jsx                # Main app with sidebar navigation
│       ├── pages/                 # LiveGames, Predictions, MVP, etc.
│       ├── api.js                 # API client
│       └── index.css              # Comprehensive CSS design system
│
├── data/
│   ├── api_data/                  # Cached NBA API responses (parquet)
│   ├── processed/                 # Processed datasets (joblib)
│   └── raw/                       # Raw game data
│
└── models/
    └── game_predictor.joblib      # Trained ML model
```
---
## Data Flow
```
NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                                            │
                                                            ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                        │
                                        ▼
                        Prediction Tracker (ChromaDB)
```
---
## Core Components Deep Dive
### 1. `server.py` - Production Server (39KB, 929 lines)
**Critical for Hugging Face deployment. Combines Flask API + React static serving.**
**Key Sections:**
- **Cache Configuration (lines 30-40)**: In-memory caching for rosters, predictions, live games
- **Startup Cache Warming (lines 140-225)**: `warm_starter_cache()` fetches all 30 team rosters on startup
- **Background Scheduler (lines 340-370)**: APScheduler jobs for ELO updates, retraining, prediction sync
- **API Endpoints (lines 400-860)**: All REST endpoints for frontend
**Important Functions:**
```python
warm_starter_cache() # Fetches real NBA API data for all teams
startup_cache_warming() # Runs synchronously on server start
auto_retrain_model() # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings() # Daily ELO recalculation
```
**Endpoints:**
- `GET /api/live-games` - Today's games with predictions
- `GET /api/roster/<team>` - Team's projected starting 5
- `GET /api/accuracy` - Model accuracy statistics
- `GET /api/mvp` - MVP race standings
- `GET /api/championship` - Championship odds
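The in-memory caching mentioned above can be sketched as a simple TTL store. A minimal illustration only; the class and key names here are assumptions, not the actual `server.py` implementation:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

# Rosters change rarely, so a long TTL is reasonable (sample data below)
roster_cache = TTLCache(ttl_seconds=3600)
roster_cache.set("BOS", ["PG", "SG", "SF", "PF", "C"])
```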
---
### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)
**The heart of the system. Orchestrates all predictions.**
**Key Properties:**
```python
self.live_collector # LiveDataCollector instance
self.injury_collector # InjuryCollector instance
self.feature_gen # FeatureGenerator instance
self.tracker # PredictionTracker (ChromaDB)
self._game_model # Lazy-loaded GamePredictor
```
**Important Methods:**
| Method | Purpose |
|--------|---------|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |
**⚠️ CRITICAL: Prediction Algorithm (lines 349-504)**
The `predict_game()` method uses a **formula-based approach**, NOT the trained ML model:
```python
# Weights in predict_game():
home_talent = (
0.40 * home_win_pct + # Current season record
0.30 * home_form + # Last 10 games
0.20 * home_elo_strength + # Historical ELO
0.10 * 0.5 # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability
```
The trained `GamePredictor` model exists but is NOT called for live predictions.
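Putting the weights and Log5 step together, the formula-based prediction can be sketched as below. The input values are made up, and the per-injury-point deduction is omitted for brevity:

```python
def talent_score(win_pct, form, elo_strength):
    # Weighted blend matching the predict_game() weights above
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5

def log5(p_home, p_away):
    # Log5 head-to-head formula: probability that home beats away
    num = p_home * (1 - p_away)
    return num / (num + p_away * (1 - p_home))

# Illustrative inputs: a strong home team vs. an average road team
home = talent_score(win_pct=0.65, form=0.70, elo_strength=0.60) + 0.035  # +3.5% home court
away = talent_score(win_pct=0.50, form=0.40, elo_strength=0.55)
p_home_win = log5(home, away)
```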
---
### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)
**Contains ELO system and all feature generation logic.**
**Classes:**
| Class | Purpose | Key Methods |
|-------|---------|-------------|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |
**ELO Configuration:**
```python
initial_rating = 1500
k_factor = 20
home_advantage = 100 # ELO points for home court
regression_factor = 0.25 # Season regression to mean
```
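With that configuration, a single-game rating update follows the standard ELO scheme. A sketch of the math only; the actual `ELOCalculator` internals may differ:

```python
def expected_score(rating_home, rating_away, home_advantage=100):
    # Logistic ELO expectation; home team gets a +100 rating bonus
    return 1 / (1 + 10 ** ((rating_away - (rating_home + home_advantage)) / 400))

def update_ratings(home_elo, away_elo, home_won, k=20):
    # Zero-sum update: winner gains what the loser gives up
    exp_home = expected_score(home_elo, away_elo)
    actual = 1.0 if home_won else 0.0
    delta = k * (actual - exp_home)
    return home_elo + delta, away_elo - delta

new_home, new_away = update_ratings(1500, 1500, home_won=True)
```

At equal ratings the home side is already favored (~64% expected score), so a home win moves the ratings only about 7 points.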
**Feature Types Generated:**
- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
- Rolling averages (5, 10, 20 game windows)
- Rest days, back-to-back detection
- Season record features
- Head-to-head history
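The rolling-average features amount to trailing means over the configured windows. A pure-Python illustration; the real `FeatureGenerator` presumably operates on DataFrames, and the skip-short-history behavior mirrors `min_games_for_features`:

```python
def rolling_means(values, windows=(5, 10, 20)):
    # Trailing average over the most recent N games, skipping any window
    # larger than the available history
    return {n: sum(values[-n:]) / n for n in windows if len(values) >= n}

# 10 games of history: only the 5- and 10-game windows are computable
points = [102, 110, 98, 115, 107, 99, 111, 104, 120, 95]
features = rolling_means(points)
```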
---
### 4. `data_collector.py` - Historical Data (27KB, 650 lines)
**Collects comprehensive NBA data from official API.**
**Classes:**
| Class | Data Collected |
|-------|---------------|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |
**Key Features:**
- Exponential backoff retry for rate limiting
- Per-season parquet caching
- Checkpoint system for resumable collection
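The exponential backoff behavior can be sketched as a generic retry helper. The delays, jitter, and exception handling here are illustrative, not the collector's actual values:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    # Retry a rate-limited call, doubling the delay each attempt
    # and adding jitter so parallel collectors don't retry in lockstep
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```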
---
### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)
**Uses `nba_api.live` endpoints for real-time game data.**
**Key Methods:**
```python
get_live_scoreboard() # Today's games with live scores
get_game_boxscore(id) # Detailed box score
get_games_by_status() # Filter: NOT_STARTED, IN_PROGRESS, FINAL
```
**Data Fields Returned:**
- game_id, game_code
- home_team, away_team (tricodes)
- home_score, away_score
- period, clock
- status
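Filtering the scoreboard payload by status is a simple predicate over those returned fields (the sample games below are made up):

```python
def games_by_status(games, status):
    # Keep only games in the requested state:
    # NOT_STARTED, IN_PROGRESS, or FINAL
    return [g for g in games if g["status"] == status]

games = [
    {"game_id": "001", "home_team": "BOS", "away_team": "NYK", "status": "FINAL"},
    {"game_id": "002", "home_team": "LAL", "away_team": "GSW", "status": "IN_PROGRESS"},
]
live = games_by_status(games, "IN_PROGRESS")
```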
---
### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines)
**Stores predictions and tracks accuracy using ChromaDB Cloud.**
**Features:**
- ChromaDB Cloud integration (with local JSON fallback)
- Prediction storage before games start
- Result updating after games complete
- Comprehensive accuracy statistics
**Key Methods:**
```python
save_prediction(game_id, prediction) # Store pre-game prediction
update_result(game_id, winner, scores) # Update with final result
get_accuracy_stats() # Overall, by confidence, by team
get_pending_predictions() # Awaiting results
```
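The accuracy breakdown can be sketched as an aggregation over stored predictions. The 0.65 high-confidence threshold and record shape below are assumptions, not necessarily the tracker's actual bucketing:

```python
def accuracy_stats(predictions):
    # Overall accuracy plus a per-confidence-bucket breakdown;
    # predictions without a result yet (correct=None) are excluded
    resolved = [p for p in predictions if p.get("correct") is not None]
    if not resolved:
        return {"overall": None, "by_confidence": {}}
    overall = sum(p["correct"] for p in resolved) / len(resolved)
    buckets = {}
    for p in resolved:
        bucket = "high" if p["confidence"] >= 0.65 else "low"
        buckets.setdefault(bucket, []).append(p["correct"])
    by_conf = {b: sum(v) / len(v) for b, v in buckets.items()}
    return {"overall": overall, "by_confidence": by_conf}

preds = [
    {"confidence": 0.70, "correct": True},
    {"confidence": 0.70, "correct": False},
    {"confidence": 0.55, "correct": True},
    {"confidence": 0.60, "correct": None},  # still pending
]
stats = accuracy_stats(preds)
```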
---
### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines)
**XGBoost + LightGBM ensemble classifier.**
**Architecture:**
```
Input Features ──┬──► XGBoost ──┐
                 │              ├──► Weighted Average ──► Win Probability
                 └──► LightGBM ─┘
                     (50/50 weight)
```
**Key Methods:**
```python
train(X_train, y_train, X_val, y_val) # Train both models
predict_proba(X) # Get [loss_prob, win_prob]
predict_with_confidence(X) # Detailed prediction info
explain_prediction(X) # Feature importance for prediction
save() / load() # Persist to models/game_predictor.joblib
```
**⚠️ NOTE: Model exists but `predict_game()` doesn't use it!**
---
### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training
**AutoTrainer** (Singleton scheduler):
- Runs background loop checking for tasks
- Ingests completed games every hour
- Smart retraining: only after ALL daily games complete
- If new accuracy < old accuracy, reverts model
**ContinuousLearner** (Update workflow):
```
ingest_completed_games() ──► update_features() ──► retrain_model()
```
---
## Database & Storage
### ChromaDB Cloud
- **Purpose**: Persistent prediction storage
- **Credentials**: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
- **Fallback**: `data/processed/predictions_local.json`
### Parquet Files
- `data/api_data/*.parquet` - Cached API responses
- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)
### Joblib Files
- `models/game_predictor.joblib` - Trained ML model
- `data/processed/game_dataset.joblib` - Processed training data
---
## Frontend Architecture
**React + Vite with custom CSS design system.**
**Pages:**
| Page | File | Purpose |
|------|------|---------|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |
**Key Frontend Components:**
- `TeamLogo.jsx` - Official NBA team logos
- `api.js` - API client with base URL handling
- `index.css` - Complete design system (27KB)
---
## Configuration (`src/config.py`)
**Critical Settings:**
```python
# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}
# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")
# Feature engineering
FEATURE_CONFIG = {
"rolling_windows": [5, 10, 20],
"min_games_for_features": 5
}
# ELO system
ELO_CONFIG = {
"initial_rating": 1500,
"k_factor": 20,
"home_advantage": 100
}
```
---
## ⚠️ Known Issues & Technical Debt
1. **ML Model Not Used**: `predict_game()` uses formula, not trained `GamePredictor`
2. **Season Hardcoding**: Some places use `2025-26` explicitly
3. **Fallback Data**: Pipeline has hardcoded rosters as backup
4. **Function Order**: `warm_starter_cache()` must be defined before scheduler calls it
---
## Deployment Notes
**Hugging Face Spaces:**
- Uses persistent `/data` directory for storage
- Dockerfile copies `models/` and `data/api_data/`
- Git LFS for large files (`.joblib`, `.parquet`)
- Port 7860 for HF Spaces
**Environment Variables:**
```
CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths
```
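The path overrides follow the usual env-var-with-fallback pattern. A sketch only; `config.py` may resolve these differently:

```python
import os
from pathlib import Path

# Honor NBA_ML_DATA_DIR / NBA_ML_MODELS_DIR if set, otherwise fall back
# to the repo-relative defaults used in config.py
DATA_DIR = Path(os.environ.get("NBA_ML_DATA_DIR", "data"))
MODELS_DIR = Path(os.environ.get("NBA_ML_MODELS_DIR", "models"))
```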
---
## Quick Reference: Common Tasks
**Add new API endpoint:**
1. Add route in `server.py` (production) AND `api/api.py` (development)
2. Add frontend call in `web/src/api.js`
3. Create/update page component in `web/src/pages/`
**Modify prediction algorithm:**
1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py`
2. Consider blending with `GamePredictor` model
**Update ML model:**
1. Retrain via `ContinuousLearner.retrain_model()`
2. Or trigger via `POST /api/admin/retrain`
**Add new feature:**
1. Add to `FeatureGenerator` in `feature_engineering.py`
2. Update preprocessing pipeline
3. Retrain model
---
*Last updated: January 2026*
|