# NBA Sage - Complete Codebase Context for AI Assistants

> **Purpose**: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.

---

## 🏗️ Project Architecture Overview

```
NBA ML/
├── server.py              # Main production server (Flask + React)
├── api/api.py             # Development API server
│
├── src/                   # Core Python modules
│   ├── prediction_pipeline.py   # Main prediction orchestrator
│   ├── feature_engineering.py   # ELO + feature generation
│   ├── data_collector.py        # Historical NBA API data
│   ├── live_data_collector.py   # Real-time game data
│   ├── injury_collector.py      # Player injury tracking
│   ├── prediction_tracker.py    # ChromaDB prediction storage
│   ├── auto_trainer.py          # Automated training scheduler
│   ├── continuous_learner.py    # Incremental model updates
│   ├── preprocessing.py         # Data preprocessing
│   ├── config.py                # Global configuration
│   └── models/
│       ├── game_predictor.py    # XGBoost+LightGBM ensemble
│       ├── mvp_predictor.py     # MVP prediction model
│       └── championship_predictor.py
│
├── web/                   # React Frontend
│   └── src/
│       ├── App.jsx        # Main app with sidebar navigation
│       ├── pages/         # LiveGames, Predictions, MVP, etc.
│       ├── api.js         # API client
│       └── index.css      # Comprehensive CSS design system
│
├── data/
│   ├── api_data/          # Cached NBA API responses (parquet)
│   ├── processed/         # Processed datasets (joblib)
│   └── raw/               # Raw game data
│
└── models/
    └── game_predictor.joblib  # Trained ML model
```

---

## 🔄 Data Flow

```
NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                       │
                                       ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                       │
                                       ▼
                              Prediction Tracker (ChromaDB)
```

---

## 📦 Core Components Deep Dive

### 1. `server.py` - Production Server (39KB, 929 lines)

**Critical for Hugging Face deployment. Combines Flask API + React static serving.**

**Key Sections:**
- **Cache Configuration (lines 30-40)**: In-memory caching for rosters, predictions, live games
- **Startup Cache Warming (lines 140-225)**: `warm_starter_cache()` fetches all 30 team rosters on startup
- **Background Scheduler (lines 340-370)**: APScheduler jobs for ELO updates, retraining, prediction sync
- **API Endpoints (lines 400-860)**: All REST endpoints for frontend

**Important Functions:**
```python
warm_starter_cache()      # Fetches real NBA API data for all teams
startup_cache_warming()   # Runs synchronously on server start
auto_retrain_model()      # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings()      # Daily ELO recalculation
```
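
The in-memory caching listed under Key Sections could look roughly like the sketch below. This is an illustration only: the `TTLCache` class, its expiry behaviour, and the `roster:BOS` key format are assumptions, not the actual structures in `server.py`.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Remember when the value was stored alongside the value itself.
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # expired: drop the entry and report a miss
            return None
        return value


if __name__ == "__main__":
    roster_cache = TTLCache(ttl_seconds=3600)
    roster_cache.set("roster:BOS", ["player 1", "player 2"])
    print(roster_cache.get("roster:BOS"))
```

A miss (`None`) would be the signal to call the NBA API again and repopulate the entry.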

**Endpoints:**
- `GET /api/live-games` - Today's games with predictions
- `GET /api/roster/<team>` - Team's projected starting 5
- `GET /api/accuracy` - Model accuracy statistics
- `GET /api/mvp` - MVP race standings
- `GET /api/championship` - Championship odds

---

### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)

**The heart of the system. Orchestrates all predictions.**

**Key Properties:**
```python
self.live_collector      # LiveDataCollector instance
self.injury_collector    # InjuryCollector instance
self.feature_gen         # FeatureGenerator instance
self.tracker             # PredictionTracker (ChromaDB)
self._game_model         # Lazy-loaded GamePredictor
```

**Important Methods:**

| Method | Purpose |
|--------|---------|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |

**⚠️ CRITICAL: Prediction Algorithm (lines 349-504)**

The `predict_game()` method uses a **formula-based approach**, NOT the trained ML model:

```python
# Weights in predict_game():
home_talent = (
    0.40 * home_win_pct +      # Current season record
    0.30 * home_form +          # Last 10 games
    0.20 * home_elo_strength +  # Historical ELO
    0.10 * 0.5                  # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability
```

The trained `GamePredictor` model exists but is NOT called for live predictions.
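
A minimal sketch of how the listed weights and the Log5 formula can combine is shown below. The weights and the Log5 formula come from the description above; the way the home-court bonus and injury penalty are applied (added to the talent score, then clamped) is an assumption, and the function names are hypothetical rather than taken from `prediction_pipeline.py`.

```python
def talent_score(win_pct: float, form: float, elo_strength: float) -> float:
    """Blend the three inputs (each in [0, 1]) with the documented weights."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate of the probability that team A beats team B."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:  # both teams at 0 or both at 1: no information
        return 0.5
    return (p_a - p_a * p_b) / denom


def clamp(value: float, low: float = 0.01, high: float = 0.99) -> float:
    return max(low, min(high, value))


def home_win_probability(home: tuple, away: tuple,
                         home_injury_points: float = 0.0,
                         away_injury_points: float = 0.0) -> float:
    """home/away are (win_pct, form, elo_strength) tuples.

    Assumption: +3.5% home court and -2% per injury point are applied
    additively to the talent scores before Log5.
    """
    home_talent = talent_score(*home) + 0.035 - 0.02 * home_injury_points
    away_talent = talent_score(*away) - 0.02 * away_injury_points
    return log5(clamp(home_talent), clamp(away_talent))


if __name__ == "__main__":
    p = home_win_probability((0.65, 0.70, 0.60), (0.50, 0.40, 0.55),
                             away_injury_points=1.0)
    print(f"home win probability: {p:.3f}")
```

Log5 has the convenient property that a team facing a 0.5-strength opponent keeps its own strength as its win probability, which makes the adjustments easy to reason about.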

---

### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)

**Contains ELO system and all feature generation logic.**

**Classes:**

| Class | Purpose | Key Methods |
|-------|---------|-------------|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |

**ELO Configuration:**
```python
initial_rating = 1500
k_factor = 20
home_advantage = 100  # ELO points for home court
regression_factor = 0.25  # Season regression to mean
```
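
To show how these four values interact, here is a sketch using the textbook Elo formulas. It is not copied from `ELOCalculator`: the function names are hypothetical, and treating `regression_factor` as a pull toward `initial_rating` between seasons is an assumption.

```python
INITIAL_RATING = 1500.0
K_FACTOR = 20.0
HOME_ADVANTAGE = 100.0      # Elo points added to the home team
REGRESSION_FACTOR = 0.25    # assumed: fraction pulled back to the mean


def expected_home_win(home_elo: float, away_elo: float) -> float:
    """Standard Elo expectation with the home bonus applied."""
    diff = away_elo - (home_elo + HOME_ADVANTAGE)
    return 1.0 / (1.0 + 10 ** (diff / 400.0))


def update_elo(home_elo: float, away_elo: float, home_won: bool):
    """Return (new_home_elo, new_away_elo) after one game (zero-sum)."""
    expected = expected_home_win(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(rating: float) -> float:
    """Move a rating part of the way back to the initial rating."""
    return rating + REGRESSION_FACTOR * (INITIAL_RATING - rating)


if __name__ == "__main__":
    print(round(expected_home_win(1500, 1500), 3))
    print(update_elo(1500, 1500, home_won=False))
    print(regress_to_mean(1600))
```

With these settings, two equally rated teams give the home side roughly a 64% expectation, so a home loss costs more rating points than a home win earns.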

**Feature Types Generated:**
- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
- Rolling averages (5, 10, 20 game windows)
- Rest days, back-to-back detection
- Season record features
- Head-to-head history
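
The rolling-average and rest-day features can be sketched in plain Python as follows. This is an illustration under assumptions: the helper names are hypothetical, and using only games strictly before the current one (to avoid leaking the result being predicted) is a common convention, not something confirmed for `FeatureGenerator`.

```python
from datetime import date
from typing import List, Optional, Sequence


def rolling_mean_before(values: Sequence[float], window: int,
                        min_games: int = 5) -> List[Optional[float]]:
    """For game i, average up to `window` games strictly before i.

    Returns None until at least `min_games` prior games exist.
    """
    out: List[Optional[float]] = []
    for i in range(len(values)):
        prior = values[max(0, i - window):i]
        out.append(sum(prior) / len(prior) if len(prior) >= min_games else None)
    return out


def rest_days(game_dates: Sequence[date]) -> List[Optional[int]]:
    """Full days off before each game; None for the first game."""
    if not game_dates:
        return []
    out: List[Optional[int]] = [None]
    for prev, cur in zip(game_dates, game_dates[1:]):
        out.append((cur - prev).days - 1)
    return out


def is_back_to_back(rest: Optional[int]) -> bool:
    """A game played the day after the previous one has zero rest days."""
    return rest == 0


if __name__ == "__main__":
    points = [110, 98, 120, 105, 101, 117, 99]
    for window in (5, 10, 20):
        print(window, rolling_mean_before(points, window))
    dates = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 5)]
    print([is_back_to_back(r) for r in rest_days(dates)])
```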

---

### 4. `data_collector.py` - Historical Data (27KB, 650 lines)

**Collects comprehensive NBA data from official API.**

**Classes:**
| Class | Data Collected |
|-------|---------------|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |

**Key Features:**
- Exponential backoff retry for rate limiting
- Per-season parquet caching
- Checkpoint system for resumable collection
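
Exponential backoff for rate-limited requests generally follows the pattern below. This is a generic sketch, not the collector's actual code: `retry_with_backoff`, its default delays, and the small random jitter are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fetch: Callable[[], T],
                       max_attempts: int = 5,
                       base_delay: float = 1.0,
                       max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(); on failure wait base_delay * 2**attempt and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_fetch() -> str:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("simulated rate limit")
        return "season data"

    print(retry_with_backoff(flaky_fetch, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.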

---

### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)

**Uses `nba_api.live` endpoints for real-time game data.**

**Key Methods:**
```python
get_live_scoreboard()  # Today's games with live scores
get_game_boxscore(id)  # Detailed box score
get_games_by_status()  # Filter: NOT_STARTED, IN_PROGRESS, FINAL
```

**Data Fields Returned:**
- game_id, game_code
- home_team, away_team (tricodes)
- home_score, away_score
- period, clock
- status

---

### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines)

**Stores predictions and tracks accuracy using ChromaDB Cloud.**

**Features:**
- ChromaDB Cloud integration (with local JSON fallback)
- Prediction storage before games start
- Result updating after games complete
- Comprehensive accuracy statistics

**Key Methods:**
```python
save_prediction(game_id, prediction)  # Store pre-game prediction
update_result(game_id, winner, scores)  # Update with final result
get_accuracy_stats()  # Overall, by confidence, by team
get_pending_predictions()  # Awaiting results
```
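
The local JSON fallback can be pictured with the sketch below. It mirrors the method names listed above, but the `LocalPredictionStore` class, the record layout (`predicted_winner`, `actual_winner`), and the shape of the stats dictionary are assumptions, not the real `PredictionTracker` schema.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


class LocalPredictionStore:
    """Hypothetical JSON-file store with the tracker's method names."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self._records: Dict[str, Dict[str, Any]] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def _flush(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(self._records, indent=2))

    def save_prediction(self, game_id: str, prediction: Dict[str, Any]) -> None:
        # Stored before tip-off; the result is filled in later.
        self._records[game_id] = {**prediction, "actual_winner": None, "scores": None}
        self._flush()

    def update_result(self, game_id: str, winner: str,
                      scores: Optional[Dict[str, int]] = None) -> bool:
        record = self._records.get(game_id)
        if record is None:
            return False  # nothing was predicted for this game
        record["actual_winner"] = winner
        record["scores"] = scores
        self._flush()
        return True

    def get_pending_predictions(self) -> List[str]:
        return [gid for gid, r in self._records.items()
                if r["actual_winner"] is None]

    def get_accuracy_stats(self) -> Dict[str, Any]:
        graded = [r for r in self._records.values()
                  if r["actual_winner"] is not None]
        correct = sum(r["predicted_winner"] == r["actual_winner"] for r in graded)
        return {
            "total": len(graded),
            "correct": correct,
            "accuracy": correct / len(graded) if graded else None,
        }
```

Writing the whole file on every change is simple but only suits small prediction volumes.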

---

### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines)

**XGBoost + LightGBM ensemble classifier.**

**Architecture:**
```
Input Features ──┬──► XGBoost ──┐
                 │              │──► Weighted Average ──► Win Probability
                 └──► LightGBM ─┘
                      (50/50 weight)
```

**Key Methods:**
```python
train(X_train, y_train, X_val, y_val)  # Train both models
predict_proba(X)  # Get [loss_prob, win_prob]
predict_with_confidence(X)  # Detailed prediction info
explain_prediction(X)  # Feature importance for prediction
save() / load()  # Persist to models/game_predictor.joblib
```
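
The weighted-average step in the diagram reduces to a few lines. The sketch below works on plain lists of win probabilities so it runs without either library; the function names are hypothetical, and how `GamePredictor` actually combines the two models internally is assumed from the "50/50 weight" label.

```python
from typing import List, Sequence


def blend_win_probabilities(xgb_probs: Sequence[float],
                            lgbm_probs: Sequence[float],
                            xgb_weight: float = 0.5) -> List[float]:
    """Weighted average of two models' win probabilities, row by row."""
    if len(xgb_probs) != len(lgbm_probs):
        raise ValueError("both models must score the same number of games")
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be between 0 and 1")
    lgbm_weight = 1.0 - xgb_weight
    return [xgb_weight * a + lgbm_weight * b
            for a, b in zip(xgb_probs, lgbm_probs)]


def to_proba_rows(win_probs: Sequence[float]) -> List[List[float]]:
    """Format as [loss_prob, win_prob] rows, like predict_proba output."""
    return [[1.0 - p, p] for p in win_probs]


if __name__ == "__main__":
    blended = blend_win_probabilities([0.62, 0.41], [0.58, 0.47])
    print(to_proba_rows(blended))
```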

**⚠️ NOTE: Model exists but `predict_game()` doesn't use it!**

---

### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training

**AutoTrainer** (Singleton scheduler):
- Runs background loop checking for tasks
- Ingests completed games every hour
- Smart retraining: only after ALL daily games complete
- Reverts to the previous model if the retrained model's accuracy is lower

**ContinuousLearner** (Update workflow):
```
ingest_completed_games() ──► update_features() ──► retrain_model()
```
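
The two AutoTrainer rules (retrain only once every game is final, and revert when accuracy drops) can be sketched as below. The function names, the `.bak` backup file, and the `status` key are assumptions for illustration; `train_and_save` and `evaluate` stand in for whatever the real training and evaluation code does.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Sequence


def all_games_final(games: Sequence[Dict[str, str]]) -> bool:
    """True only when there is at least one game and every game is FINAL."""
    return bool(games) and all(g["status"] == "FINAL" for g in games)


def retrain_with_rollback(model_path: Path,
                          train_and_save: Callable[[Path], object],
                          evaluate: Callable[[Path], float]) -> bool:
    """Retrain in place; restore the old file if accuracy got worse.

    Returns True when the new model is kept, False when it was reverted.
    """
    backup = model_path.with_name(model_path.name + ".bak")
    old_accuracy = evaluate(model_path)
    shutil.copy2(model_path, backup)      # keep the current model safe
    train_and_save(model_path)            # overwrites model_path
    new_accuracy = evaluate(model_path)
    if new_accuracy < old_accuracy:
        shutil.copy2(backup, model_path)  # revert to the previous model
        return False
    return True
```

Evaluating both models on the same held-out games matters here; otherwise the comparison can flip on noise alone.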

---

## 🗄️ Database & Storage

### ChromaDB Cloud
- **Purpose**: Persistent prediction storage
- **Credentials**: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
- **Fallback**: `data/processed/predictions_local.json`

### Parquet Files
- `data/api_data/*.parquet` - Cached API responses
- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)

### Joblib Files
- `models/game_predictor.joblib` - Trained ML model
- `data/processed/game_dataset.joblib` - Processed training data

---

## 🌐 Frontend Architecture

**React + Vite with custom CSS design system.**

**Pages:**
| Page | File | Purpose |
|------|------|---------|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |

**Key Frontend Components:**
- `TeamLogo.jsx` - Official NBA team logos
- `api.js` - API client with base URL handling
- `index.css` - Complete design system (27KB)

---

## 🔧 Configuration (`src/config.py`)

**Critical Settings:**
```python
# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}

# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")

# Feature engineering
FEATURE_CONFIG = {
    "rolling_windows": [5, 10, 20],
    "min_games_for_features": 5
}

# ELO system
ELO_CONFIG = {
    "initial_rating": 1500,
    "k_factor": 20,
    "home_advantage": 100
}
```

---

## ⚠️ Known Issues & Technical Debt

1. **ML Model Not Used**: `predict_game()` computes probabilities from a formula and never calls the trained `GamePredictor`
2. **Season Hardcoding**: Some code paths hardcode the season string `2025-26`
3. **Fallback Data**: The prediction pipeline keeps hardcoded rosters as a backup
4. **Function Order**: `warm_starter_cache()` must be defined before the scheduler code that calls it

---

## 🚀 Deployment Notes

**Hugging Face Spaces:**
- Uses persistent `/data` directory for storage
- Dockerfile copies `models/` and `data/api_data/`
- Git LFS for large files (`.joblib`, `.parquet`)
- Port 7860 for HF Spaces

**Environment Variables:**
```
CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY  # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR  # Override paths
```
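
One way to read these variables is sketched below. The variable names come from the list above; the helper names, the default directories, and the rule that ChromaDB is used only when all three credentials are present are assumptions about how `config.py` and the tracker behave.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def resolve_paths(env: Mapping[str, str] = os.environ) -> Dict[str, Path]:
    """Apply NBA_ML_DATA_DIR / NBA_ML_MODELS_DIR overrides to default paths."""
    data_dir = Path(env.get("NBA_ML_DATA_DIR", "data"))
    models_dir = Path(env.get("NBA_ML_MODELS_DIR", "models"))
    return {
        "api_cache": data_dir / "api_data",
        "processed": data_dir / "processed",
        "models": models_dir,
    }


def chroma_settings(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Return ChromaDB credentials, or None to signal the local JSON fallback."""
    keys = ("CHROMA_TENANT", "CHROMA_DATABASE", "CHROMA_API_KEY")
    if all(env.get(key) for key in keys):
        return {key: env[key] for key in keys}
    return None


if __name__ == "__main__":
    print(resolve_paths())
    print("ChromaDB configured:", chroma_settings() is not None)
```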

---

## 📋 Quick Reference: Common Tasks

**Add new API endpoint:**
1. Add route in `server.py` (production) AND `api/api.py` (development)
2. Add frontend call in `web/src/api.js`
3. Create/update page component in `web/src/pages/`

**Modify prediction algorithm:**
1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py`
2. Consider blending with `GamePredictor` model

**Update ML model:**
1. Retrain via `ContinuousLearner.retrain_model()`
2. Or trigger via `POST /api/admin/retrain`

**Add new feature:**
1. Add to `FeatureGenerator` in `feature_engineering.py`
2. Update preprocessing pipeline
3. Retrain model

---

*Last updated: January 2026*