# NBA Sage - Complete Codebase Context for AI Assistants
> **Purpose**: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.
---
## 🏗️ Project Architecture Overview
```
NBA ML/
├── server.py                    # Main production server (Flask + React)
├── api/api.py                   # Development API server
│
├── src/                         # Core Python modules
│   ├── prediction_pipeline.py   # Main prediction orchestrator
│   ├── feature_engineering.py   # ELO + feature generation
│   ├── data_collector.py        # Historical NBA API data
│   ├── live_data_collector.py   # Real-time game data
│   ├── injury_collector.py      # Player injury tracking
│   ├── prediction_tracker.py    # ChromaDB prediction storage
│   ├── auto_trainer.py          # Automated training scheduler
│   ├── continuous_learner.py    # Incremental model updates
│   ├── preprocessing.py         # Data preprocessing
│   ├── config.py                # Global configuration
│   └── models/
│       ├── game_predictor.py    # XGBoost+LightGBM ensemble
│       ├── mvp_predictor.py     # MVP prediction model
│       └── championship_predictor.py
│
├── web/                         # React Frontend
│   └── src/
│       ├── App.jsx              # Main app with sidebar navigation
│       ├── pages/               # LiveGames, Predictions, MVP, etc.
│       ├── api.js               # API client
│       └── index.css            # Comprehensive CSS design system
│
├── data/
│   ├── api_data/                # Cached NBA API responses (parquet)
│   ├── processed/               # Processed datasets (joblib)
│   └── raw/                     # Raw game data
│
└── models/
    └── game_predictor.joblib    # Trained ML model
```
---
## 🔄 Data Flow
```
NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                        │
                                        ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                        │
                                        ▼
                          Prediction Tracker (ChromaDB)
```
---
## 📦 Core Components Deep Dive
### 1. `server.py` - Production Server (39KB, 929 lines)
**Critical for Hugging Face deployment. Combines Flask API + React static serving.**
**Key Sections:**
- **Cache Configuration (lines 30-40)**: In-memory caching for rosters, predictions, live games
- **Startup Cache Warming (lines 140-225)**: `warm_starter_cache()` fetches all 30 team rosters on startup
- **Background Scheduler (lines 340-370)**: APScheduler jobs for ELO updates, retraining, prediction sync
- **API Endpoints (lines 400-860)**: All REST endpoints for frontend
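The in-memory caching listed above could look roughly like this. It is a sketch, not the code in `server.py`: the `TTLCache` class, its method names, and the expiry policy are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Remember when the value was stored so get() can judge its age.
        self._entries[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # stale: drop it and report a miss
            return None
        return value
```

A roster cache would then be something like `TTLCache(ttl_seconds=3600)` keyed by team tricode; the one-hour lifetime is only an example value.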
**Important Functions:**
```python
warm_starter_cache() # Fetches real NBA API data for all teams
startup_cache_warming() # Runs synchronously on server start
auto_retrain_model() # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings() # Daily ELO recalculation
```
**Endpoints:**
- `GET /api/live-games` - Today's games with predictions
- `GET /api/roster/<team>` - Team's projected starting 5
- `GET /api/accuracy` - Model accuracy statistics
- `GET /api/mvp` - MVP race standings
- `GET /api/championship` - Championship odds
---
### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)
**The heart of the system. Orchestrates all predictions.**
**Key Properties:**
```python
self.live_collector # LiveDataCollector instance
self.injury_collector # InjuryCollector instance
self.feature_gen # FeatureGenerator instance
self.tracker # PredictionTracker (ChromaDB)
self._game_model # Lazy-loaded GamePredictor
```
**Important Methods:**
| Method | Purpose |
|--------|---------|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |
**⚠️ CRITICAL: Prediction Algorithm (lines 349-504)**
The `predict_game()` method uses a **formula-based approach**, NOT the trained ML model:
```python
# Weights in predict_game():
home_talent = (
0.40 * home_win_pct + # Current season record
0.30 * home_form + # Last 10 games
0.20 * home_elo_strength + # Historical ELO
0.10 * 0.5 # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability
```
The trained `GamePredictor` model exists but is NOT called for live predictions.
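A minimal sketch of how the weights and the Log5 step fit together. The function names are hypothetical, and the way the home-court and injury adjustments are applied (added to the final probability, then clamped) is an assumption; check `predict_game()` for the actual order of operations.

```python
HOME_COURT_BONUS = 0.035   # +3.5% home court
INJURY_PENALTY = 0.02      # -2% per injury point


def talent_score(win_pct: float, form: float, elo_strength: float) -> float:
    """Blend the inputs with the documented weights; inputs assumed to be in [0, 1]."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate that a team of strength p_a beats a team of strength p_b."""
    denominator = p_a + p_b - 2 * p_a * p_b
    if denominator == 0:  # both strengths are 0, or both are 1
        return 0.5
    return (p_a - p_a * p_b) / denominator


def home_win_probability(home_talent: float, away_talent: float,
                         home_injury_points: float = 0.0,
                         away_injury_points: float = 0.0) -> float:
    """Assumed combination: Log5, then additive home-court and injury adjustments."""
    probability = log5(home_talent, away_talent)
    probability += HOME_COURT_BONUS
    probability -= INJURY_PENALTY * home_injury_points
    probability += INJURY_PENALTY * away_injury_points
    return min(0.99, max(0.01, probability))  # keep the result a valid probability
```

With two evenly matched, healthy teams the sketch reduces to the home-court bonus alone: `home_win_probability(0.5, 0.5)` is about 0.535.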
---
### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)
**Contains ELO system and all feature generation logic.**
**Classes:**
| Class | Purpose | Key Methods |
|-------|---------|-------------|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |
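To illustrate what per-season z-score normalization does, here is a small stand-in. The class name `SeasonZScore` and the method signatures are hypothetical; only the method names `fit_season()` and `transform()` come from the table above, and the real `EraNormalizer` may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


class SeasonZScore:
    """Hypothetical stand-in for EraNormalizer: z-scores relative to each season."""

    def __init__(self) -> None:
        self._stats: Dict[str, Tuple[float, float]] = {}

    def fit_season(self, season: str, values: List[float]) -> None:
        # Store the league mean and population standard deviation for the season.
        self._stats[season] = (mean(values), pstdev(values))

    def transform(self, season: str, value: float) -> float:
        mu, sigma = self._stats[season]
        if sigma == 0:  # every team had the same value that season
            return 0.0
        return (value - mu) / sigma
```

The point of the technique is that a stat is compared with its own season's league average, so a team scoring 110 points per game in a low-scoring era and one scoring 125 in a high-scoring era can receive the same normalized value.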
**ELO Configuration:**
```python
initial_rating = 1500
k_factor = 20
home_advantage = 100 # ELO points for home court
regression_factor = 0.25 # Season regression to mean
```
**Feature Types Generated:**
- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
- Rolling averages (5, 10, 20 game windows)
- Rest days, back-to-back detection
- Season record features
- Head-to-head history
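For orientation, the ELO settings above plug into the standard Elo formulas as sketched below. These standalone functions and their signatures are illustrative assumptions, not the `ELOCalculator` methods themselves, and the regression step is one common way to apply a `regression_factor`.

```python
INITIAL_RATING = 1500
K_FACTOR = 20
HOME_ADVANTAGE = 100       # ELO points added to the home team
REGRESSION_FACTOR = 0.25   # share of the gap to the mean removed each new season


def expected_home_win(home_elo: float, away_elo: float) -> float:
    """Standard Elo expectation, with the home team boosted by HOME_ADVANTAGE."""
    exponent = (away_elo - (home_elo + HOME_ADVANTAGE)) / 400.0
    return 1.0 / (1.0 + 10 ** exponent)


def update_ratings(home_elo: float, away_elo: float, home_won: bool):
    """Return (new_home_elo, new_away_elo) after one game; points are zero-sum."""
    expected = expected_home_win(home_elo, away_elo)
    delta = K_FACTOR * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(rating: float) -> float:
    """Pull a rating toward INITIAL_RATING between seasons (assumed form)."""
    return rating + REGRESSION_FACTOR * (INITIAL_RATING - rating)
```

A 1500-rated home team facing a 1600-rated visitor is an even matchup under these settings, because the 100-point home advantage cancels the rating gap.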
---
### 4. `data_collector.py` - Historical Data (27KB, 650 lines)
**Collects comprehensive NBA data from official API.**
**Classes:**
| Class | Data Collected |
|-------|---------------|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |
**Key Features:**
- Exponential backoff retry for rate limiting
- Per-season parquet caching
- Checkpoint system for resumable collection
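Exponential backoff for rate-limited calls can be sketched as follows. The helper name `call_with_backoff`, its parameters, and the jitter amount are assumptions; the collector's actual retry settings are not documented here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fetch: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(), retrying on any exception with delays of 1s, 2s, 4s, ... plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            # Up to 10% jitter so parallel collectors do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; production code would leave the default.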
---
### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)
**Uses `nba_api.live` endpoints for real-time game data.**
**Key Methods:**
```python
get_live_scoreboard() # Today's games with live scores
get_game_boxscore(id) # Detailed box score
get_games_by_status() # Filter: NOT_STARTED, IN_PROGRESS, FINAL
```
**Data Fields Returned:**
- game_id, game_code
- home_team, away_team (tricodes)
- home_score, away_score
- period, clock
- status
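As an illustration of the status filter, the sketch below groups game dictionaries by their `status` field. The function name `group_games_by_status` and the dictionary shape are assumptions based on the field list above, not the collector's actual return type.

```python
from typing import Dict, List

STATUSES = ("NOT_STARTED", "IN_PROGRESS", "FINAL")


def group_games_by_status(games: List[dict]) -> Dict[str, List[dict]]:
    """Bucket games by status; every known status is present even when empty."""
    grouped: Dict[str, List[dict]] = {status: [] for status in STATUSES}
    for game in games:
        # setdefault keeps unexpected status values instead of dropping them.
        grouped.setdefault(game["status"], []).append(game)
    return grouped
```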
---
### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines)
**Stores predictions and tracks accuracy using ChromaDB Cloud.**
**Features:**
- ChromaDB Cloud integration (with local JSON fallback)
- Prediction storage before games start
- Result updating after games complete
- Comprehensive accuracy statistics
**Key Methods:**
```python
save_prediction(game_id, prediction) # Store pre-game prediction
update_result(game_id, winner, scores) # Update with final result
get_accuracy_stats() # Overall, by confidence, by team
get_pending_predictions() # Awaiting results
```
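The save, update, and score lifecycle can be pictured with an in-memory stand-in. The `InMemoryTracker` class and the record fields (`predicted_winner`, `actual_winner`, `correct`) are hypothetical; the real tracker persists to ChromaDB or the local JSON fallback and reports richer statistics.

```python
from typing import Dict, List, Optional


class InMemoryTracker:
    """Illustrative stand-in for PredictionTracker that keeps records in a dict."""

    def __init__(self) -> None:
        self._records: Dict[str, dict] = {}

    def save_prediction(self, game_id: str, prediction: dict) -> None:
        # Store the pre-game prediction; the result is unknown at this point.
        self._records[game_id] = {**prediction, "actual_winner": None}

    def update_result(self, game_id: str, winner: str, scores: Optional[dict] = None) -> None:
        record = self._records[game_id]
        record["actual_winner"] = winner
        record["scores"] = scores
        record["correct"] = record["predicted_winner"] == winner

    def get_pending_predictions(self) -> List[str]:
        return [gid for gid, r in self._records.items() if r["actual_winner"] is None]

    def get_accuracy_stats(self) -> dict:
        done = [r for r in self._records.values() if r["actual_winner"] is not None]
        correct = sum(1 for r in done if r["correct"])
        accuracy = correct / len(done) if done else None
        return {"total": len(done), "correct": correct, "accuracy": accuracy}
```

Pending predictions are excluded from the accuracy figures, so accuracy only reflects games with final results.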
---
### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines)
**XGBoost + LightGBM ensemble classifier.**
**Architecture:**
```
Input Features ──┬──► XGBoost ───┐
                 │               │──► Weighted Average ──► Win Probability
                 └──► LightGBM ──┘
                      (50/50 weight)
```
**Key Methods:**
```python
train(X_train, y_train, X_val, y_val) # Train both models
predict_proba(X) # Get [loss_prob, win_prob]
predict_with_confidence(X) # Detailed prediction info
explain_prediction(X) # Feature importance for prediction
save() / load() # Persist to models/game_predictor.joblib
```
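The 50/50 weighted average amounts to the blend below. The function name and the `xgb_weight` parameter are hypothetical; the sketch only shows how two win probabilities become the `[loss_prob, win_prob]` pair.

```python
from typing import List


def blend_win_probability(xgb_win_prob: float, lgbm_win_prob: float,
                          xgb_weight: float = 0.5) -> List[float]:
    """Weighted average of two model outputs, returned as [loss_prob, win_prob]."""
    if not 0.0 <= xgb_weight <= 1.0:
        raise ValueError("xgb_weight must be between 0 and 1")
    win_prob = xgb_weight * xgb_win_prob + (1.0 - xgb_weight) * lgbm_win_prob
    return [1.0 - win_prob, win_prob]
```

For example, model outputs of 0.60 and 0.70 blend to a win probability of about 0.65.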
**⚠️ NOTE: Model exists but `predict_game()` doesn't use it!**
---
### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training
**AutoTrainer** (Singleton scheduler):
- Runs background loop checking for tasks
- Ingests completed games every hour
- Smart retraining: only after ALL daily games complete
- If new accuracy < old accuracy, reverts model
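The two gating rules above (wait for all games, revert on regression) can be written as small predicates. Both function names are hypothetical, and using the `FINAL` status string to detect completed games is an assumption carried over from the live collector's status values.

```python
from typing import Iterable


def all_games_final(statuses: Iterable[str]) -> bool:
    """True only when there is at least one game today and every game is FINAL."""
    statuses = list(statuses)
    return len(statuses) > 0 and all(status == "FINAL" for status in statuses)


def accept_or_revert(old_accuracy: float, new_accuracy: float) -> str:
    """Keep the retrained model unless it scores strictly worse than the old one."""
    return "revert" if new_accuracy < old_accuracy else "keep_new"
```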
**ContinuousLearner** (Update workflow):
```
ingest_completed_games() ──► update_features() ──► retrain_model()
```
---
## 🗄️ Database & Storage
### ChromaDB Cloud
- **Purpose**: Persistent prediction storage
- **Credentials**: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
- **Fallback**: `data/processed/predictions_local.json`
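One way to decide between ChromaDB Cloud and the local JSON fallback is to check that all three credentials are present. This is a sketch under that assumption; the function name and the returned labels are hypothetical, and the real tracker may also fall back when a connection attempt fails.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("CHROMA_TENANT", "CHROMA_DATABASE", "CHROMA_API_KEY")
LOCAL_FALLBACK_PATH = "data/processed/predictions_local.json"


def select_storage_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return "chromadb" when every credential is set and non-empty, else "local_json"."""
    if all(env.get(name) for name in REQUIRED_VARS):
        return "chromadb"
    return "local_json"
```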
### Parquet Files
- `data/api_data/*.parquet` - Cached API responses
- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)
### Joblib Files
- `models/game_predictor.joblib` - Trained ML model
- `data/processed/game_dataset.joblib` - Processed training data
---
## 🌐 Frontend Architecture
**React + Vite with custom CSS design system.**
**Pages:**
| Page | File | Purpose |
|------|------|---------|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |
**Key Frontend Components:**
- `TeamLogo.jsx` - Official NBA team logos
- `api.js` - API client with base URL handling
- `index.css` - Complete design system (27KB)
---
## 🔧 Configuration (`src/config.py`)
**Critical Settings:**
```python
# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}
# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")
# Feature engineering
FEATURE_CONFIG = {
"rolling_windows": [5, 10, 20],
"min_games_for_features": 5
}
# ELO system
ELO_CONFIG = {
"initial_rating": 1500,
"k_factor": 20,
"home_advantage": 100
}
```
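To show how `rolling_windows` and `min_games_for_features` might interact, here is a small sketch. The function name, the `avg_last_N` feature names, and the choice to average over fewer games when a window is not yet full are assumptions, not the behaviour of `FeatureGenerator`.

```python
from typing import Dict, List, Optional

FEATURE_CONFIG = {
    "rolling_windows": [5, 10, 20],
    "min_games_for_features": 5,
}


def rolling_averages(values: List[float],
                     config: dict = FEATURE_CONFIG) -> Dict[str, Optional[float]]:
    """Average the most recent N values for each window.

    `values` holds a stat from games played before the one being predicted,
    oldest first, so no information from the target game leaks in.
    """
    windows = config["rolling_windows"]
    if len(values) < config["min_games_for_features"]:
        # Too little history: emit placeholders rather than noisy averages.
        return {f"avg_last_{w}": None for w in windows}
    features: Dict[str, Optional[float]] = {}
    for w in windows:
        recent = values[-w:]  # shorter than w early in the season
        features[f"avg_last_{w}"] = sum(recent) / len(recent)
    return features
```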
---
## ⚠️ Known Issues & Technical Debt
1. **ML Model Not Used**: `predict_game()` uses the weighted formula, not the trained `GamePredictor`
2. **Season Hardcoding**: The season string `2025-26` is hardcoded in several places
3. **Fallback Data**: The pipeline keeps hardcoded rosters as a fallback
4. **Function Order**: `warm_starter_cache()` must be defined before the scheduler calls it
---
## 🚀 Deployment Notes
**Hugging Face Spaces:**
- Uses persistent `/data` directory for storage
- Dockerfile copies `models/` and `data/api_data/`
- Git LFS for large files (`.joblib`, `.parquet`)
- Port 7860 for HF Spaces
**Environment Variables:**
```
CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths
```
---
## 📋 Quick Reference: Common Tasks
**Add new API endpoint:**
1. Add route in `server.py` (production) AND `api/api.py` (development)
2. Add frontend call in `web/src/api.js`
3. Create/update page component in `web/src/pages/`
**Modify prediction algorithm:**
1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py`
2. Consider blending with `GamePredictor` model
**Update ML model:**
1. Retrain via `ContinuousLearner.retrain_model()`
2. Or trigger via `POST /api/admin/retrain`
**Add new feature:**
1. Add to `FeatureGenerator` in `feature_engineering.py`
2. Update preprocessing pipeline
3. Retrain model
---
*Last updated: January 2026*