# NBA Sage - Complete Codebase Context for AI Assistants
> **Purpose**: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.
---
## Project Architecture Overview
```
NBA ML/
├── server.py                      # Main production server (Flask + React)
├── api/api.py                     # Development API server
│
├── src/                           # Core Python modules
│   ├── prediction_pipeline.py     # Main prediction orchestrator
│   ├── feature_engineering.py     # ELO + feature generation
│   ├── data_collector.py          # Historical NBA API data
│   ├── live_data_collector.py     # Real-time game data
│   ├── injury_collector.py        # Player injury tracking
│   ├── prediction_tracker.py      # ChromaDB prediction storage
│   ├── auto_trainer.py            # Automated training scheduler
│   ├── continuous_learner.py      # Incremental model updates
│   ├── preprocessing.py           # Data preprocessing
│   ├── config.py                  # Global configuration
│   └── models/
│       ├── game_predictor.py      # XGBoost + LightGBM ensemble
│       ├── mvp_predictor.py       # MVP prediction model
│       └── championship_predictor.py
│
├── web/                           # React Frontend
│   └── src/
│       ├── App.jsx                # Main app with sidebar navigation
│       ├── pages/                 # LiveGames, Predictions, MVP, etc.
│       ├── api.js                 # API client
│       └── index.css              # Comprehensive CSS design system
│
├── data/
│   ├── api_data/                  # Cached NBA API responses (parquet)
│   ├── processed/                 # Processed datasets (joblib)
│   └── raw/                       # Raw game data
│
└── models/
    └── game_predictor.joblib      # Trained ML model
```
---
## Data Flow
```
NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
                                                            │
                                                            ▼
Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
                                        │
                                        ▼
                        Prediction Tracker (ChromaDB)
```
---
## Core Components Deep Dive
### 1. `server.py` - Production Server (39KB, 929 lines)
**Critical for Hugging Face deployment. Combines Flask API + React static serving.**
**Key Sections:**
- **Cache Configuration (lines 30-40)**: In-memory caching for rosters, predictions, live games
- **Startup Cache Warming (lines 140-225)**: `warm_starter_cache()` fetches all 30 team rosters on startup
- **Background Scheduler (lines 340-370)**: APScheduler jobs for ELO updates, retraining, prediction sync
- **API Endpoints (lines 400-860)**: All REST endpoints for frontend
**Important Functions:**
```python
warm_starter_cache() # Fetches real NBA API data for all teams
startup_cache_warming() # Runs synchronously on server start
auto_retrain_model() # Smart retraining after all daily games complete
sync_prediction_results() # Updates prediction correctness from final scores
update_elo_ratings() # Daily ELO recalculation
```
**Endpoints:**
- `GET /api/live-games` - Today's games with predictions
- `GET /api/roster/<team>` - Team's projected starting 5
- `GET /api/accuracy` - Model accuracy statistics
- `GET /api/mvp` - MVP race standings
- `GET /api/championship` - Championship odds
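The in-memory caching mentioned above can be sketched as a simple TTL store. A minimal illustration only; the class and key names here are assumptions, not the actual `server.py` implementation:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry expiry."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # evict stale entry
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

# Rosters change rarely, so a long TTL is reasonable (sample data below)
roster_cache = TTLCache(ttl_seconds=3600)
roster_cache.set("BOS", ["PG", "SG", "SF", "PF", "C"])
```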
---
### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)
**The heart of the system. Orchestrates all predictions.**
**Key Properties:**
```python
self.live_collector # LiveDataCollector instance
self.injury_collector # InjuryCollector instance
self.feature_gen # FeatureGenerator instance
self.tracker # PredictionTracker (ChromaDB)
self._game_model # Lazy-loaded GamePredictor
```
**Important Methods:**
| Method | Purpose |
|--------|---------|
| `predict_game(home, away)` | Generate single game prediction |
| `get_upcoming_games(days)` | Fetch future NBA schedule |
| `get_mvp_race()` | Calculate MVP standings from live stats |
| `get_championship_odds()` | Calculate championship probabilities |
| `get_team_roster(team)` | Fast fallback roster data |
**⚠️ CRITICAL: Prediction Algorithm (lines 349-504)**
The `predict_game()` method uses a **formula-based approach**, NOT the trained ML model:
```python
# Weights in predict_game():
home_talent = (
0.40 * home_win_pct + # Current season record
0.30 * home_form + # Last 10 games
0.20 * home_elo_strength + # Historical ELO
0.10 * 0.5 # Baseline
)
# Plus: +3.5% home court, -2% per injury point
# Uses Log5 formula for head-to-head probability
```
The trained `GamePredictor` model exists but is NOT called for live predictions.
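Putting the weights and Log5 step together, the formula-based prediction can be sketched as below. The input values are made up, and the per-injury-point deduction is omitted for brevity:

```python
def talent_score(win_pct, form, elo_strength):
    # Weighted blend matching the predict_game() weights above
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5

def log5(p_home, p_away):
    # Log5 head-to-head formula: probability that home beats away
    num = p_home * (1 - p_away)
    return num / (num + p_away * (1 - p_home))

# Illustrative inputs: a strong home team vs. an average road team
home = talent_score(win_pct=0.65, form=0.70, elo_strength=0.60) + 0.035  # +3.5% home court
away = talent_score(win_pct=0.50, form=0.40, elo_strength=0.55)
p_home_win = log5(home, away)
```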
---
### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)
**Contains ELO system and all feature generation logic.**
**Classes:**
| Class | Purpose | Key Methods |
|-------|---------|-------------|
| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |
**ELO Configuration:**
```python
initial_rating = 1500
k_factor = 20
home_advantage = 100 # ELO points for home court
regression_factor = 0.25 # Season regression to mean
```
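With that configuration, a single-game rating update follows the standard ELO scheme. A sketch of the math only; the actual `ELOCalculator` internals may differ:

```python
def expected_score(rating_home, rating_away, home_advantage=100):
    # Logistic ELO expectation; home team gets a +100 rating bonus
    return 1 / (1 + 10 ** ((rating_away - (rating_home + home_advantage)) / 400))

def update_ratings(home_elo, away_elo, home_won, k=20):
    # Zero-sum update: winner gains what the loser gives up
    exp_home = expected_score(home_elo, away_elo)
    actual = 1.0 if home_won else 0.0
    delta = k * (actual - exp_home)
    return home_elo + delta, away_elo - delta

new_home, new_away = update_ratings(1500, 1500, home_won=True)
```

At equal ratings the home side is already favored (~64% expected score), so a home win moves the ratings only about 7 points.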
**Feature Types Generated:**
- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
- Rolling averages (5, 10, 20 game windows)
- Rest days, back-to-back detection
- Season record features
- Head-to-head history
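The rolling-average features amount to trailing means over the configured windows. A pure-Python illustration; the real `FeatureGenerator` presumably operates on DataFrames, and the skip-short-history behavior mirrors `min_games_for_features`:

```python
def rolling_means(values, windows=(5, 10, 20)):
    # Trailing average over the most recent N games, skipping any window
    # larger than the available history
    return {n: sum(values[-n:]) / n for n in windows if len(values) >= n}

# 10 games of history: only the 5- and 10-game windows are computable
points = [102, 110, 98, 115, 107, 99, 111, 104, 120, 95]
features = rolling_means(points)
```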
---
### 4. `data_collector.py` - Historical Data (27KB, 650 lines)
**Collects comprehensive NBA data from official API.**
**Classes:**
| Class | Data Collected |
|-------|---------------|
| `GameDataCollector` | Game results per season |
| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
| `PlayerDataCollector` | Player stats |
| `CacheManager` | Parquet file caching |
**Key Features:**
- Exponential backoff retry for rate limiting
- Per-season parquet caching
- Checkpoint system for resumable collection
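The exponential backoff behavior can be sketched as a generic retry helper. The delays, jitter, and exception handling here are illustrative, not the collector's actual values:

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    # Retry a rate-limited call, doubling the delay each attempt
    # and adding jitter so parallel collectors don't retry in lockstep
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```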
---
### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)
**Uses `nba_api.live` endpoints for real-time game data.**
**Key Methods:**
```python
get_live_scoreboard() # Today's games with live scores
get_game_boxscore(id) # Detailed box score
get_games_by_status() # Filter: NOT_STARTED, IN_PROGRESS, FINAL
```
**Data Fields Returned:**
- game_id, game_code
- home_team, away_team (tricodes)
- home_score, away_score
- period, clock
- status
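Filtering the scoreboard payload by status is a simple predicate over those returned fields (the sample games below are made up):

```python
def games_by_status(games, status):
    # Keep only games in the requested state:
    # NOT_STARTED, IN_PROGRESS, or FINAL
    return [g for g in games if g["status"] == status]

games = [
    {"game_id": "001", "home_team": "BOS", "away_team": "NYK", "status": "FINAL"},
    {"game_id": "002", "home_team": "LAL", "away_team": "GSW", "status": "IN_PROGRESS"},
]
live = games_by_status(games, "IN_PROGRESS")
```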
---
### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines)
**Stores predictions and tracks accuracy using ChromaDB Cloud.**
**Features:**
- ChromaDB Cloud integration (with local JSON fallback)
- Prediction storage before games start
- Result updating after games complete
- Comprehensive accuracy statistics
**Key Methods:**
```python
save_prediction(game_id, prediction) # Store pre-game prediction
update_result(game_id, winner, scores) # Update with final result
get_accuracy_stats() # Overall, by confidence, by team
get_pending_predictions() # Awaiting results
```
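The accuracy breakdown can be sketched as an aggregation over stored predictions. The 0.65 high-confidence threshold and record shape below are assumptions, not necessarily the tracker's actual bucketing:

```python
def accuracy_stats(predictions):
    # Overall accuracy plus a per-confidence-bucket breakdown;
    # predictions without a result yet (correct=None) are excluded
    resolved = [p for p in predictions if p.get("correct") is not None]
    if not resolved:
        return {"overall": None, "by_confidence": {}}
    overall = sum(p["correct"] for p in resolved) / len(resolved)
    buckets = {}
    for p in resolved:
        bucket = "high" if p["confidence"] >= 0.65 else "low"
        buckets.setdefault(bucket, []).append(p["correct"])
    by_conf = {b: sum(v) / len(v) for b, v in buckets.items()}
    return {"overall": overall, "by_confidence": by_conf}

preds = [
    {"confidence": 0.70, "correct": True},
    {"confidence": 0.70, "correct": False},
    {"confidence": 0.55, "correct": True},
    {"confidence": 0.60, "correct": None},  # still pending
]
stats = accuracy_stats(preds)
```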
---
### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines)
**XGBoost + LightGBM ensemble classifier.**
**Architecture:**
```
Input Features ──┬──► XGBoost ──┐
                 │              ├──► Weighted Average ──► Win Probability
                 └──► LightGBM ─┘
                     (50/50 weight)
```
**Key Methods:**
```python
train(X_train, y_train, X_val, y_val) # Train both models
predict_proba(X) # Get [loss_prob, win_prob]
predict_with_confidence(X) # Detailed prediction info
explain_prediction(X) # Feature importance for prediction
save() / load() # Persist to models/game_predictor.joblib
```
**⚠️ NOTE: Model exists but `predict_game()` doesn't use it!**
---
### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training
**AutoTrainer** (Singleton scheduler):
- Runs background loop checking for tasks
- Ingests completed games every hour
- Smart retraining: only after ALL daily games complete
- If new accuracy < old accuracy, reverts model
**ContinuousLearner** (Update workflow):
```
ingest_completed_games() ──► update_features() ──► retrain_model()
```
---
## Database & Storage
### ChromaDB Cloud
- **Purpose**: Persistent prediction storage
- **Credentials**: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
- **Fallback**: `data/processed/predictions_local.json`
### Parquet Files
- `data/api_data/*.parquet` - Cached API responses
- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)
### Joblib Files
- `models/game_predictor.joblib` - Trained ML model
- `data/processed/game_dataset.joblib` - Processed training data
---
## Frontend Architecture
**React + Vite with custom CSS design system.**
**Pages:**
| Page | File | Purpose |
|------|------|---------|
| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
| Predictions | `Predictions.jsx` | Upcoming games with predictions |
| Head to Head | `HeadToHead.jsx` | Compare two teams |
| Accuracy | `Accuracy.jsx` | Model performance stats |
| MVP Race | `MvpRace.jsx` | Current MVP standings |
| Championship | `Championship.jsx` | Championship odds |
**Key Frontend Components:**
- `TeamLogo.jsx` - Official NBA team logos
- `api.js` - API client with base URL handling
- `index.css` - Complete design system (27KB)
---
## Configuration (`src/config.py`)
**Critical Settings:**
```python
# NBA Teams mapping (team_id -> tricode)
NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}
# Data paths
API_CACHE_DIR = Path("data/api_data")
PROCESSED_DATA_DIR = Path("data/processed")
MODELS_DIR = Path("models")
# Feature engineering
FEATURE_CONFIG = {
"rolling_windows": [5, 10, 20],
"min_games_for_features": 5
}
# ELO system
ELO_CONFIG = {
"initial_rating": 1500,
"k_factor": 20,
"home_advantage": 100
}
```
---
## ⚠️ Known Issues & Technical Debt
1. **ML Model Not Used**: `predict_game()` uses formula, not trained `GamePredictor`
2. **Season Hardcoding**: Some places use `2025-26` explicitly
3. **Fallback Data**: Pipeline has hardcoded rosters as backup
4. **Function Order**: `warm_starter_cache()` must be defined before scheduler calls it
---
## Deployment Notes
**Hugging Face Spaces:**
- Uses persistent `/data` directory for storage
- Dockerfile copies `models/` and `data/api_data/`
- Git LFS for large files (`.joblib`, `.parquet`)
- Port 7860 for HF Spaces
**Environment Variables:**
```
CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB
NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths
```
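The path overrides follow the usual env-var-with-fallback pattern. A sketch only; `config.py` may resolve these differently:

```python
import os
from pathlib import Path

# Honor NBA_ML_DATA_DIR / NBA_ML_MODELS_DIR if set, otherwise fall back
# to the repo-relative defaults used in config.py
DATA_DIR = Path(os.environ.get("NBA_ML_DATA_DIR", "data"))
MODELS_DIR = Path(os.environ.get("NBA_ML_MODELS_DIR", "models"))
```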
---
## Quick Reference: Common Tasks
**Add new API endpoint:**
1. Add route in `server.py` (production) AND `api/api.py` (development)
2. Add frontend call in `web/src/api.js`
3. Create/update page component in `web/src/pages/`
**Modify prediction algorithm:**
1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py`
2. Consider blending with `GamePredictor` model
**Update ML model:**
1. Retrain via `ContinuousLearner.retrain_model()`
2. Or trigger via `POST /api/admin/retrain`
**Add new feature:**
1. Add to `FeatureGenerator` in `feature_engineering.py`
2. Update preprocessing pipeline
3. Retrain model
---
*Last updated: January 2026*
|