NBA_PREDICTOR / claude.md

jashdoshi77

Add analytics, confidence meter, enhanced H2H, daily MVP refresh

3e6f1d3 about 2 months ago


	# NBA Sage - Complete Codebase Context for AI Assistants

	> Purpose: This document provides comprehensive context for AI assistants to understand the NBA Sage prediction system. Read this entire document before making any code modifications.

	---

	## 🏗️ Project Architecture Overview

	```
	NBA ML/
	├── server.py                      # Main production server (Flask + React)
	├── api/api.py                     # Development API server
	│
	├── src/                           # Core Python modules
	│   ├── prediction_pipeline.py     # Main prediction orchestrator
	│   ├── feature_engineering.py     # ELO + feature generation
	│   ├── data_collector.py          # Historical NBA API data
	│   ├── live_data_collector.py     # Real-time game data
	│   ├── injury_collector.py        # Player injury tracking
	│   ├── prediction_tracker.py      # ChromaDB prediction storage
	│   ├── auto_trainer.py            # Automated training scheduler
	│   ├── continuous_learner.py      # Incremental model updates
	│   ├── preprocessing.py           # Data preprocessing
	│   ├── config.py                  # Global configuration
	│   └── models/
	│       ├── game_predictor.py      # XGBoost+LightGBM ensemble
	│       ├── mvp_predictor.py       # MVP prediction model
	│       └── championship_predictor.py
	│
	├── web/                           # React Frontend
	│   └── src/
	│       ├── App.jsx                # Main app with sidebar navigation
	│       ├── pages/                 # LiveGames, Predictions, MVP, etc.
	│       ├── api.js                 # API client
	│       └── index.css              # Comprehensive CSS design system
	│
	├── data/
	│   ├── api_data/                  # Cached NBA API responses (parquet)
	│   ├── processed/                 # Processed datasets (joblib)
	│   └── raw/                       # Raw game data
	│
	└── models/
	    └── game_predictor.joblib      # Trained ML model
	```

	---

	## 🔄 Data Flow

	```
	NBA API ──► Data Collectors ──► Feature Engineering ──► ML Training
	                                         │
	                                         ▼
	Live API ──► Live Collector ──► Prediction Pipeline ──► Flask API ──► React UI
	                                         │
	                                         ▼
	                           Prediction Tracker (ChromaDB)
	```

	---

	## 📦 Core Components Deep Dive

	### 1. `server.py` - Production Server (39KB, 929 lines)

	Critical for Hugging Face deployment. Combines Flask API + React static serving.

	Key Sections:
	- Cache Configuration (lines 30-40): In-memory caching for rosters, predictions, live games
	- Startup Cache Warming (lines 140-225): `warm_starter_cache()` fetches all 30 team rosters on startup
	- Background Scheduler (lines 340-370): APScheduler jobs for ELO updates, retraining, prediction sync
	- API Endpoints (lines 400-860): All REST endpoints for frontend
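
The in-memory caching and startup warming listed above can be illustrated with a minimal sketch. This is not the code in `server.py`: the `TTLCache` class, the `warm_cache` helper, the `roster:` key prefix, and the `fetch_roster` callback are all hypothetical names chosen for illustration, and the real cache may not use time-based expiry at all.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock(), value)


def warm_cache(cache: TTLCache, teams: Iterable[str],
               fetch_roster: Callable[[str], Any]) -> int:
    """Pre-populate the cache for every team; returns how many fetches succeeded."""
    warmed = 0
    for team in teams:
        try:
            cache.set(f"roster:{team}", fetch_roster(team))
            warmed += 1
        except Exception:
            # One failing team should not abort startup warming.
            continue
    return warmed
```

Swallowing per-team errors keeps a single slow or failing upstream call from blocking server start, at the cost of a cold cache entry for that team.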

	Important Functions:
	```python
	warm_starter_cache() # Fetches real NBA API data for all teams
	startup_cache_warming() # Runs synchronously on server start
	auto_retrain_model() # Smart retraining after all daily games complete
	sync_prediction_results() # Updates prediction correctness from final scores
	update_elo_ratings() # Daily ELO recalculation
	```

	Endpoints:
	- `GET /api/live-games` - Today's games with predictions
	- `GET /api/roster/<team>` - Team's projected starting 5
	- `GET /api/accuracy` - Model accuracy statistics
	- `GET /api/mvp` - MVP race standings
	- `GET /api/championship` - Championship odds

	---

	### 2. `prediction_pipeline.py` - Prediction Orchestrator (41KB, 765 lines)

	The heart of the system. Orchestrates all predictions.

	Key Properties:
	```python
	self.live_collector # LiveDataCollector instance
	self.injury_collector # InjuryCollector instance
	self.feature_gen # FeatureGenerator instance
	self.tracker # PredictionTracker (ChromaDB)
	self._game_model # Lazy-loaded GamePredictor
	```

	Important Methods:

	| Method | Purpose |
	|--------|---------|
	| `predict_game(home, away)` | Generate single game prediction |
	| `get_upcoming_games(days)` | Fetch future NBA schedule |
	| `get_mvp_race()` | Calculate MVP standings from live stats |
	| `get_championship_odds()` | Calculate championship probabilities |
	| `get_team_roster(team)` | Fast fallback roster data |

	⚠️ CRITICAL: Prediction Algorithm (lines 349-504)

	The `predict_game()` method uses a formula-based approach, NOT the trained ML model:

	```python
	# Weights in predict_game():
	home_talent = (
	    0.40 * home_win_pct +       # Current season record
	    0.30 * home_form +          # Last 10 games
	    0.20 * home_elo_strength +  # Historical ELO
	    0.10 * 0.5                  # Baseline
	)
	# Plus: +3.5% home court, -2% per injury point
	# Uses Log5 formula for head-to-head probability
	```

	The trained `GamePredictor` model exists but is NOT called for live predictions.
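
To make the formula concrete, here is a minimal sketch of the weighted talent score followed by the Log5 step. The weights come from this document. Everything else is an assumption: the function names are hypothetical, and the way the home-court and injury adjustments are applied (additively to the talent score, before Log5, then clamped) is a guess, since the actual order of operations lives in `predict_game()`.

```python
def talent_score(win_pct: float, form: float, elo_strength: float) -> float:
    """Weighted blend from this document; all inputs are expected in [0, 1]."""
    return 0.40 * win_pct + 0.30 * form + 0.20 * elo_strength + 0.10 * 0.5


def log5(p_a: float, p_b: float) -> float:
    """Log5 estimate of the probability that A beats B."""
    denom = p_a + p_b - 2 * p_a * p_b
    if denom == 0:  # both strengths are 0 or both are 1: treat as a coin flip
        return 0.5
    return (p_a - p_a * p_b) / denom


def clamp(value: float, low: float = 0.01, high: float = 0.99) -> float:
    """Keep strengths away from 0 and 1 so Log5 stays well defined."""
    return max(low, min(high, value))


def home_win_probability(home: dict, away: dict,
                         home_injury: float = 0.0,
                         away_injury: float = 0.0) -> float:
    # ASSUMPTION: +3.5% home court and -2% per injury point are added to the
    # talent score before Log5; the real pipeline may apply them differently.
    home_strength = clamp(talent_score(**home) + 0.035 - 0.02 * home_injury)
    away_strength = clamp(talent_score(**away) - 0.02 * away_injury)
    return log5(home_strength, away_strength)
```

Under these assumptions, two evenly matched teams (all inputs 0.5, no injuries) give the home side a probability of about 0.535, which is just the home-court bonus showing through.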

	---

	### 3. `feature_engineering.py` - Feature Generation (29KB, 696 lines)

	Contains ELO system and all feature generation logic.

	Classes:

	| Class | Purpose | Key Methods |
	|-------|---------|-------------|
	| `ELOCalculator` | ELO rating system | `update_ratings()`, `calculate_game_features()` |
	| `EraNormalizer` | Z-score normalization across seasons | `fit_season()`, `transform()` |
	| `StatLoader` | Load all stat types | `get_team_season_stats()`, `get_team_top_players_stats()` |
	| `FeatureGenerator` | Main feature orchestrator | `generate_game_features()`, `generate_features_for_dataset()` |
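
As an illustration of per-season z-score normalization, the idea behind `EraNormalizer`, here is a minimal sketch. The class name `SeasonZScore` is hypothetical and the method signatures are guesses modeled on `fit_season()` and `transform()`; the real class likely works on DataFrames rather than plain lists.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


class SeasonZScore:
    """Stores a mean and standard deviation per season, then standardizes values."""

    def __init__(self) -> None:
        self._params: Dict[str, Tuple[float, float]] = {}

    def fit_season(self, season: str, values: List[float]) -> None:
        # Population standard deviation: the season's values are the full set.
        self._params[season] = (mean(values), pstdev(values))

    def transform(self, season: str, value: float) -> float:
        mu, sigma = self._params[season]
        if sigma == 0:
            return 0.0  # every value in the season was identical
        return (value - mu) / sigma
```

The point of normalizing within a season is that a raw stat such as points per game becomes comparable across eras: a value one standard deviation above its own season's mean maps to the same score regardless of league-wide scoring levels.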

	ELO Configuration:
	```python
	initial_rating = 1500
	k_factor = 20
	home_advantage = 100 # ELO points for home court
	regression_factor = 0.25 # Season regression to mean
	```
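
The sketch below shows how these four settings typically interact in a standard Elo system. It uses the textbook Elo formulas, which may differ in detail from `ELOCalculator`. The function names are hypothetical, and reading `regression_factor` as "move 25% of the way back toward the initial rating between seasons" is an assumption.

```python
from typing import Tuple


def expected_home_win(home_elo: float, away_elo: float,
                      home_advantage: float = 100.0) -> float:
    """Standard Elo expectation, with the home bonus added to the home rating."""
    diff = (home_elo + home_advantage) - away_elo
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))


def update_ratings(home_elo: float, away_elo: float, home_won: bool,
                   k_factor: float = 20.0,
                   home_advantage: float = 100.0) -> Tuple[float, float]:
    """Zero-sum update: whatever the home team gains, the away team loses."""
    expected = expected_home_win(home_elo, away_elo, home_advantage)
    actual = 1.0 if home_won else 0.0
    delta = k_factor * (actual - expected)
    return home_elo + delta, away_elo - delta


def regress_to_mean(rating: float, initial_rating: float = 1500.0,
                    regression_factor: float = 0.25) -> float:
    # ASSUMPTION: the factor is the fraction of the gap to the mean that is closed.
    return rating + regression_factor * (initial_rating - rating)
```

With equal ratings, a 100-point home advantage corresponds to an expected home win rate of roughly 0.64, so a home win moves each rating by about 7 points at `k_factor = 20`.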

	Feature Types Generated:
	- ELO features (team_elo, opponent_elo, elo_diff, elo_expected_win)
	- Rolling averages (5, 10, 20 game windows)
	- Rest days, back-to-back detection
	- Season record features
	- Head-to-head history
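
Two of the feature types above, rolling averages and rest-day or back-to-back detection, can be sketched with plain Python. The helper names are hypothetical, and the real `FeatureGenerator` probably operates on DataFrames. The one design point worth copying is that each feature only looks at games strictly before the one being predicted, so no outcome leaks into its own features.

```python
from datetime import date
from typing import List, Optional


def rolling_mean_before(values: List[float], index: int, window: int) -> Optional[float]:
    """Mean of up to `window` values strictly before `index` (no leakage)."""
    prior = values[max(0, index - window):index]
    if not prior:
        return None  # first game of the sequence: nothing to average yet
    return sum(prior) / len(prior)


def rest_days(game_dates: List[date], index: int) -> Optional[int]:
    """Full days off between game `index` and the previous game."""
    if index == 0:
        return None
    return (game_dates[index] - game_dates[index - 1]).days - 1


def is_back_to_back(game_dates: List[date], index: int) -> bool:
    """True when the team also played on the previous calendar day."""
    return rest_days(game_dates, index) == 0
```

Calling `rolling_mean_before` once per window size (5, 10, 20) yields the three rolling features for a given stat.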

	---

	### 4. `data_collector.py` - Historical Data (27KB, 650 lines)

	Collects comprehensive NBA data from official API.

	Classes:

	| Class | Data Collected |
	|-------|---------------|
	| `GameDataCollector` | Game results per season |
	| `TeamDataCollector` | Team stats (basic, advanced, clutch, hustle, defense) |
	| `PlayerDataCollector` | Player stats |
	| `CacheManager` | Parquet file caching |

	Key Features:
	- Exponential backoff retry for rate limiting
	- Per-season parquet caching
	- Checkpoint system for resumable collection
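
Exponential backoff for rate-limited requests can be sketched as follows. This is a generic pattern, not the collector's actual code: `retry_with_backoff` and its parameters are hypothetical, and the delays, attempt count, and jitter amount are illustrative defaults.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(fetch: Callable[[], T],
                       max_attempts: int = 5,
                       base_delay: float = 1.0,
                       max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `fetch`, doubling the wait after each failure, up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                delay = min(max_delay, base_delay * 2 ** attempt)
                # Up to 10% jitter so parallel workers do not retry in lockstep.
                sleep(delay + random.uniform(0, delay * 0.1))
    assert last_error is not None
    raise last_error
```

Passing `sleep` as a parameter makes the helper testable without real waiting; in production the default `time.sleep` applies.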

	---

	### 5. `live_data_collector.py` - Real-Time Data (9KB, 236 lines)

	Uses `nba_api.live` endpoints for real-time game data.

	Key Methods:
	```python
	get_live_scoreboard() # Today's games with live scores
	get_game_boxscore(id) # Detailed box score
	get_games_by_status() # Filter: NOT_STARTED, IN_PROGRESS, FINAL
	```

	Data Fields Returned:
	- game_id, game_code
	- home_team, away_team (tricodes)
	- home_score, away_score
	- period, clock
	- status
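
A minimal sketch of filtering by status, in the spirit of `get_games_by_status()`, is shown below. It works on plain dictionaries shaped like the field list above; the function name `group_games_by_status` and the exact dictionary layout are assumptions, and the real method's signature and return type may differ.

```python
from typing import Dict, List

STATUSES = ("NOT_STARTED", "IN_PROGRESS", "FINAL")


def group_games_by_status(games: List[dict]) -> Dict[str, List[dict]]:
    """Bucket games by their `status` field; unknown statuses are ignored."""
    grouped: Dict[str, List[dict]] = {status: [] for status in STATUSES}
    for game in games:
        status = game.get("status")
        if status in grouped:
            grouped[status].append(game)
    return grouped


# Hypothetical sample data using the field names listed in this section.
sample_games = [
    {"game_id": "g1", "home_team": "BOS", "away_team": "LAL",
     "home_score": 0, "away_score": 0, "period": 0, "clock": "", "status": "NOT_STARTED"},
    {"game_id": "g2", "home_team": "ATL", "away_team": "MIA",
     "home_score": 101, "away_score": 99, "period": 4, "clock": "", "status": "FINAL"},
]
final_ids = [g["game_id"] for g in group_games_by_status(sample_games)["FINAL"]]
```

Returning every bucket, including empty ones, lets callers index by status without checking for missing keys.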

	---

	### 6. `prediction_tracker.py` - Persistence (20KB, 508 lines)

	Stores predictions and tracks accuracy using ChromaDB Cloud.

	Features:
	- ChromaDB Cloud integration (with local JSON fallback)
	- Prediction storage before games start
	- Result updating after games complete
	- Comprehensive accuracy statistics

	Key Methods:
	```python
	save_prediction(game_id, prediction) # Store pre-game prediction
	update_result(game_id, winner, scores) # Update with final result
	get_accuracy_stats() # Overall, by confidence, by team
	get_pending_predictions() # Awaiting results
	```
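
The save, update, and accuracy workflow can be sketched against the local JSON fallback alone. The class `LocalPredictionStore`, its simplified method signatures, and the record layout are hypothetical; the real `PredictionTracker` talks to ChromaDB Cloud first and stores richer prediction objects.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class LocalPredictionStore:
    """JSON-file store keyed by game_id (illustrative stand-in for the fallback)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _load(self) -> Dict[str, dict]:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def _dump(self, records: Dict[str, dict]) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(records, indent=2))

    def save_prediction(self, game_id: str, predicted_winner: str, confidence: float) -> None:
        records = self._load()
        records[game_id] = {"predicted_winner": predicted_winner,
                            "confidence": confidence,
                            "actual_winner": None}
        self._dump(records)

    def update_result(self, game_id: str, winner: str) -> None:
        records = self._load()
        if game_id in records:  # ignore results for games never predicted
            records[game_id]["actual_winner"] = winner
            self._dump(records)

    def get_pending_predictions(self) -> List[str]:
        return [gid for gid, r in self._load().items() if r["actual_winner"] is None]

    def get_accuracy_stats(self) -> Dict[str, Optional[float]]:
        done = [r for r in self._load().values() if r["actual_winner"] is not None]
        correct = sum(r["predicted_winner"] == r["actual_winner"] for r in done)
        accuracy = correct / len(done) if done else None
        return {"total": len(done), "correct": correct, "accuracy": accuracy}
```

Only games with a recorded result count toward accuracy, so pending predictions never dilute the statistic.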

	---

	### 7. `models/game_predictor.py` - ML Model (12KB, 332 lines)

	XGBoost + LightGBM ensemble classifier.

	Architecture:
	```
	Input Features ──┬──► XGBoost ──┐
	                 │              │──► Weighted Average ──► Win Probability
	                 └──► LightGBM ─┘
	                                     (50/50 weight)
	```
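
The 50/50 weighted average in the diagram amounts to an element-wise blend of the two models' probability outputs. The sketch below uses plain lists in place of the arrays that XGBoost and LightGBM return; `blend_probabilities` is a hypothetical name, not a method of `GamePredictor`.

```python
from typing import List, Sequence


def blend_probabilities(xgb_proba: Sequence[Sequence[float]],
                        lgbm_proba: Sequence[Sequence[float]],
                        xgb_weight: float = 0.5) -> List[List[float]]:
    """Weighted average of two models' [loss_prob, win_prob] rows."""
    if len(xgb_proba) != len(lgbm_proba):
        raise ValueError("both models must score the same number of games")
    lgbm_weight = 1.0 - xgb_weight
    return [
        [xgb_weight * a + lgbm_weight * b for a, b in zip(row_x, row_l)]
        for row_x, row_l in zip(xgb_proba, lgbm_proba)
    ]
```

Because the weights sum to 1 and each input row sums to 1, every blended row is still a valid probability distribution.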

	Key Methods:
	```python
	train(X_train, y_train, X_val, y_val) # Train both models
	predict_proba(X) # Get [loss_prob, win_prob]
	predict_with_confidence(X) # Detailed prediction info
	explain_prediction(X) # Feature importance for prediction
	save() / load() # Persist to models/game_predictor.joblib
	```

	⚠️ NOTE: Model exists but `predict_game()` doesn't use it!

	---

	### 8. `auto_trainer.py` & `continuous_learner.py` - Auto Training

	AutoTrainer (Singleton scheduler):
	- Runs background loop checking for tasks
	- Ingests completed games every hour
	- Smart retraining: only after ALL daily games complete
	- If new accuracy < old accuracy, reverts model

	ContinuousLearner (Update workflow):
	```
	ingest_completed_games() ──► update_features() ──► retrain_model()
	```
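
The two guards described for `AutoTrainer`, retraining only once all daily games are complete and reverting when accuracy drops, can be sketched as below. Function names and the callback-based design are hypothetical; the `status == "FINAL"` check reuses the status values from the live data collector section.

```python
import copy
from typing import Any, Callable, List, Tuple


def all_games_final(games: List[dict]) -> bool:
    """True only when there is at least one game and every game is FINAL."""
    return bool(games) and all(g.get("status") == "FINAL" for g in games)


def retrain_with_rollback(current_model: Any,
                          train_fn: Callable[[Any], Any],
                          evaluate_fn: Callable[[Any], float]) -> Tuple[Any, bool]:
    """Train a candidate on a copy; keep it unless its accuracy is lower.

    Returns (model_to_keep, kept_new_model).
    """
    old_accuracy = evaluate_fn(current_model)
    # Train on a deep copy so a rejected candidate cannot mutate the live model.
    candidate = train_fn(copy.deepcopy(current_model))
    new_accuracy = evaluate_fn(candidate)
    if new_accuracy < old_accuracy:
        return current_model, False  # revert: keep the previous model
    return candidate, True
```

A tie keeps the new model, matching the rule that the revert happens only when new accuracy is strictly lower than old accuracy.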

	---

	## 🗄️ Database & Storage

	### ChromaDB Cloud
	- Purpose: Persistent prediction storage
	- Credentials: Set via environment variables (`CHROMA_TENANT`, `CHROMA_DATABASE`, `CHROMA_API_KEY`)
	- Fallback: `data/processed/predictions_local.json`

	### Parquet Files
	- `data/api_data/*.parquet` - Cached API responses
	- `data/api_data/all_games_summary.parquet` - Consolidated game history (41K+ games)

	### Joblib Files
	- `models/game_predictor.joblib` - Trained ML model
	- `data/processed/game_dataset.joblib` - Processed training data

	---

	## 🌐 Frontend Architecture

	React + Vite with custom CSS design system.

	Pages:

	| Page | File | Purpose |
	|------|------|---------|
	| Live Games | `LiveGames.jsx` | Today's games, live scores, predictions |
	| Predictions | `Predictions.jsx` | Upcoming games with predictions |
	| Head to Head | `HeadToHead.jsx` | Compare two teams |
	| Accuracy | `Accuracy.jsx` | Model performance stats |
	| MVP Race | `MvpRace.jsx` | Current MVP standings |
	| Championship | `Championship.jsx` | Championship odds |

	Key Frontend Components:
	- `TeamLogo.jsx` - Official NBA team logos
	- `api.js` - API client with base URL handling
	- `index.css` - Complete design system (27KB)

	---

	## 🔧 Configuration (`src/config.py`)

	Critical Settings:
	```python
	from pathlib import Path

	# NBA Teams mapping (team_id -> tricode)
	NBA_TEAMS = {1610612737: "ATL", 1610612738: "BOS", ...}

	# Data paths
	API_CACHE_DIR = Path("data/api_data")
	PROCESSED_DATA_DIR = Path("data/processed")
	MODELS_DIR = Path("models")

	# Feature engineering
	FEATURE_CONFIG = {
	    "rolling_windows": [5, 10, 20],
	    "min_games_for_features": 5,
	}

	# ELO system
	ELO_CONFIG = {
	    "initial_rating": 1500,
	    "k_factor": 20,
	    "home_advantage": 100,
	}
	```

	---

	## ⚠️ Known Issues & Technical Debt

	1. ML Model Not Used: `predict_game()` uses formula, not trained `GamePredictor`
	2. Season Hardcoding: The season string `2025-26` is hardcoded in several places
	3. Fallback Data: Pipeline has hardcoded rosters as backup
	4. Function Order: `warm_starter_cache()` must be defined before scheduler calls it
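
For the season hardcoding issue, one possible remedy is to derive the season string from the current date. The sketch below is a suggestion, not existing code: `season_string` is a hypothetical helper, and treating October as the first month of a new season is an assumption that may need adjusting for unusual schedules.

```python
from datetime import date


def season_string(today: date, season_start_month: int = 10) -> str:
    """Return an NBA-style season label such as '2025-26' for the given date."""
    # ASSUMPTION: a season that starts in year Y runs from October Y into Y+1.
    start_year = today.year if today.month >= season_start_month else today.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"
```

Calling `season_string(date.today())` in one shared place, for example in `config.py`, would replace the scattered literals.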

	---

	## 🚀 Deployment Notes

	Hugging Face Spaces:
	- Uses persistent `/data` directory for storage
	- Dockerfile copies `models/` and `data/api_data/`
	- Git LFS for large files (`.joblib`, `.parquet`)
	- Port 7860 for HF Spaces

	Environment Variables:
	```
	CHROMA_TENANT, CHROMA_DATABASE, CHROMA_API_KEY # ChromaDB
	NBA_ML_DATA_DIR, NBA_ML_MODELS_DIR # Override paths
	```
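
A minimal sketch of how the path overrides might be resolved is shown below. The `resolve_dir` helper is hypothetical, and so is the pairing of each variable with a default (`NBA_ML_DATA_DIR` with `data`, `NBA_ML_MODELS_DIR` with `models`); check `src/config.py` for the actual behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_dir(env_var: str, default: str,
                environ: Optional[Mapping[str, str]] = None) -> Path:
    """Use the environment variable when set and non-blank, else the default."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "").strip()
    return Path(value) if value else Path(default)


# ASSUMPTION: these defaults mirror the relative paths shown in the config section.
DATA_DIR = resolve_dir("NBA_ML_DATA_DIR", "data")
MODELS_DIR = resolve_dir("NBA_ML_MODELS_DIR", "models")
```

On Hugging Face Spaces, setting `NBA_ML_DATA_DIR=/data` would point storage at the persistent directory mentioned above without code changes.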

	---

	## 📋 Quick Reference: Common Tasks

	Add new API endpoint:
	1. Add route in `server.py` (production) AND `api/api.py` (development)
	2. Add frontend call in `web/src/api.js`
	3. Create/update page component in `web/src/pages/`

	Modify prediction algorithm:
	1. Edit `PredictionPipeline.predict_game()` in `prediction_pipeline.py`
	2. Consider blending with `GamePredictor` model

	Update ML model:
	1. Retrain via `ContinuousLearner.retrain_model()`
	2. Or trigger via `POST /api/admin/retrain`

	Add new feature:
	1. Add to `FeatureGenerator` in `feature_engineering.py`
	2. Update preprocessing pipeline
	3. Retrain model

	---

	Last updated: January 2026