# NBA ML Prediction System - Process Guide
## Prerequisites
Before starting, ensure you have:
- Python 3.10+ installed
- Virtual environment activated: `.\venv\Scripts\activate`
- All dependencies installed: `pip install -r requirements.txt`
---
## Step 1: Collect Training Data (COMPREHENSIVE)
**Purpose**: Fetch 10 seasons of ALL NBA stats from the NBA API, including:
- Games, Team Stats, Player Stats (basic)
- Advanced Metrics (NET_RTG, PACE, PIE, TS%, eFG%)
- Clutch Stats (performance in close games)
- Hustle Stats (deflections, charges, loose balls)
- Defense Stats
**File**: `src/data_collector.py`
**Command**:
```bash
python -m src.data_collector
```
**Duration**: ~2-4 hours (has resume capability if interrupted)
**Output Files** (in `data/raw/`):
- `all_games.parquet` - Game results
- `all_team_stats.parquet` - Basic team stats
- `all_team_advanced.parquet` - NET_RTG, PACE, PIE, TS%
- `all_team_clutch.parquet` - Close game performance
- `all_team_hustle.parquet` - Deflections, charges
- `all_team_defense.parquet` - Defensive metrics
- `all_player_stats.parquet` - Player averages
- `all_player_advanced.parquet` - PER, USG%, TS%
- `all_player_clutch.parquet` - Player clutch stats
- `all_player_hustle.parquet` - Player hustle metrics
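The resume capability mentioned above comes down to a checkpoint pattern: skip any output file that already exists, fetch the rest. A minimal sketch of the idea (the `fetch_season` callable and per-season file layout are illustrative assumptions, not the project's actual API):

```python
from pathlib import Path

def collect_seasons(seasons, out_dir, fetch_season):
    """Fetch each season's stats, skipping files that already exist.

    `fetch_season` is a hypothetical callable returning raw bytes for one
    season; re-running after an interruption resumes where it left off.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for season in seasons:
        target = out_dir / f"games_{season}.parquet"
        if target.exists():          # checkpoint: already collected, skip
            continue
        data = fetch_season(season)  # slow API call (real code retries)
        target.write_bytes(data)     # real code writes a DataFrame instead
        fetched.append(season)
    return fetched
```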
---
## Step 2: Generate Features
**Purpose**: Create 50+ features including ELO ratings, rolling stats, momentum, and rest/fatigue
**File**: `src/feature_engineering.py`
**Command**:
```bash
python -m src.feature_engineering --process
```
**Duration**: ~30-60 minutes
**Output Files**:
- `data/processed/game_features.parquet`
**Features Generated**:
- ELO ratings (team_elo, opponent_elo, elo_diff, elo_win_prob)
- Rolling stats (PTS/AST/REB/FG_PCT last 5/10/20 games)
- Defensive stats (STL, BLK, DREB rolling)
- Momentum (wins_last5, hot_streak, cold_streak, plus_minus)
- Rest/fatigue (days_rest, back_to_back, games_last_week)
- Season averages (all stats)
- Team advanced metrics (NET_RTG, PACE, clutch, hustle)
- Player aggregations (top players avg, star concentration)
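As one example of the feature families above, the ELO features follow the standard logistic rating model. A sketch (the K-factor and scale constants used in `feature_engineering.py` may differ):

```python
def elo_win_prob(team_elo, opp_elo):
    """Expected win probability for the team under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((opp_elo - team_elo) / 400))

def update_elo(team_elo, opp_elo, team_won, k=20):
    """Shift the team's rating toward the game result by at most k points."""
    expected = elo_win_prob(team_elo, opp_elo)
    return team_elo + k * ((1.0 if team_won else 0.0) - expected)
```

`elo_diff` is then simply `team_elo - opponent_elo`, and `elo_win_prob` feeds straight into the feature table.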
---
## Step 3: Build Dataset
**Purpose**: Split data into train/val/test and prepare for training
**File**: `src/preprocessing.py`
**Command**:
```bash
python -m src.preprocessing --build
```
**Output Files**:
- `data/processed/game_dataset.joblib`
**What It Does**:
- Automatically detects ALL numeric features
- Splits by season (no data leakage)
- Scales and imputes missing values
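Splitting by season rather than at random is what prevents leakage: the model never trains on games from a season it is evaluated on. A minimal sketch of the idea (row and column names are illustrative, not the project's schema):

```python
def split_by_season(rows, train_until, val_until):
    """Partition game rows chronologically by season.

    Seasons up to `train_until` train the model, the next block is used
    for validation, and everything later is held out for testing.
    """
    train = [r for r in rows if r["season"] <= train_until]
    val = [r for r in rows if train_until < r["season"] <= val_until]
    test = [r for r in rows if r["season"] > val_until]
    return train, val, test
```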
---
## Step 4: Train Model
**Purpose**: Train XGBoost + LightGBM ensemble on ALL features
**File**: `src/models/game_predictor.py`
**Command**:
```bash
python -m src.models.game_predictor --train
```
**Expected Output**:
```
Loading dataset...
Training XGBoost model...
Training LightGBM model...
Training complete!
=== Test Metrics ===
Test Accuracy: 0.67XX
Test Brier Score: 0.21XX
✓ Target accuracy (>65%) achieved!
=== Top Features ===
feature xgb_importance lgb_importance avg_importance
0 elo_diff 0.XXX 0.XXX 0.XXX
1 elo_win_prob 0.XXX 0.XXX 0.XXX
...
Saved model to models/game_predictor.joblib
```
**Output Files**:
- `models/game_predictor.joblib`
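The ensemble combines the two models' probabilities, and the Brier score in the output above measures how well calibrated those probabilities are. A sketch of both (simple averaging is an assumption; the project may weight XGBoost and LightGBM differently):

```python
def ensemble_prob(xgb_prob, lgb_prob):
    """Average the two models' home-win probabilities."""
    return (xgb_prob + lgb_prob) / 2.0

def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 result.

    Lower is better; 0.25 is the score of always predicting 50%.
    """
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```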
---
## Step 5: Generate Visualizations
**Purpose**: Create analysis charts saved to `graphs/`
**File**: `src/visualization.py`
**Command**:
```bash
python -m src.visualization
```
**Output Files** (in `graphs/`):
- `mvp_race.png`
- `mvp_stat_comparison.png`
- `championship_odds_pie.png`
- `strength_vs_experience.png`
---
## Step 6: Run the Dashboard
**Purpose**: Launch Streamlit web interface
**File**: `app/app.py`
**Command**:
```bash
streamlit run app/app.py
```
**Opens**: `http://localhost:8501`
**Pages**:
- 🔴 Live Games - Real-time scores with predictions
- 🎮 Game Predictions - Predict any matchup
- 📈 Model Accuracy - Track prediction accuracy
- 🏆 MVP Race - Top candidates
- 👑 Championship Odds - Team probabilities
- 📊 Team Explorer - Stats & injuries
---
## Quick Reference
| Step | Command | Duration |
|------|---------|----------|
| 1 | `python -m src.data_collector` | 2-4 hours |
| 2 | `python -m src.feature_engineering --process` | 30-60 min |
| 3 | `python -m src.preprocessing --build` | 1-2 min |
| 4 | `python -m src.models.game_predictor --train` | 2-5 min |
| 5 | `python -m src.visualization` | 10 sec |
| 6 | `streamlit run app/app.py` | Immediate |
---
## Live Data Features (NEW)
### View Live Scoreboard
```bash
python -m src.live_data_collector
```
Shows today's NBA games with live scores.
### Continuous Learning
```bash
# Ingest completed games
python -m src.continuous_learner --ingest
# Full update cycle (ingest + features + retrain)
python -m src.continuous_learner --update
# Update without retraining
python -m src.continuous_learner --update --no-retrain
```
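The three flag combinations above can be sketched as a small argparse front end (a hypothetical reconstruction of the interface, not the project's actual `continuous_learner` code):

```python
import argparse

def build_parser():
    """CLI mirroring the commands above: --ingest, --update, --no-retrain."""
    parser = argparse.ArgumentParser(prog="continuous_learner")
    parser.add_argument("--ingest", action="store_true",
                        help="pull completed games into the raw data")
    parser.add_argument("--update", action="store_true",
                        help="ingest, rebuild features, then retrain")
    parser.add_argument("--no-retrain", dest="retrain", action="store_false",
                        help="with --update, skip the retraining step")
    return parser
```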
### Check Prediction Accuracy
```bash
python -m src.prediction_tracker
```
Shows accuracy stats from ChromaDB.
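The accuracy stats boil down to comparing each stored prediction against the final result. A sketch of the computation (the record layout is an illustrative assumption; the real tracker reads its records from ChromaDB):

```python
def accuracy_report(records):
    """Summarize tracked predictions: each record holds a predicted
    home-win probability and the actual outcome (1 = home team won)."""
    correct = sum(
        1 for r in records
        if (r["home_win_prob"] >= 0.5) == (r["home_won"] == 1)
    )
    total = len(records)
    return {"total": total, "correct": correct,
            "accuracy": correct / total if total else 0.0}
```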
---
## Data Flow
```
NBA API
   ↓
[Step 1: data_collector.py]
   ↓
data/raw/*.parquet (10+ files)
   ↓
[Step 2: feature_engineering.py]
   ↓
data/processed/game_features.parquet (50+ features)
   ↓
[Step 3: preprocessing.py]
   ↓
data/processed/game_dataset.joblib (train/val/test splits)
   ↓
[Step 4: game_predictor.py]
   ↓
models/game_predictor.joblib (trained ensemble)
   ↓
[Step 6: app.py] → Web Dashboard
   ↓
ChromaDB (prediction tracking)
```
---
## Troubleshooting
### ModuleNotFoundError: No module named 'src'
Ensure you're in the project root directory.
### API Rate Limit Errors
The data collector handles this with exponential backoff. Just let it retry.
### Resume Interrupted Collection
Just run the command again - it has checkpoint capability and will skip completed data.
### ChromaDB Connection Issues
Check your API key in `src/config.py` under `ChromaDBConfig`.