# NBA ML Prediction System - Process Guide

## Prerequisites

Before starting, ensure you have:
- Python 3.10+ installed
- Virtual environment activated: `.\venv\Scripts\activate` (Windows) or `source venv/bin/activate` (macOS/Linux)
- All dependencies installed: `pip install -r requirements.txt`

---

## Step 1: Collect Training Data (COMPREHENSIVE)

**Purpose**: Fetch 10 seasons of NBA stats from the API, including:
- Games, Team Stats, Player Stats (basic)
- Advanced Metrics (NET_RTG, PACE, PIE, TS%, eFG%)
- Clutch Stats (performance in close games)
- Hustle Stats (deflections, charges, loose balls)
- Defense Stats

**File**: `src/data_collector.py`

**Command**:
```bash
python -m src.data_collector
```

**Duration**: ~2-4 hours (the collector can resume if interrupted)

**Output Files** (in `data/raw/`):
- `all_games.parquet` - Game results
- `all_team_stats.parquet` - Basic team stats
- `all_team_advanced.parquet` - NET_RTG, PACE, PIE, TS%
- `all_team_clutch.parquet` - Close game performance
- `all_team_hustle.parquet` - Deflections, charges
- `all_team_defense.parquet` - Defensive metrics
- `all_player_stats.parquet` - Player averages
- `all_player_advanced.parquet` - PER, USG%, TS%
- `all_player_clutch.parquet` - Player clutch stats
- `all_player_hustle.parquet` - Player hustle metrics
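
After a long collection run it can help to confirm that every file in the list above was written. The following is a minimal sketch using only the standard library; the helper name `missing_raw_files` is hypothetical and not part of the project.

```python
from pathlib import Path

# File names copied from the Step 1 output list above.
EXPECTED_RAW_FILES = [
    "all_games.parquet",
    "all_team_stats.parquet",
    "all_team_advanced.parquet",
    "all_team_clutch.parquet",
    "all_team_hustle.parquet",
    "all_team_defense.parquet",
    "all_player_stats.parquet",
    "all_player_advanced.parquet",
    "all_player_clutch.parquet",
    "all_player_hustle.parquet",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the expected Step 1 files that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_RAW_FILES if not (raw_path / name).is_file()]
```

Calling `missing_raw_files()` from the project root returns an empty list once collection is complete; any names it returns point at stat groups that still need to be fetched.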

---

## Step 2: Generate Features

**Purpose**: Create 50+ features, including ELO ratings, rolling stats, momentum, and rest/fatigue indicators

**File**: `src/feature_engineering.py`

**Command**:
```bash
python -m src.feature_engineering --process
```

**Duration**: ~30-60 minutes

**Output Files**:
- `data/processed/game_features.parquet`

**Features Generated**:
- ELO ratings (team_elo, opponent_elo, elo_diff, elo_win_prob)
- Rolling stats (PTS/AST/REB/FG_PCT over the last 5, 10, and 20 games)
- Defensive stats (STL, BLK, DREB rolling)
- Momentum (wins_last5, hot_streak, cold_streak, plus_minus)
- Rest/fatigue (days_rest, back_to_back, games_last_week)
- Season averages (all stats)
- Team advanced metrics (NET_RTG, PACE, clutch, hustle)
- Player aggregations (top players avg, star concentration)
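
To make two of the feature families above concrete, here is a minimal sketch of an Elo win probability and a pre-game rolling average. It is not the code in `src/feature_engineering.py`: the 400-point scale and the K-factor of 20 are conventional Elo defaults assumed for illustration, and all function names are hypothetical.

```python
from collections import deque


def elo_win_prob(team_elo, opponent_elo):
    """Standard Elo expected score on a 400-point logistic scale (assumed)."""
    elo_diff = team_elo - opponent_elo
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def update_elo(team_elo, opponent_elo, team_won, k=20.0):
    """Return both ratings after one game; k=20 is an assumed K-factor."""
    expected = elo_win_prob(team_elo, opponent_elo)
    delta = k * ((1.0 if team_won else 0.0) - expected)
    return team_elo + delta, opponent_elo - delta


def rolling_pregame_mean(values, window):
    """Mean of up to `window` previous games, computed before each game.

    The current game's value is excluded so the feature never sees the
    result it is meant to help predict. The first game gets None.
    """
    history = deque(maxlen=window)
    means = []
    for value in values:
        means.append(sum(history) / len(history) if history else None)
        history.append(value)
    return means
```

For example, `rolling_pregame_mean([100, 110, 120, 130], window=2)` yields `[None, 100.0, 105.0, 115.0]`: each entry uses only games played earlier.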

---

## Step 3: Build Dataset

**Purpose**: Split data into train/val/test and prepare for training

**File**: `src/preprocessing.py`

**Command**:
```bash
python -m src.preprocessing --build
```

**Duration**: ~1-2 minutes

**Output Files**:
- `data/processed/game_dataset.joblib`

**What It Does**:
- Automatically detects ALL numeric features
- Splits by season, so games from a held-out season never leak into training
- Scales features and imputes missing values
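
The season-based split can be sketched as follows. This is an illustration only: which seasons the project actually holds out is not stated here, and the `season` key, the season labels, and the function name are assumptions.

```python
def split_by_season(rows, val_seasons, test_seasons, season_key="season"):
    """Assign whole seasons to train/val/test so no season spans two splits."""
    val_seasons, test_seasons = set(val_seasons), set(test_seasons)
    overlap = val_seasons & test_seasons
    if overlap:
        raise ValueError(f"Seasons in both val and test: {sorted(overlap)}")

    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        season = row[season_key]
        if season in test_seasons:
            splits["test"].append(row)
        elif season in val_seasons:
            splits["val"].append(row)
        else:
            # Every season not explicitly held out goes to training.
            splits["train"].append(row)
    return splits


# Hypothetical season labels, for illustration only.
games = [
    {"season": "2021-22", "home_win": 1},
    {"season": "2022-23", "home_win": 0},
    {"season": "2023-24", "home_win": 1},
]
splits = split_by_season(games, val_seasons=["2022-23"], test_seasons=["2023-24"])
```

Splitting on whole seasons rather than on random rows keeps later games out of the training set, which is what the "no data leakage" guarantee depends on.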

---

## Step 4: Train Model

**Purpose**: Train an XGBoost + LightGBM ensemble on all features

**File**: `src/models/game_predictor.py`

**Command**:
```bash
python -m src.models.game_predictor --train
```

**Duration**: ~2-5 minutes

**Expected Output**:
```
Loading dataset...
Training XGBoost model...
Training LightGBM model...
Training complete!

=== Test Metrics ===
Test Accuracy: 0.67XX
Test Brier Score: 0.21XX
✓ Target accuracy (>65%) achieved!

=== Top Features ===
                feature  xgb_importance  lgb_importance  avg_importance
0              elo_diff          0.XXX           0.XXX            0.XXX
1          elo_win_prob          0.XXX           0.XXX            0.XXX
...

Saved model to models/game_predictor.joblib
```

**Output Files**:
- `models/game_predictor.joblib`
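
To help read the metrics in the expected output, the sketch below computes accuracy and the Brier score from two models' win probabilities. How `game_predictor.py` actually combines XGBoost and LightGBM is not documented here; simple probability averaging is an assumption, and the function names are hypothetical.

```python
def average_probs(xgb_probs, lgb_probs):
    """Combine two models by averaging their win probabilities (assumed scheme)."""
    return [(a + b) / 2.0 for a, b in zip(xgb_probs, lgb_probs)]


def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of games where the thresholded prediction matches the result."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    return sum(p == y for p, y in zip(predictions, outcomes)) / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


# Toy probabilities and outcomes, not real model output.
ensemble = average_probs([0.8, 0.4], [0.6, 0.2])
toy_outcomes = [1, 0]
toy_accuracy = accuracy(ensemble, toy_outcomes)
toy_brier = brier_score(ensemble, toy_outcomes)
```

Lower Brier scores are better. A model that always predicts 0.5 scores exactly 0.25, so a test Brier score around 0.21 indicates the probabilities carry real information beyond a coin flip.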

---

## Step 5: Generate Visualizations

**Purpose**: Create analysis charts saved to `graphs/`

**File**: `src/visualization.py`

**Command**:
```bash
python -m src.visualization
```

**Duration**: ~10 seconds

**Output Files** (in `graphs/`):
- `mvp_race.png`
- `mvp_stat_comparison.png`
- `championship_odds_pie.png`
- `strength_vs_experience.png`

---

## Step 6: Run the Dashboard

**Purpose**: Launch Streamlit web interface

**File**: `app/app.py`

**Command**:
```bash
streamlit run app/app.py
```

**Opens**: `http://localhost:8501`

**Pages**:
- 🔴 Live Games - Real-time scores with predictions
- 🎮 Game Predictions - Predict any matchup
- 📈 Model Accuracy - Track prediction accuracy
- 🏆 MVP Race - Top candidates
- 👑 Championship Odds - Team probabilities
- 📊 Team Explorer - Stats & injuries

---

## Quick Reference

| Step | Command | Duration |
|------|---------|----------|
| 1 | `python -m src.data_collector` | 2-4 hours |
| 2 | `python -m src.feature_engineering --process` | 30-60 min |
| 3 | `python -m src.preprocessing --build` | 1-2 min |
| 4 | `python -m src.models.game_predictor --train` | 2-5 min |
| 5 | `python -m src.visualization` | 10 sec |
| 6 | `streamlit run app/app.py` | Immediate |
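
The batch steps in the table can be chained from one script. The following is a sketch, not a script shipped with the project: `build_commands` and `run_pipeline` are hypothetical helpers, and only the module names and flags come from the table above.

```python
import subprocess
import sys

# Module invocations copied from the Quick Reference table (Steps 1-5).
# Step 6 (`streamlit run app/app.py`) is left out because it starts a
# long-running server rather than a batch job.
PIPELINE_STEPS = [
    ["-m", "src.data_collector"],
    ["-m", "src.feature_engineering", "--process"],
    ["-m", "src.preprocessing", "--build"],
    ["-m", "src.models.game_predictor", "--train"],
    ["-m", "src.visualization"],
]


def build_commands(start_step=1):
    """Return the commands for Steps start_step..5, using the current interpreter."""
    if not 1 <= start_step <= len(PIPELINE_STEPS):
        raise ValueError(f"start_step must be between 1 and {len(PIPELINE_STEPS)}")
    return [[sys.executable, *args] for args in PIPELINE_STEPS[start_step - 1:]]


def run_pipeline(start_step=1, dry_run=False):
    """Run the steps in order; check=True stops at the first failing step."""
    for command in build_commands(start_step):
        print("Running:", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

Run from the project root, `run_pipeline(start_step=2)` would skip the multi-hour data collection and rebuild everything downstream of the raw files; `run_pipeline(dry_run=True)` only prints the commands.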

---

## Live Data Features (NEW)

### View Live Scoreboard
```bash
python -m src.live_data_collector
```
Shows today's NBA games with live scores.

### Continuous Learning
```bash
# Ingest completed games
python -m src.continuous_learner --ingest

# Full update cycle (ingest + features + retrain)
python -m src.continuous_learner --update

# Update without retraining
python -m src.continuous_learner --update --no-retrain
```

### Check Prediction Accuracy
```bash
python -m src.prediction_tracker
```
Shows accuracy stats from ChromaDB.

---

## Data Flow

```
NBA API
   ↓
[Step 1: data_collector.py]
   ↓
data/raw/*.parquet (10+ files)
   ↓
[Step 2: feature_engineering.py]
   ↓
data/processed/game_features.parquet (50+ features)
   ↓
[Step 3: preprocessing.py]
   ↓
data/processed/game_dataset.joblib (train/val/test splits)
   ↓
[Step 4: game_predictor.py]
   ↓
models/game_predictor.joblib (trained ensemble)
   ↓
[Step 6: app.py] → Web Dashboard
   ↓
ChromaDB (prediction tracking)
```

---

## Troubleshooting

### ModuleNotFoundError: No module named 'src'
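Run the commands from the project root directory. `python -m` looks up the `src` package relative to the current working directory, so the package is not found when a command is started from another folder.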

### API Rate Limit Errors
The data collector retries automatically with exponential backoff, so no action is needed; let it keep running.
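
For readers unfamiliar with the technique, exponential backoff can be sketched as below. This is a generic illustration, not the collector's actual code: the function name, retry count, and delay values are all assumptions.

```python
import random
import time


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fetch(), doubling the wait after each failure up to max_delay."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            # A real collector would catch only rate-limit or network errors.
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits grow roughly as 1, 2, 4, 8, and 16 seconds before the final attempt, which gives a rate-limited API time to recover.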

### Resume Interrupted Collection
Rerun `python -m src.data_collector`. The collector checkpoints its progress and skips data it has already fetched.
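
The general idea behind checkpointed collection is sketched below. The collector's real checkpoint format is not documented here; the JSON file, the task labels, and the function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the set of task names already completed (empty on first run)."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return set()
    return set(json.loads(checkpoint.read_text()))


def save_checkpoint(path, done):
    """Persist completed task names as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(done)))


def collect_with_resume(tasks, fetch, checkpoint_path):
    """Fetch each task once, recording progress after every success."""
    done = load_checkpoint(checkpoint_path)
    for task in tasks:
        if task in done:
            continue  # finished in an earlier run
        fetch(task)
        done.add(task)
        save_checkpoint(checkpoint_path, done)
    return done
```

Because progress is saved after every task, an interruption loses at most the task that was in flight, and the next run picks up from there.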

### ChromaDB Connection Issues
Check your API key in `src/config.py` under `ChromaDBConfig`.