# NBA ML Prediction System - Process Guide

## Prerequisites

Before starting, ensure you have:

- Python 3.10+ installed
- Virtual environment activated: `.\venv\Scripts\activate`
- All dependencies installed: `pip install -r requirements.txt`

---

## Step 1: Collect Training Data (COMPREHENSIVE)

**Purpose**: Fetch 10 seasons of ALL NBA stats from the API including:

- Games, Team Stats, Player Stats (basic)
- Advanced Metrics (NET_RTG, PACE, PIE, TS%, eFG%)
- Clutch Stats (performance in close games)
- Hustle Stats (deflections, charges, loose balls)
- Defense Stats

**File**: `src/data_collector.py`

**Command**:

```bash
python -m src.data_collector
```

**Duration**: ~2-4 hours (has resume capability if interrupted)

**Output Files** (in `data/raw/`):

- `all_games.parquet` - Game results
- `all_team_stats.parquet` - Basic team stats
- `all_team_advanced.parquet` - NET_RTG, PACE, PIE, TS%
- `all_team_clutch.parquet` - Close game performance
- `all_team_hustle.parquet` - Deflections, charges
- `all_team_defense.parquet` - Defensive metrics
- `all_player_stats.parquet` - Player averages
- `all_player_advanced.parquet` - PER, USG%, TS%
- `all_player_clutch.parquet` - Player clutch stats
- `all_player_hustle.parquet` - Player hustle metrics

---

## Step 2: Generate Features

**Purpose**: Create ~50+ features including ELO, rolling stats, momentum, rest/fatigue

**File**: `src/feature_engineering.py`

**Command**:

```bash
python -m src.feature_engineering --process
```

**Duration**: ~30-60 minutes

**Output Files**:

- `data/processed/game_features.parquet`

**Features Generated**:

- ELO ratings (team_elo, opponent_elo, elo_diff, elo_win_prob)
- Rolling stats (PTS/AST/REB/FG_PCT last 5/10/20 games)
- Defensive stats (STL, BLK, DREB rolling)
- Momentum (wins_last5, hot_streak, cold_streak, plus_minus)
- Rest/fatigue (days_rest, back_to_back, games_last_week)
- Season averages (all stats)
- Team advanced metrics (NET_RTG, PACE, clutch, hustle)
- Player aggregations (top players avg, star concentration)

---

## Step 3: Build Dataset

**Purpose**: Split data into train/val/test and prepare for training

**File**: `src/preprocessing.py`

**Command**:

```bash
python -m src.preprocessing --build
```

**Output Files**:

- `data/processed/game_dataset.joblib`

**What It Does**:

- Automatically detects ALL numeric features
- Splits by season (no data leakage)
- Scales and imputes missing values

---

## Step 4: Train Model

**Purpose**: Train XGBoost + LightGBM ensemble on ALL features

**File**: `src/models/game_predictor.py`

**Command**:

```bash
python -m src.models.game_predictor --train
```

**Expected Output**:

```
Loading dataset...
Training XGBoost model...
Training LightGBM model...
Training complete!

=== Test Metrics ===
Test Accuracy: 0.67XX
Test Brier Score: 0.21XX

✓ Target accuracy (>65%) achieved!

=== Top Features ===
        feature  xgb_importance  lgb_importance  avg_importance
0      elo_diff           0.XXX           0.XXX           0.XXX
1  elo_win_prob           0.XXX           0.XXX           0.XXX
...

Saved model to models/game_predictor.joblib
```

**Output Files**:

- `models/game_predictor.joblib`

---

## Step 5: Generate Visualizations

**Purpose**: Create analysis charts saved to `graphs/`

**File**: `src/visualization.py`

**Command**:

```bash
python -m src.visualization
```

**Output Files** (in `graphs/`):

- `mvp_race.png`
- `mvp_stat_comparison.png`
- `championship_odds_pie.png`
- `strength_vs_experience.png`

---

## Step 6: Run the Dashboard

**Purpose**: Launch Streamlit web interface

**File**: `app/app.py`

**Command**:

```bash
streamlit run app/app.py
```

**Opens**: `http://localhost:8501`

**Pages**:

- 🔴 Live Games - Real-time scores with predictions
- 🎮 Game Predictions - Predict any matchup
- 📈 Model Accuracy - Track prediction accuracy
- 🏆 MVP Race - Top candidates
- 👑 Championship Odds - Team probabilities
- 📊 Team Explorer - Stats & injuries

---

## Quick Reference

| Step | Command | Duration |
|------|---------|----------|
| 1 | `python -m src.data_collector` | 2-4 hours |
| 2 | `python -m src.feature_engineering --process` | 30-60 min |
| 3 | `python -m src.preprocessing --build` | 1-2 min |
| 4 | `python -m src.models.game_predictor --train` | 2-5 min |
| 5 | `python -m src.visualization` | 10 sec |
| 6 | `streamlit run app/app.py` | Immediate |

---

## Live Data Features (NEW)

### View Live Scoreboard

```bash
python -m src.live_data_collector
```

Shows today's NBA games with live scores.

### Continuous Learning

```bash
# Ingest completed games
python -m src.continuous_learner --ingest

# Full update cycle (ingest + features + retrain)
python -m src.continuous_learner --update

# Update without retraining
python -m src.continuous_learner --update --no-retrain
```

### Check Prediction Accuracy

```bash
python -m src.prediction_tracker
```

Shows accuracy stats from ChromaDB.

---

## Data Flow

```
NBA API
    ↓
[Step 1: data_collector.py]
    ↓
data/raw/*.parquet (10+ files)
    ↓
[Step 2: feature_engineering.py]
    ↓
data/processed/game_features.parquet (~50+ features)
    ↓
[Step 3: preprocessing.py]
    ↓
data/processed/game_dataset.joblib (train/val/test splits)
    ↓
[Step 4: game_predictor.py]
    ↓
models/game_predictor.joblib (trained ensemble)
    ↓
[Step 6: app.py] → Web Dashboard
    ↓
ChromaDB (prediction tracking)
```

---

## Troubleshooting

### ModuleNotFoundError: No module named 'src'

Ensure you're in the project root directory.

### API Rate Limit Errors

The data collector handles this with exponential backoff. Just let it retry.

### Resume Interrupted Collection

Just run the command again - it has checkpoint capability and will skip completed data.

### ChromaDB Connection Issues

Check your API key in `src/config.py` under `ChromaDBConfig`.
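
### Sketch: Exponential Backoff

The rate-limit note above says the data collector retries with exponential backoff, but this guide does not show that code. The following is a minimal sketch of the general technique, not the project's actual implementation. The names `retry_with_backoff`, `fetch`, `max_attempts`, `base_delay`, `max_delay`, and `sleep` are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on failure with exponentially growing delays.

    Illustrative only: all names and defaults here are assumptions,
    not taken from src/data_collector.py.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            # A real collector would likely catch narrower errors
            # (timeouts, HTTP 429) instead of every Exception.
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # Delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Add up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch() -> str:
        # Hypothetical stand-in for an API request: fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("rate limited")
        return "ok"

    print(retry_with_backoff(flaky_fetch, base_delay=0.01))
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the retry schedule be checked in tests without actually waiting.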