| # NBA ML Prediction System - Process Guide | |
| ## Prerequisites | |
| Before starting, ensure you have: | |
| - Python 3.10+ installed | |
| - Virtual environment activated: `.\venv\Scripts\activate` (Windows) or `source venv/bin/activate` (macOS/Linux) | |
| - All dependencies installed: `pip install -r requirements.txt` | |
| --- | |
| ## Step 1: Collect Training Data (COMPREHENSIVE) | |
| **Purpose**: Fetch 10 seasons of ALL NBA stats from the API including: | |
| - Games, Team Stats, Player Stats (basic) | |
| - Advanced Metrics (NET_RTG, PACE, PIE, TS%, eFG%) | |
| - Clutch Stats (performance in close games) | |
| - Hustle Stats (deflections, charges, loose balls) | |
| - Defense Stats | |
| **File**: `src/data_collector.py` | |
| **Command**: | |
| ```bash | |
| python -m src.data_collector | |
| ``` | |
| **Duration**: ~2-4 hours (has resume capability if interrupted) | |
| **Output Files** (in `data/raw/`): | |
| - `all_games.parquet` - Game results | |
| - `all_team_stats.parquet` - Basic team stats | |
| - `all_team_advanced.parquet` - NET_RTG, PACE, PIE, TS% | |
| - `all_team_clutch.parquet` - Close game performance | |
| - `all_team_hustle.parquet` - Deflections, charges | |
| - `all_team_defense.parquet` - Defensive metrics | |
| - `all_player_stats.parquet` - Player averages | |
| - `all_player_advanced.parquet` - PER, USG%, TS% | |
| - `all_player_clutch.parquet` - Player clutch stats | |
| - `all_player_hustle.parquet` - Player hustle metrics | |
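Because collection can take hours and may be interrupted, it can help to confirm that every file listed above exists before moving on to Step 2. The following is a minimal sketch using only the standard library; the helper name `missing_raw_files` is hypothetical and not part of the project, while the file names are copied from the list above.

```python
from pathlib import Path

# File names copied from the Step 1 output list.
EXPECTED_RAW_FILES = [
    "all_games.parquet",
    "all_team_stats.parquet",
    "all_team_advanced.parquet",
    "all_team_clutch.parquet",
    "all_team_hustle.parquet",
    "all_team_defense.parquet",
    "all_player_stats.parquet",
    "all_player_advanced.parquet",
    "all_player_clutch.parquet",
    "all_player_hustle.parquet",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the expected file names that are absent from raw_dir (hypothetical helper)."""
    root = Path(raw_dir)
    return [name for name in EXPECTED_RAW_FILES if not (root / name).is_file()]


missing = missing_raw_files()
if missing:
    print("Missing raw files:", ", ".join(missing))
else:
    print("All 10 raw files are present.")
```

Run it from the project root so that the relative path `data/raw` resolves correctly.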
| --- | |
| ## Step 2: Generate Features | |
| **Purpose**: Create ~50+ features including ELO, rolling stats, momentum, rest/fatigue | |
| **File**: `src/feature_engineering.py` | |
| **Command**: | |
| ```bash | |
| python -m src.feature_engineering --process | |
| ``` | |
| **Duration**: ~30-60 minutes | |
| **Output Files**: | |
| - `data/processed/game_features.parquet` | |
| **Features Generated**: | |
| - ELO ratings (team_elo, opponent_elo, elo_diff, elo_win_prob) | |
| - Rolling stats (PTS/AST/REB/FG_PCT last 5/10/20 games) | |
| - Defensive stats (STL, BLK, DREB rolling) | |
| - Momentum (wins_last5, hot_streak, cold_streak, plus_minus) | |
| - Rest/fatigue (days_rest, back_to_back, games_last_week) | |
| - Season averages (all stats) | |
| - Team advanced metrics (NET_RTG, PACE, clutch, hustle) | |
| - Player aggregations (top players avg, star concentration) | |
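To make the ELO features concrete, here is a sketch of how `elo_diff` and `elo_win_prob` are conventionally derived from two ratings. It assumes the standard Elo formula with a 400-point scale and a K-factor of 20; the actual constants and any home-court adjustment used in `src/feature_engineering.py` are not documented here, so treat the function names and defaults as illustrative.

```python
def elo_win_prob(team_elo, opponent_elo, scale=400.0):
    """Standard Elo expected score for the team (base-10 logistic curve)."""
    elo_diff = team_elo - opponent_elo
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))


def elo_update(team_elo, opponent_elo, team_won, k=20.0):
    """Return both ratings after one game. k=20 is an assumed K-factor."""
    expected = elo_win_prob(team_elo, opponent_elo)
    delta = k * ((1.0 if team_won else 0.0) - expected)
    # Elo is zero-sum: what one team gains, the other loses.
    return team_elo + delta, opponent_elo - delta


# Two evenly rated teams: the win probability is exactly 0.5.
print(elo_win_prob(1500, 1500))
# A 100-point favourite wins roughly 64% of the time under this formula.
print(round(elo_win_prob(1600, 1500), 3))
# After an even matchup, the winner gains k/2 points.
print(elo_update(1500, 1500, team_won=True))
```

To avoid leakage, a game's features would use the ratings as they stood before that game, with the update applied only afterwards.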
| --- | |
| ## Step 3: Build Dataset | |
| **Purpose**: Split data into train/val/test and prepare for training | |
| **File**: `src/preprocessing.py` | |
| **Command**: | |
| ```bash | |
| python -m src.preprocessing --build | |
| ``` | |
| **Output Files**: | |
| - `data/processed/game_dataset.joblib` | |
| **What It Does**: | |
| - Automatically detects ALL numeric features | |
| - Splits by season, so games from one season never appear in more than one split (prevents temporal data leakage) | |
| - Imputes missing values and scales features | |
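The season-based split can be sketched as follows. This is an illustration only: the `season` key, the season labels, and which seasons go to validation and test are assumptions, not the values used by `src/preprocessing.py`.

```python
def split_by_season(rows, val_seasons, test_seasons):
    """Partition game rows into train/val/test by their 'season' value."""
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        if row["season"] in test_seasons:
            splits["test"].append(row)
        elif row["season"] in val_seasons:
            splits["val"].append(row)
        else:
            splits["train"].append(row)  # every other season is training data
    return splits


# Toy rows; real rows would carry the feature columns as well.
games = [
    {"game_id": 1, "season": "2021-22"},
    {"game_id": 2, "season": "2022-23"},
    {"game_id": 3, "season": "2023-24"},
    {"game_id": 4, "season": "2021-22"},
]
splits = split_by_season(games, val_seasons={"2022-23"}, test_seasons={"2023-24"})
print({name: [g["game_id"] for g in part] for name, part in splits.items()})
```

Splitting on whole seasons, rather than shuffling individual games, keeps later games out of the training set. For the same reason, imputation and scaling statistics are usually computed from the training split only and then applied to the other two.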
| --- | |
| ## Step 4: Train Model | |
| **Purpose**: Train XGBoost + LightGBM ensemble on ALL features | |
| **File**: `src/models/game_predictor.py` | |
| **Command**: | |
| ```bash | |
| python -m src.models.game_predictor --train | |
| ``` | |
| **Expected Output**: | |
| ``` | |
| Loading dataset... | |
| Training XGBoost model... | |
| Training LightGBM model... | |
| Training complete! | |
| === Test Metrics === | |
| Test Accuracy: 0.67XX | |
| Test Brier Score: 0.21XX | |
| ✓ Target accuracy (>65%) achieved! | |
| === Top Features === | |
| feature xgb_importance lgb_importance avg_importance | |
| 0 elo_diff 0.XXX 0.XXX 0.XXX | |
| 1 elo_win_prob 0.XXX 0.XXX 0.XXX | |
| ... | |
| Saved model to models/game_predictor.joblib | |
| ``` | |
| **Output Files**: | |
| - `models/game_predictor.joblib` | |
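For readers unfamiliar with the two test metrics, the sketch below shows how accuracy and the Brier score are computed from predicted home-win probabilities. It also assumes the ensemble is a simple average of the XGBoost and LightGBM probabilities. That is an assumption: this guide does not say how `game_predictor.py` combines the two models. All function names are illustrative.

```python
def ensemble_prob(xgb_prob, lgb_prob):
    """Assumed combination rule: unweighted mean of the two model probabilities."""
    return (xgb_prob + lgb_prob) / 2.0


def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of games where the thresholded probability matches the 0/1 outcome."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return correct / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


# Made-up probabilities for three games, purely for illustration.
xgb = [0.7, 0.4, 0.6]
lgb = [0.5, 0.2, 0.8]
outcomes = [1, 0, 0]

probs = [ensemble_prob(a, b) for a, b in zip(xgb, lgb)]
print("accuracy:", accuracy(probs, outcomes))
print("brier:", brier_score(probs, outcomes))
```

As a reference point, always predicting 0.5 yields a Brier score of exactly 0.25, so a score near 0.21 indicates probabilities that are somewhat better calibrated than a coin flip.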
| --- | |
| ## Step 5: Generate Visualizations | |
| **Purpose**: Create analysis charts saved to `graphs/` | |
| **File**: `src/visualization.py` | |
| **Command**: | |
| ```bash | |
| python -m src.visualization | |
| ``` | |
| **Output Files** (in `graphs/`): | |
| - `mvp_race.png` | |
| - `mvp_stat_comparison.png` | |
| - `championship_odds_pie.png` | |
| - `strength_vs_experience.png` | |
| --- | |
| ## Step 6: Run the Dashboard | |
| **Purpose**: Launch Streamlit web interface | |
| **File**: `app/app.py` | |
| **Command**: | |
| ```bash | |
| streamlit run app/app.py | |
| ``` | |
| **Opens**: `http://localhost:8501` | |
| **Pages**: | |
| - Live Games - Real-time scores with predictions | |
| - Game Predictions - Predict any matchup | |
| - Model Accuracy - Track prediction accuracy | |
| - MVP Race - Top candidates | |
| - Championship Odds - Team probabilities | |
| - Team Explorer - Stats & injuries | |
| --- | |
| ## Quick Reference | |
| | Step | Command | Duration | | |
| |------|---------|----------| | |
| | 1 | `python -m src.data_collector` | 2-4 hours | | |
| | 2 | `python -m src.feature_engineering --process` | 30-60 min | | |
| | 3 | `python -m src.preprocessing --build` | 1-2 min | | |
| | 4 | `python -m src.models.game_predictor --train` | 2-5 min | | |
| | 5 | `python -m src.visualization` | 10 sec | | |
| | 6 | `streamlit run app/app.py` | Immediate | | |
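If you prefer to run Steps 1 to 5 unattended, the commands in the table can be chained from a small driver script. This is a sketch, not a script that ships with the project: `run_pipeline` is a hypothetical helper, and Step 6 is left out because the dashboard runs until you stop it.

```python
import subprocess
import sys

# Module arguments copied from the Quick Reference table (Steps 1-5).
PIPELINE_STEPS = [
    ["-m", "src.data_collector"],
    ["-m", "src.feature_engineering", "--process"],
    ["-m", "src.preprocessing", "--build"],
    ["-m", "src.models.game_predictor", "--train"],
    ["-m", "src.visualization"],
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each step in order; check=True stops at the first non-zero exit code."""
    for args in steps:
        # sys.executable keeps every step on the active (virtualenv) interpreter.
        cmd = [sys.executable, *args]
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)
    return len(steps)
```

Calling `run_pipeline()` from the project root executes the five steps in sequence. The `runner` parameter exists only so the function can be exercised without launching the real commands.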
| --- | |
| ## Live Data Features (NEW) | |
| ### View Live Scoreboard | |
| ```bash | |
| python -m src.live_data_collector | |
| ``` | |
| Shows today's NBA games with live scores. | |
| ### Continuous Learning | |
| ```bash | |
| # Ingest completed games | |
| python -m src.continuous_learner --ingest | |
| # Full update cycle (ingest + features + retrain) | |
| python -m src.continuous_learner --update | |
| # Update without retraining | |
| python -m src.continuous_learner --update --no-retrain | |
| ``` | |
| ### Check Prediction Accuracy | |
| ```bash | |
| python -m src.prediction_tracker | |
| ``` | |
| Shows accuracy stats from ChromaDB. | |
| --- | |
| ## Data Flow | |
| ``` | |
| NBA API | |
| ↓ | |
| [Step 1: data_collector.py] | |
| ↓ | |
| data/raw/*.parquet (10+ files) | |
| ↓ | |
| [Step 2: feature_engineering.py] | |
| ↓ | |
| data/processed/game_features.parquet (~50+ features) | |
| ↓ | |
| [Step 3: preprocessing.py] | |
| ↓ | |
| data/processed/game_dataset.joblib (train/val/test splits) | |
| ↓ | |
| [Step 4: game_predictor.py] | |
| ↓ | |
| models/game_predictor.joblib (trained ensemble) | |
| ↓ | |
| [Step 6: app.py] → Web Dashboard | |
| ↓ | |
| ChromaDB (prediction tracking) | |
| ``` | |
| --- | |
| ## Troubleshooting | |
| ### ModuleNotFoundError: No module named 'src' | |
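| Run the commands from the project root directory (the one that contains `src/`). `python -m src.<module>` resolves the `src` package relative to the current working directory. | |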
| ### API Rate Limit Errors | |
| The data collector handles this with exponential backoff. Just let it retry. | |
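For context, exponential backoff means doubling the wait after each failed attempt. The sketch below illustrates the general technique; it is not the collector's actual code, and the function name, retry count, and delays are assumptions.

```python
import random
import time


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:  # real code would catch only rate-limit/timeout errors
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))


# Demo with a simulated endpoint that fails twice, then succeeds.
# The waits are recorded instead of actually slept.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited (simulated)")
    return {"games": 3}


waits = []
result = call_with_backoff(flaky_fetch, sleep=waits.append)
print(result, [round(w, 2) for w in waits])
```

With the assumed defaults the waits grow as roughly 1 s, 2 s, 4 s, 8 s, and 16 s, which is why a rate-limited run can appear stalled for a while before it continues.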
| ### Resume Interrupted Collection | |
| Run `python -m src.data_collector` again. The collector checkpoints its progress and skips data it has already collected. | |
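The general idea behind resumable collection can be sketched as below. The collector's real checkpoint format and granularity are not documented in this guide, so the JSON file, the per-season unit of work, and both function names are assumptions made for illustration.

```python
import json
from pathlib import Path


def load_done(checkpoint_path):
    """Return the set of units already collected (empty if no checkpoint exists)."""
    path = Path(checkpoint_path)
    return set(json.loads(path.read_text())) if path.exists() else set()


def collect_with_checkpoint(seasons, fetch_season, checkpoint_path):
    """Fetch each season once, recording progress after every successful fetch."""
    done = load_done(checkpoint_path)
    for season in seasons:
        if season in done:
            continue  # collected in an earlier run: skip
        fetch_season(season)  # may raise; the checkpoint then still lists earlier seasons
        done.add(season)
        Path(checkpoint_path).write_text(json.dumps(sorted(done)))
    return done
```

Because the checkpoint is rewritten after each season, an interruption costs at most the season that was in flight. Rerunning with the same checkpoint path picks up from there.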
| ### ChromaDB Connection Issues | |
| Check your API key in `src/config.py` under `ChromaDBConfig`. | |