# NBA ML Prediction System - Process Guide
## Prerequisites
Before starting, ensure you have:
- Python 3.10+ installed
- Virtual environment activated: `.\venv\Scripts\activate`
- All dependencies installed: `pip install -r requirements.txt`
---
## Step 1: Collect Training Data (COMPREHENSIVE)
**Purpose**: Fetch 10 seasons of ALL NBA stats from the NBA API, including:
- Games, Team Stats, Player Stats (basic)
- Advanced Metrics (NET_RTG, PACE, PIE, TS%, eFG%)
- Clutch Stats (performance in close games)
- Hustle Stats (deflections, charges, loose balls)
- Defense Stats
**File**: `src/data_collector.py`
**Command**:
```bash
python -m src.data_collector
```
**Duration**: ~2-4 hours (has resume capability if interrupted)
**Output Files** (in `data/raw/`):
- `all_games.parquet` - Game results
- `all_team_stats.parquet` - Basic team stats
- `all_team_advanced.parquet` - NET_RTG, PACE, PIE, TS%
- `all_team_clutch.parquet` - Close game performance
- `all_team_hustle.parquet` - Deflections, charges
- `all_team_defense.parquet` - Defensive metrics
- `all_player_stats.parquet` - Player averages
- `all_player_advanced.parquet` - PER, USG%, TS%
- `all_player_clutch.parquet` - Player clutch stats
- `all_player_hustle.parquet` - Player hustle metrics
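The resume capability mentioned above comes down to a checkpoint pattern: skip any output file that already exists, fetch the rest. A minimal sketch of the idea (the `fetch_season` callable and per-season file layout are illustrative assumptions, not the project's actual API):

```python
from pathlib import Path

def collect_seasons(seasons, out_dir, fetch_season):
    """Fetch each season's stats, skipping files that already exist.

    `fetch_season` is a hypothetical callable returning raw bytes for one
    season; re-running after an interruption resumes where it left off.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for season in seasons:
        target = out_dir / f"games_{season}.parquet"
        if target.exists():          # checkpoint: already collected, skip
            continue
        data = fetch_season(season)  # slow API call (real code retries)
        target.write_bytes(data)     # real code writes a DataFrame instead
        fetched.append(season)
    return fetched
```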
---
## Step 2: Generate Features
**Purpose**: Create 50+ features including ELO ratings, rolling stats, momentum, and rest/fatigue
**File**: `src/feature_engineering.py`
**Command**:
```bash
python -m src.feature_engineering --process
```
**Duration**: ~30-60 minutes
**Output Files**:
- `data/processed/game_features.parquet`
**Features Generated**:
- ELO ratings (team_elo, opponent_elo, elo_diff, elo_win_prob)
- Rolling stats (PTS/AST/REB/FG_PCT last 5/10/20 games)
- Defensive stats (STL, BLK, DREB rolling)
- Momentum (wins_last5, hot_streak, cold_streak, plus_minus)
- Rest/fatigue (days_rest, back_to_back, games_last_week)
- Season averages (all stats)
- Team advanced metrics (NET_RTG, PACE, clutch, hustle)
- Player aggregations (top players avg, star concentration)
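As one example of the feature families above, the ELO features follow the standard logistic rating model. A sketch (the K-factor and scale constants used in `feature_engineering.py` may differ):

```python
def elo_win_prob(team_elo, opp_elo):
    """Expected win probability for the team under the logistic ELO model."""
    return 1.0 / (1.0 + 10 ** ((opp_elo - team_elo) / 400))

def update_elo(team_elo, opp_elo, team_won, k=20):
    """Shift the team's rating toward the game result by at most k points."""
    expected = elo_win_prob(team_elo, opp_elo)
    return team_elo + k * ((1.0 if team_won else 0.0) - expected)
```

`elo_diff` is then simply `team_elo - opponent_elo`, and `elo_win_prob` feeds straight into the feature table.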
---
## Step 3: Build Dataset
**Purpose**: Split data into train/val/test and prepare for training
**File**: `src/preprocessing.py`
**Command**:
```bash
python -m src.preprocessing --build
```
**Output Files**:
- `data/processed/game_dataset.joblib`
**What It Does**:
- Automatically detects ALL numeric features
- Splits by season (no data leakage)
- Scales and imputes missing values
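Splitting by season rather than at random is what prevents leakage: the model never trains on games from a season it is evaluated on. A minimal sketch of the idea (row and column names are illustrative, not the project's schema):

```python
def split_by_season(rows, train_until, val_until):
    """Partition game rows chronologically by season.

    Seasons up to `train_until` train the model, the next block is used
    for validation, and everything later is held out for testing.
    """
    train = [r for r in rows if r["season"] <= train_until]
    val = [r for r in rows if train_until < r["season"] <= val_until]
    test = [r for r in rows if r["season"] > val_until]
    return train, val, test
```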
---
## Step 4: Train Model
**Purpose**: Train XGBoost + LightGBM ensemble on ALL features
**File**: `src/models/game_predictor.py`
**Command**:
```bash
python -m src.models.game_predictor --train
```
**Expected Output**:
```
Loading dataset...
Training XGBoost model...
Training LightGBM model...
Training complete!
=== Test Metrics ===
Test Accuracy: 0.67XX
Test Brier Score: 0.21XX
✓ Target accuracy (>65%) achieved!
=== Top Features ===
feature xgb_importance lgb_importance avg_importance
0 elo_diff 0.XXX 0.XXX 0.XXX
1 elo_win_prob 0.XXX 0.XXX 0.XXX
...
Saved model to models/game_predictor.joblib
```
**Output Files**:
- `models/game_predictor.joblib`
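The ensemble combines the two models' probabilities, and the Brier score in the output above measures how well calibrated those probabilities are. A sketch of both (simple averaging is an assumption; the project may weight XGBoost and LightGBM differently):

```python
def ensemble_prob(xgb_prob, lgb_prob):
    """Average the two models' home-win probabilities."""
    return (xgb_prob + lgb_prob) / 2.0

def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 result.

    Lower is better; 0.25 is the score of always predicting 50%.
    """
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```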
---
## Step 5: Generate Visualizations
**Purpose**: Create analysis charts saved to `graphs/`
**File**: `src/visualization.py`
**Command**:
```bash
python -m src.visualization
```
**Output Files** (in `graphs/`):
- `mvp_race.png`
- `mvp_stat_comparison.png`
- `championship_odds_pie.png`
- `strength_vs_experience.png`
---
## Step 6: Run the Dashboard
**Purpose**: Launch Streamlit web interface
**File**: `app/app.py`
**Command**:
```bash
streamlit run app/app.py
```
**Opens**: `http://localhost:8501`
**Pages**:
- 🔴 Live Games - Real-time scores with predictions
- 🎮 Game Predictions - Predict any matchup
- 📈 Model Accuracy - Track prediction accuracy
- 🏆 MVP Race - Top candidates
- 👑 Championship Odds - Team probabilities
- 📊 Team Explorer - Stats & injuries
---
## Quick Reference
| Step | Command | Duration |
|------|---------|----------|
| 1 | `python -m src.data_collector` | 2-4 hours |
| 2 | `python -m src.feature_engineering --process` | 30-60 min |
| 3 | `python -m src.preprocessing --build` | 1-2 min |
| 4 | `python -m src.models.game_predictor --train` | 2-5 min |
| 5 | `python -m src.visualization` | 10 sec |
| 6 | `streamlit run app/app.py` | Immediate |
---
## Live Data Features (NEW)
### View Live Scoreboard
```bash
python -m src.live_data_collector
```
Shows today's NBA games with live scores.
### Continuous Learning
```bash
# Ingest completed games
python -m src.continuous_learner --ingest
# Full update cycle (ingest + features + retrain)
python -m src.continuous_learner --update
# Update without retraining
python -m src.continuous_learner --update --no-retrain
```
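The three flag combinations above can be sketched as a small argparse front end (a hypothetical reconstruction of the interface, not the project's actual `continuous_learner` code):

```python
import argparse

def build_parser():
    """CLI mirroring the commands above: --ingest, --update, --no-retrain."""
    parser = argparse.ArgumentParser(prog="continuous_learner")
    parser.add_argument("--ingest", action="store_true",
                        help="pull completed games into the raw data")
    parser.add_argument("--update", action="store_true",
                        help="ingest, rebuild features, then retrain")
    parser.add_argument("--no-retrain", dest="retrain", action="store_false",
                        help="with --update, skip the retraining step")
    return parser
```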
### Check Prediction Accuracy
```bash
python -m src.prediction_tracker
```
Shows accuracy stats from ChromaDB.
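The accuracy stats boil down to comparing each stored prediction against the final result. A sketch of the computation (the record layout is an illustrative assumption; the real tracker reads its records from ChromaDB):

```python
def accuracy_report(records):
    """Summarize tracked predictions: each record holds a predicted
    home-win probability and the actual outcome (1 = home team won)."""
    correct = sum(
        1 for r in records
        if (r["home_win_prob"] >= 0.5) == (r["home_won"] == 1)
    )
    total = len(records)
    return {"total": total, "correct": correct,
            "accuracy": correct / total if total else 0.0}
```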
---
## Data Flow
```
NBA API
   ↓
[Step 1: data_collector.py]
   ↓
data/raw/*.parquet (10+ files)
   ↓
[Step 2: feature_engineering.py]
   ↓
data/processed/game_features.parquet (50+ features)
   ↓
[Step 3: preprocessing.py]
   ↓
data/processed/game_dataset.joblib (train/val/test splits)
   ↓
[Step 4: game_predictor.py]
   ↓
models/game_predictor.joblib (trained ensemble)
   ↓
[Step 6: app.py] → Web Dashboard
   ↓
ChromaDB (prediction tracking)
```
---
## Troubleshooting
### ModuleNotFoundError: No module named 'src'
Ensure you're in the project root directory.
### API Rate Limit Errors
The data collector handles this with exponential backoff. Just let it retry.
### Resume Interrupted Collection
Just run the command again - it has checkpoint capability and will skip completed data.
### ChromaDB Connection Issues
Check your API key in `src/config.py` under `ChromaDBConfig`.