# NBA ML Prediction System - Process Guide

## Prerequisites

Before starting, ensure you have:
- Python 3.10+ installed
- Virtual environment activated: `.\venv\Scripts\activate`
- All dependencies installed: `pip install -r requirements.txt`

---

## Step 1: Collect Training Data (COMPREHENSIVE)

**Purpose**: Fetch 10 seasons of ALL NBA stats from the API including:
- Games, Team Stats, Player Stats (basic)
- Advanced Metrics (NET_RTG, PACE, PIE, TS%, eFG%)
- Clutch Stats (performance in close games)
- Hustle Stats (deflections, charges, loose balls)
- Defense Stats

**File**: `src/data_collector.py`

**Command**:
```bash
python -m src.data_collector
```

**Duration**: ~2-4 hours (resumes from a checkpoint if interrupted)

**Output Files** (in `data/raw/`):
- `all_games.parquet` - Game results
- `all_team_stats.parquet` - Basic team stats
- `all_team_advanced.parquet` - NET_RTG, PACE, PIE, TS%
- `all_team_clutch.parquet` - Close game performance
- `all_team_hustle.parquet` - Deflections, charges
- `all_team_defense.parquet` - Defensive metrics
- `all_player_stats.parquet` - Player averages
- `all_player_advanced.parquet` - PER, USG%, TS%
- `all_player_clutch.parquet` - Player clutch stats
- `all_player_hustle.parquet` - Player hustle metrics

---

## Step 2: Generate Features

**Purpose**: Create 50+ features including ELO, rolling stats, momentum, rest/fatigue

**File**: `src/feature_engineering.py`

**Command**:
```bash
python -m src.feature_engineering --process
```

**Duration**: ~30-60 minutes

**Output Files**:
- `data/processed/game_features.parquet`

**Features Generated**:
- ELO ratings (team_elo, opponent_elo, elo_diff, elo_win_prob)
- Rolling stats (PTS/AST/REB/FG_PCT last 5/10/20 games)
- Defensive stats (STL, BLK, DREB rolling)
- Momentum (wins_last5, hot_streak, cold_streak, plus_minus)
- Rest/fatigue (days_rest, back_to_back, games_last_week)
- Season averages (all stats)
- Team advanced metrics (NET_RTG, PACE, clutch, hustle)
- Player aggregations (top players avg, star concentration)

---

## Step 3: Build Dataset

**Purpose**: Split data into train/val/test and prepare for training

**File**: `src/preprocessing.py`

**Command**:
```bash
python -m src.preprocessing --build
```

**Output Files**:
- `data/processed/game_dataset.joblib`

**What It Does**:
- Automatically detects ALL numeric features
- Splits by season (no data leakage)
- Scales and imputes missing values
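
Splitting by season rather than at random is what prevents leakage: no game from a later season can influence training. A minimal sketch of the idea (the record layout and field names here are illustrative):

```python
def split_by_season(rows, train_until, val_until):
    """Chronological split: train <= train_until < val <= val_until < test."""
    train = [r for r in rows if r["season"] <= train_until]
    val = [r for r in rows if train_until < r["season"] <= val_until]
    test = [r for r in rows if r["season"] > val_until]
    return train, val, test
```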

---

## Step 4: Train Model

**Purpose**: Train XGBoost + LightGBM ensemble on ALL features

**File**: `src/models/game_predictor.py`

**Command**:
```bash
python -m src.models.game_predictor --train
```

**Expected Output**:
```
Loading dataset...
Training XGBoost model...
Training LightGBM model...
Training complete!

=== Test Metrics ===
Test Accuracy: 0.67XX
Test Brier Score: 0.21XX
✓ Target accuracy (>65%) achieved!

=== Top Features ===
                feature  xgb_importance  lgb_importance  avg_importance
0              elo_diff          0.XXX           0.XXX            0.XXX
1          elo_win_prob          0.XXX           0.XXX            0.XXX
...

Saved model to models/game_predictor.joblib
```

**Output Files**:
- `models/game_predictor.joblib`
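
Conceptually, the ensemble blends the two models' per-game win probabilities, and the Brier score reported above is the mean squared error of those probabilities against the 0/1 outcomes. A minimal sketch (the equal 50/50 blend weight is an assumption, not necessarily what the trainer uses):

```python
def ensemble_proba(xgb_probs, lgb_probs, w_xgb=0.5):
    """Weighted average of the two models' home-win probabilities."""
    return [w_xgb * x + (1.0 - w_xgb) * l for x, l in zip(xgb_probs, lgb_probs)]

def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 result (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```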

---

## Step 5: Generate Visualizations

**Purpose**: Create analysis charts saved to `graphs/`

**File**: `src/visualization.py`

**Command**:
```bash
python -m src.visualization
```

**Output Files** (in `graphs/`):
- `mvp_race.png`
- `mvp_stat_comparison.png`
- `championship_odds_pie.png`
- `strength_vs_experience.png`

---

## Step 6: Run the Dashboard

**Purpose**: Launch Streamlit web interface

**File**: `app/app.py`

**Command**:
```bash
streamlit run app/app.py
```

**Opens**: `http://localhost:8501`

**Pages**:
- 🔴 Live Games - Real-time scores with predictions
- 🎮 Game Predictions - Predict any matchup
- 📈 Model Accuracy - Track prediction accuracy
- 🏆 MVP Race - Top candidates
- 👑 Championship Odds - Team probabilities
- 📊 Team Explorer - Stats & injuries

---

## Quick Reference

| Step | Command | Duration |
|------|---------|----------|
| 1 | `python -m src.data_collector` | 2-4 hours |
| 2 | `python -m src.feature_engineering --process` | 30-60 min |
| 3 | `python -m src.preprocessing --build` | 1-2 min |
| 4 | `python -m src.models.game_predictor --train` | 2-5 min |
| 5 | `python -m src.visualization` | 10 sec |
| 6 | `streamlit run app/app.py` | Immediate |

---

## Live Data Features (NEW)

### View Live Scoreboard
```bash
python -m src.live_data_collector
```
Shows today's NBA games with live scores.

### Continuous Learning
```bash
# Ingest completed games
python -m src.continuous_learner --ingest

# Full update cycle (ingest + features + retrain)
python -m src.continuous_learner --update

# Update without retraining
python -m src.continuous_learner --update --no-retrain
```

### Check Prediction Accuracy
```bash
python -m src.prediction_tracker
```
Shows accuracy stats from ChromaDB.
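
Conceptually, the tracker grades each stored prediction once the final score is known and reports the hit rate. A minimal sketch of that accounting (the record fields are illustrative, not the actual ChromaDB schema):

```python
def accuracy_report(records):
    """records: dicts with 'predicted_winner' and 'actual_winner' (None if ungraded)."""
    graded = [r for r in records if r.get("actual_winner") is not None]
    if not graded:
        return {"graded": 0, "accuracy": None}
    correct = sum(r["predicted_winner"] == r["actual_winner"] for r in graded)
    return {"graded": len(graded), "accuracy": correct / len(graded)}
```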

---

## Data Flow

```
NBA API
   โ†“
[Step 1: data_collector.py]
   โ†“
data/raw/*.parquet (10+ files)
   โ†“
[Step 2: feature_engineering.py]
   โ†“
data/processed/game_features.parquet (~50+ features)
   โ†“
[Step 3: preprocessing.py]
   โ†“
data/processed/game_dataset.joblib (train/val/test splits)
   โ†“
[Step 4: game_predictor.py]
   โ†“
models/game_predictor.joblib (trained ensemble)
   โ†“
[Step 6: app.py] โ†’ Web Dashboard
   โ†“
ChromaDB (prediction tracking)
```

---

## Troubleshooting

### ModuleNotFoundError: No module named 'src'
Ensure you're in the project root directory.

### API Rate Limit Errors
The data collector handles this with exponential backoff. Just let it retry.
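
Exponential backoff simply doubles the wait between attempts, with a little jitter so parallel retries don't stampede the API. Roughly what the collector does (the retry count and delays here are illustrative assumptions):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(), doubling the sleep (plus jitter) after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```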

### Resume Interrupted Collection
Just run the command again - it has checkpoint capability and will skip completed data.

### ChromaDB Connection Issues
Check your API key in `src/config.py` under `ChromaDBConfig`.