shekkari21 committed on
Commit
b34fde9
·
1 Parent(s): ac32153

added readme

Files changed (2)
  1. EXECUTION_FLOW.md +527 -0
  2. README.md +307 -21
EXECUTION_FLOW.md ADDED
@@ -0,0 +1,527 @@
1
+ # Detailed Execution Flow - NBA Analysis Application
2
+
3
+ This document explains, step by step, how user input flows through the application and is executed.
4
+
5
+ ---
6
+
7
+ ## 🎯 High-Level Flow Overview
8
+
9
+ ```
10
+ User Input (CSV + Query)
11
+     ↓
12
+ app.py (Gradio Interface)
13
+     ↓
14
+ crew.py (CrewAI Orchestration)
15
+     ↓
16
+ agents.py (AI Agents)
17
+     ↓
18
+ tasks.py (Task Definitions)
19
+     ↓
20
+ tools.py (Data Access Tools)
21
+     ↓
22
+ vector_db.py / pandas (Data Processing)
23
+     ↓
24
+ config.py (LLM Configuration)
25
+     ↓
26
+ LLM API (Hugging Face / Ollama / etc.)
27
+     ↓
28
+ Results → User
29
+ ```
30
+
31
+ ---
32
+
33
+ ## 📋 Detailed Step-by-Step Execution
34
+
35
+ ### **Phase 1: User Input & Initialization**
36
+
37
+ #### Step 1.1: User Interaction (`app.py`)
38
+ - **File**: `app.py`
39
+ - **Function**: `process_file_and_analyze()` or `process_question_only()`
40
+ - **Input**:
41
+ - CSV file (uploaded via Gradio)
42
+ - User query (optional text)
43
+ - **What happens**:
44
+ ```python
45
+ # Line 23-24: Validate file exists
46
+ if file is None:
47
+     return "Please upload a CSV file."
48
+
49
+ # Line 27-28: Set default query if empty
50
+ if not user_query:
51
+     user_query = "Provide comprehensive analysis..."
52
+
53
+ # Line 32-33: Extract file path
54
+ file_path = file.name
55
+ csv_path = file_path
56
+ ```
57
+
58
+ #### Step 1.2: Crew Creation (`crew.py`)
59
+ - **File**: `crew.py`
60
+ - **Function**: `create_flow_crew(user_query, csv_path)`
61
+ - **What happens**:
62
+ ```python
63
+ # Line 82-84: Create all agents
64
+ engineer_agent = create_engineer_agent(csv_path)
65
+ analyst_agent = create_analyst_agent(csv_path)
66
+ storyteller_agent = create_storyteller_agent()
67
+
68
+ # Line 88-94: Create tasks
69
+ data_engineering_task = create_data_engineering_task(...)
70
+ custom_analysis_task = create_custom_analysis_task(...)
71
+ storyteller_task = create_storyteller_task(...)
72
+
73
+ # Line 99-104: Create Crew with agents and tasks
74
+ return Crew(agents=[...], tasks=[...], process=Process.sequential)
75
+ ```
76
+
77
+ ---
78
+
79
+ ### **Phase 2: Agent Initialization**
80
+
81
+ #### Step 2.1: LLM Configuration (`config.py`)
82
+ - **File**: `config.py`
83
+ - **Function**: `get_llm()`
84
+ - **What happens**:
85
+ ```python
86
+ # Line 13: Check provider (default: "huggingface")
87
+ LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
88
+
89
+ # Line 54-64: Create LLM instance based on provider
90
+ if LLM_PROVIDER == "huggingface":
91
+     return LLM(
92
+         model=f"huggingface/{HF_MODEL}",
93
+         api_key=HF_API_KEY
94
+     )
95
+ # Similar for ollama, openrouter, etc.
96
+ ```
97
+ - **Output**: Configured LLM instance (used by all agents)
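The provider selection in Step 2.1 can be sketched without any LLM library. The helper below is a hypothetical illustration, not the contents of `config.py`: the function name `resolve_llm_settings` and the shape of the returned dictionary are assumptions, while the environment variable names, defaults, and base URLs are the ones listed elsewhere in this document.

```python
import os
from typing import Mapping

# Defaults as documented under "Key Configuration Points".
DEFAULTS = {
    "HF_MODEL": "meta-llama/Llama-3.1-8B-Instruct",
    "OLLAMA_MODEL": "mistral",
    "OLLAMA_BASE_URL": "http://localhost:11434/v1",
    "OPENROUTER_MODEL": "google/gemma-2-2b-it:free",
}


def resolve_llm_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: map environment variables to LLM settings."""
    provider = env.get("LLM_PROVIDER", "huggingface")

    if provider == "huggingface":
        api_key = env.get("HF_API_KEY")
        if not api_key:
            raise ValueError("HF_API_KEY not set")
        model = env.get("HF_MODEL", DEFAULTS["HF_MODEL"])
        return {"model": f"huggingface/{model}", "api_key": api_key}

    if provider == "ollama":
        # A local server needs no API key, only a model name and URL.
        return {
            "model": env.get("OLLAMA_MODEL", DEFAULTS["OLLAMA_MODEL"]),
            "base_url": env.get("OLLAMA_BASE_URL", DEFAULTS["OLLAMA_BASE_URL"]),
        }

    if provider == "openrouter":
        api_key = env.get("OPENROUTER_API_KEY")
        if not api_key:
            raise ValueError("OPENROUTER_API_KEY not set")
        return {
            "model": env.get("OPENROUTER_MODEL", DEFAULTS["OPENROUTER_MODEL"]),
            "base_url": "https://openrouter.ai/api/v1",
            "api_key": api_key,
        }

    raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
```

Passing the environment as a plain mapping keeps the function easy to test. How the real `get_llm()` turns these values into an `LLM(...)` instance for Ollama and OpenRouter is not shown in this document, so the sketch stops at the settings dictionary.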
98
+
99
+ #### Step 2.2: Agent Creation (`agents.py`)
100
+ - **File**: `agents.py`
101
+ - **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
102
+ - **What happens**:
103
+
104
+ **Engineer Agent** (Lines 12-36):
105
+ ```python
106
+ # Line 22-23: Get data path and tools
107
+ data_path = csv_path or NBA_DATA_PATH
108
+ agent_tools = get_agent_tools(data_path)
109
+
110
+ # Line 25-36: Create agent with:
111
+ - role: "Data Engineer"
112
+ - goal: Process and clean data
113
+ - backstory: Expert data engineer description
114
+ - llm: Shared LLM instance
115
+ - tools: Data access tools (read, search, analyze)
116
+ ```
117
+
118
+ **Analyst Agent** (Lines 39-69):
119
+ ```python
120
+ # Similar structure but with:
121
+ - role: "Data Analyst"
122
+ - goal: Extract insights and patterns
123
+ - backstory: Includes instructions to use analyze_nba_data for aggregations
124
+ - tools: Same data tools
125
+ ```
126
+
127
+ **Storyteller Agent** (Lines 72-93):
128
+ ```python
129
+ - role: "Sports Storyteller"
130
+ - goal: Create engaging headlines from analysis
131
+ - tools: [] (no data tools, only uses LLM)
132
+ ```
133
+
134
+ #### Step 2.3: Tools Initialization (`tools.py`)
135
+ - **File**: `tools.py`
136
+ - **Function**: `get_agent_tools(data_path)`
137
+ - **What happens**:
138
+ ```python
139
+ # Returns list of 5 tools:
140
+ 1. read_nba_data(limit) - Read sample rows
141
+ 2. search_nba_data(query, column, value) - Filter/search CSV
142
+ 3. get_nba_data_summary() - Get dataset overview
143
+ 4. semantic_search_nba_data(query) - Vector search
144
+ 5. analyze_nba_data(pandas_code) - Execute pandas operations
145
+ ```
146
+ - **Note**: Each tool is wrapped with `@tool` decorator for CrewAI
147
+
148
+ ---
149
+
150
+ ### **Phase 3: Task Execution**
151
+
152
+ #### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)
153
+ - **File**: `app.py` Line 36-37
154
+ - **What happens**:
155
+ ```python
156
+ crew = create_flow_crew(user_query.strip(), csv_path)
157
+ result = crew.kickoff() # This triggers execution
158
+ ```
159
+
160
+ #### Step 3.2: Task 1 - Data Engineering (`tasks.py`)
161
+ - **File**: `tasks.py` Lines 8-40
162
+ - **Task**: `create_data_engineering_task()`
163
+ - **Agent**: Engineer Agent
164
+ - **Execution Flow**:
165
+ ```
166
+ 1. Engineer Agent receives task description
167
+ 2. LLM processes task: "Examine dataset, get summary..."
168
+ 3. Agent decides to use: get_nba_data_summary()
169
+ 4. Tool execution (tools.py):
170
+ - Reads CSV with pandas
171
+ - Calculates stats (rows, columns, unique values)
172
+ - Returns formatted summary
173
+ 5. LLM receives tool output
174
+ 6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
175
+ 7. Task complete → Output stored
176
+ ```
177
+
178
+ #### Step 3.3: Task 2 - Data Analysis (`tasks.py`)
179
+ - **File**: `tasks.py` Lines 55-95 (create_custom_analysis_task)
180
+ - **Agent**: Analyst Agent
181
+ - **Execution Flow**:
182
+ ```
183
+ 1. Analyst Agent receives user query + task description
184
+ 2. LLM analyzes query: "What does user want?"
185
+ 3. Agent decides which tools to use:
186
+ - For aggregations → analyze_nba_data()
187
+ - For searches → search_nba_data() or semantic_search_nba_data()
188
+ - For overview → get_nba_data_summary()
189
+
190
+ 4. Tool Execution Examples:
191
+
192
+ Example A: "Top 5 three-point shooters"
193
+ - Agent generates pandas code:
194
+ df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
195
+ - analyze_nba_data() executes code
196
+ - Returns DataFrame with results
197
+ - LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
198
+
199
+ Example B: "Find LeBron James games"
200
+ - Agent uses search_nba_data(query="LeBron James")
201
+ - Tool filters CSV, returns matching rows
202
+ - LLM analyzes results, provides insights
203
+
204
+ Example C: "High scoring games"
205
+ - Agent uses semantic_search_nba_data("high scoring games")
206
+ - Vector DB finds semantically similar records
207
+ - Returns top matches with similarity scores
208
+ - LLM provides analysis
209
+
210
+ 5. LLM generates final analysis report
211
+ 6. Task complete → Output stored
212
+ ```
213
+
214
+ #### Step 3.4: Task 3 - Storytelling (`tasks.py`)
215
+ - **File**: `tasks.py` Lines 98-130 (create_storyteller_task)
216
+ - **Agent**: Storyteller Agent
217
+ - **Dependency**: Waits for Analyst task to complete
218
+ - **Execution Flow**:
219
+ ```
220
+ 1. Storyteller Agent receives Analyst's output as context
221
+ 2. LLM processes: "Create engaging headline and story"
222
+ 3. No tools used (only LLM)
223
+ 4. LLM generates:
224
+ - Catchy headline
225
+ - Engaging narrative
226
+ - Context and insights
227
+ 5. Task complete → Output stored
228
+ ```
229
+
230
+ ---
231
+
232
+ ### **Phase 4: Tool Execution Details**
233
+
234
+ #### Tool 1: `read_nba_data(limit)` (`tools.py` Lines 22-30)
235
+ ```
236
+ Input: limit (number of rows)
237
+ Execution:
238
+ 1. pd.read_csv(data_path)
239
+ 2. df.head(limit)
240
+ 3. Format as string
241
+ Output: Sample rows with column names
242
+ ```
243
+
244
+ #### Tool 2: `search_nba_data(query, column, value)` (`tools.py` Lines 32-71)
245
+ ```
246
+ Input: query (text), column (name), value (filter)
247
+ Execution:
248
+ 1. pd.read_csv(data_path)
249
+ 2. Apply filters if provided
250
+ 3. Text search across columns
251
+ 4. Limit to 50 rows max
252
+ Output: Filtered DataFrame as string
253
+ ```
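The filter, text search, and row cap described for Tool 2 could look roughly like the sketch below. It is an illustration only: `search_rows` is a hypothetical name, and it uses the standard-library `csv` module on in-memory text to stay dependency-free, whereas the tool described above reads the file with pandas.

```python
import csv
import io

MAX_ROWS = 50  # row cap from step 4 of the tool description


def search_rows(csv_text, query=None, column=None, value=None):
    """Hypothetical stand-in for search_nba_data, using csv instead of pandas."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Exact-match filter on one column, applied only if both are given.
    if column is not None and value is not None:
        rows = [r for r in rows if r.get(column) == str(value)]

    # Case-insensitive substring search across every column.
    if query:
        needle = query.lower()
        rows = [
            r for r in rows
            if any(needle in str(v or "").lower() for v in r.values())
        ]

    return rows[:MAX_ROWS]
```

Applying the column filter before the free-text search narrows the rows that the slower substring scan has to visit.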
254
+
255
+ #### Tool 3: `get_nba_data_summary()` (`tools.py` Lines 73-94)
256
+ ```
257
+ Input: None
258
+ Execution:
259
+ 1. pd.read_csv(data_path)
260
+ 2. Calculate: total rows, columns, unique players/teams
261
+ 3. Get date range
262
+ 4. Identify numeric columns
263
+ 5. Show sample rows
264
+ Output: Comprehensive dataset summary
265
+ ```
266
+
267
+ #### Tool 4: `semantic_search_nba_data(query)` (`tools.py` Lines 135-175)
268
+ ```
269
+ Input: query (natural language)
270
+ Execution:
271
+ 1. Get vector_db instance (vector_db.py)
272
+ 2. Check if indexed (if not, index CSV)
273
+ 3. Generate embedding for query
274
+ 4. Search in ChromaDB
275
+ 5. Return top N similar records
276
+ 6. Load original CSV rows
277
+ Output: Similar records with metadata
278
+ ```
279
+
280
+ **Vector DB Indexing** (`vector_db.py` Lines 94-156):
281
+ ```
282
+ First time only:
283
+ 1. Load SentenceTransformer model
284
+ 2. Read CSV
285
+ 3. For each row:
286
+ - Convert to text: "Player: X, Team: Y, Points: Z..."
287
+ - Generate embedding
288
+ - Store in ChromaDB with metadata
289
+ 4. Persist to disk (chroma_db/)
290
+ ```
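The index-then-query pattern above can be illustrated with a toy example. This sketch deliberately swaps in a word-count "embedding" and an in-memory list for SentenceTransformer and ChromaDB, so it shows the shape of the technique (row to text, text to vector, rank by cosine similarity) rather than the behaviour of `vector_db.py`. All function names here are hypothetical.

```python
import math
from collections import Counter


def row_to_text(row):
    """Flatten a record into 'Key: value, ...' text, as in indexing step 3."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    words = text.lower().replace(",", " ").replace(":", " ").split()
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(rows, query, top_n=3):
    """Index every row, then rank rows by similarity to the query."""
    index = [(row, embed(row_to_text(row))) for row in rows]
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), row) for row, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

A learned embedding model would also match paraphrases such as "high scoring games", which word counts cannot do. The ranking step is the same in both cases.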
291
+
292
+ #### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py` Lines 203-253)
293
+ ```
294
+ Input: pandas_code (string of pandas operations)
295
+ Execution:
296
+ 1. Load CSV into DataFrame 'df'
297
+ 2. Create a restricted namespace: {'pd': pandas, 'df': df} (this limits the names in scope; exec itself is not a sandbox)
298
+ 3. Execute: exec(f"result = {pandas_code}", namespace)
299
+ 4. Get result from namespace
300
+ 5. Format output:
301
+ - DataFrame → to_string()
302
+ - Series → to_string()
303
+ - Limit to 50 rows if large
304
+ Output: Analysis results as string
305
+ ```
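A minimal sketch of the evaluate-and-format step follows, under stated assumptions. The name `run_expression` is hypothetical, and it uses `eval` on a single expression, which has the same effect as the `exec(f"result = {pandas_code}", namespace)` form above when the code is one expression. Emptying `__builtins__` reduces what the code can name, but it is not a security boundary.

```python
def run_expression(code, namespace, max_lines=50):
    """Evaluate one expression string against an explicit namespace.

    Emptying __builtins__ narrows what the expression can name, but it is
    NOT a security sandbox; untrusted code needs real isolation.
    """
    scope = {"__builtins__": {}, **namespace}
    try:
        result = eval(code, scope)
    except Exception as exc:  # report the failure as text, as a tool would
        return f"Error executing code: {exc}"

    # Cap the rendered output so a large result cannot flood the LLM prompt.
    lines = str(result).splitlines()
    if len(lines) > max_lines:
        extra = len(lines) - max_lines
        lines = lines[:max_lines] + [f"... ({extra} more lines)"]
    return "\n".join(lines)
```

With pandas installed, the namespace would be `{"pd": pd, "df": df}` and `code` something like the `groupby` expression from Example A. The sketch itself needs only the standard library.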
306
+
307
+ ---
308
+
309
+ ### **Phase 5: LLM Interaction**
310
+
311
+ #### LLM Call Flow (`config.py` → LLM API)
312
+ ```
313
+ 1. Agent needs to process task
314
+ 2. Calls llm.call(prompt, ...)
315
+ 3. config.py routes to provider:
316
+
317
+ Hugging Face:
318
+ - Format: huggingface/{model_name}
319
+ - API: https://api-inference.huggingface.co
320
+ - Request: POST with prompt
321
+ - Response: Generated text
322
+
323
+ Ollama:
324
+ - Base URL: http://localhost:11434/v1
325
+ - OpenAI-compatible API
326
+ - Request: POST /chat/completions
327
+ - Response: Generated text
328
+
329
+ OpenRouter:
330
+ - Base URL: https://openrouter.ai/api/v1
331
+ - Request: POST with model name
332
+ - Response: Generated text
333
+
334
+ 4. LLM generates response
335
+ 5. Response returned to agent
336
+ 6. Agent processes response
337
+ 7. Agent decides next action (use tool? finish? ask for clarification?)
338
+ ```
339
+
340
+ ---
341
+
342
+ ### **Phase 6: Result Aggregation**
343
+
344
+ #### Result Collection (`app.py` Lines 39-80)
345
+ ```
346
+ After crew.kickoff() completes:
347
+
348
+ 1. Extract task outputs:
349
+ - result.tasks_output[0] → Engineer result
350
+ - result.tasks_output[1] → Analyst result
351
+ - result.tasks_output[2] → Storyteller result
352
+
353
+ 2. Format output:
354
+ - Add headers: "## Engineer Agent Results"
355
+ - Add separators: "---"
356
+ - Combine all outputs
357
+
358
+ 3. Store engineer result for reuse
359
+
360
+ 4. Return formatted string to Gradio
361
+ ```
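The formatting step can be sketched as a small pure function. Only the "Engineer Agent Results" header is quoted in this document; the other two titles and the name `format_results` are assumptions made for illustration.

```python
SECTION_TITLES = [
    "Engineer Agent Results",     # header text quoted in step 2 above
    "Analyst Agent Results",      # assumed by analogy
    "Storyteller Agent Results",  # assumed by analogy
]


def format_results(task_outputs):
    """Join per-task outputs into one Markdown string with headers and rules."""
    sections = []
    for title, output in zip(SECTION_TITLES, task_outputs):
        sections.append(f"## {title}\n\n{str(output).strip()}")
    # A horizontal rule between sections mirrors the "---" separators.
    return "\n\n---\n\n".join(sections)
```

In the application the inputs would be the three entries of `result.tasks_output`. Calling `str()` on each keeps the sketch independent of the exact output type.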
362
+
363
+ #### Gradio Display (`app.py` Lines 200-340)
364
+ ```
365
+ 1. User sees results in output textbox
366
+ 2. Engineer result stored in hidden state
367
+ 3. Can be reused for follow-up questions
368
+ ```
369
+
370
+ ---
371
+
372
+ ## 🔄 Parallel Execution Flow
373
+
374
+ ### How Tasks Run in Parallel (`crew.py` Lines 69-104)
375
+
376
+ ```
377
+ Time →
378
+
379
+ ├─ Task 1: Engineer (independent)
380
+ │ └─ Uses: get_nba_data_summary()
381
+
382
+ ├─ Task 2: Analyst (independent, runs in parallel)
383
+ │ └─ Uses: analyze_nba_data() or search_nba_data()
384
+
385
+ └─ Task 3: Storyteller (waits for Analyst)
386
+ └─ Uses: LLM only (no tools)
387
+ ```
388
+
389
+ **Key Points**:
390
+ - Engineer and Analyst run **simultaneously** (no dependencies)
391
+ - Storyteller runs **after** Analyst completes (has dependency)
392
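+ - CrewAI overlaps tasks only when they are marked `async_execution=True`; in the `Process.sequential` crew from Step 1.2, unmarked tasks run one after another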
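The dependency pattern in the diagram (two independent tasks, then one that waits for a result) can be shown with the standard library alone. This is a generic sketch of the scheduling idea, not CrewAI's API. `run_pipeline` and its three callables are hypothetical stand-ins for the Engineer, Analyst, and Storyteller tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(engineer, analyst, storyteller):
    """Run two independent callables concurrently, then a dependent third."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        engineer_future = pool.submit(engineer)
        analyst_future = pool.submit(analyst)
        # Blocks until the Analyst is done; the Engineer may still be running.
        analysis = analyst_future.result()
        story = storyteller(analysis)
        return engineer_future.result(), analysis, story
```

For example, `run_pipeline(lambda: "summary", lambda: "analysis", lambda a: f"story about {a}")` returns the three outputs in Engineer, Analyst, Storyteller order, which matches the indexing used when the results are collected.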
393
+
394
+ ---
395
+
396
+ ## 📊 Data Flow Diagram
397
+
398
+ ```
399
+ CSV File
400
+
401
+ [pandas.read_csv()]
402
+
403
+ DataFrame
404
+
405
+ ├─→ Tools (read, search, analyze)
406
+ │ ↓
407
+ │ Results → Agent → LLM → Response
408
+
409
+ └─→ Vector DB (semantic search)
410
+
411
+ [SentenceTransformer]
412
+
413
+ Embeddings
414
+
415
+ [ChromaDB]
416
+
417
+ Similar Records → Agent → LLM → Response
418
+ ```
419
+
420
+ ---
421
+
422
+ ## 🎯 Example: Complete Execution Trace
423
+
424
+ ### Input:
425
+ - CSV: `nba24-25.csv`
426
+ - Query: "Who are the top 5 three-point shooters?"
427
+
428
+ ### Execution:
429
+
430
+ 1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
431
+ 2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
432
+ 3. **agents.py**: Create Engineer, Analyst, Storyteller agents
433
+ 4. **config.py**: `get_llm()` → Returns Hugging Face LLM
434
+ 5. **crew.kickoff()** starts
435
+
436
+ 6. **Task 1 (Engineer)**:
437
+ - Agent: "I need to check the dataset"
438
+ - Tool: `get_nba_data_summary()`
439
+ - Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
440
+ - LLM: "Dataset loaded. 5000 rows, ready for analysis."
441
+
442
+ 7. **Task 2 (Analyst)** - Runs in parallel:
443
+ - Agent: "User wants top 5 three-point shooters"
444
+ - Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
445
+ - Execution:
446
+ ```python
447
+ df = pd.read_csv("nba24-25.csv")
448
+ result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
449
+ # Returns: Player1: 250, Player2: 245, ...
450
+ ```
451
+ - LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
452
+
453
+ 8. **Task 3 (Storyteller)** - After Analyst:
454
+ - Agent receives Analyst output
455
+ - LLM: "🏀 **Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."
456
+
457
+ 9. **app.py**: Combine all outputs
458
+ 10. **Gradio**: Display to user
459
+
460
+ ---
461
+
462
+ ## 🔧 Key Configuration Points
463
+
464
+ ### LLM Provider Selection (`config.py`)
465
+ - Environment variable: `LLM_PROVIDER`
466
+ - Options: `huggingface`, `ollama`, `openrouter`, `openai`
467
+ - Default: `huggingface`
468
+
469
+ ### Model Selection
470
+ - Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
471
+ - Ollama: `OLLAMA_MODEL` (default: `mistral`)
472
+ - OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)
473
+
474
+ ### Data Path
475
+ - Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
476
+ - Can be overridden by uploaded file
477
+
478
+ ---
479
+
480
+ ## 🐛 Error Handling
481
+
482
+ ### At Each Level:
483
+
484
+ 1. **app.py** (Lines 82-86):
485
+ - Try/except around `crew.kickoff()`
486
+ - Returns error message with traceback
487
+
488
+ 2. **Tools** (tools.py):
489
+ - Each tool has try/except
490
+ - Returns error message if fails
491
+
492
+ 3. **Vector DB** (vector_db.py):
493
+ - Handles missing files
494
+ - Creates directory if needed
495
+ - Handles indexing errors
496
+
497
+ 4. **LLM** (config.py):
498
+ - Validates API keys
499
+ - Raises ValueError if missing
500
+ - Handles API errors
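One way to get the "returns error message if fails" behaviour for every tool is a small decorator. The sketch below is an assumption about how this could be done, not a description of `tools.py`. Both `returns_error_text` and the example tool `read_first_line` are hypothetical.

```python
import functools


def returns_error_text(func):
    """Hypothetical decorator: convert any exception into an error string."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return f"Error in {func.__name__}: {exc}"
    return wrapper


@returns_error_text
def read_first_line(path):
    """Example tool body: read the header line of a file."""
    with open(path, encoding="utf-8") as handle:
        return handle.readline().rstrip("\n")
```

Returning the error as text lets the agent read the failure and choose another tool, instead of the whole crew run aborting on an exception.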
501
+
502
+ ---
503
+
504
+ ## 📝 Summary
505
+
506
+ **Input Flow**:
507
+ ```
508
+ User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
509
+ ```
510
+
511
+ **Output Flow**:
512
+ ```
513
+ LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
514
+ ```
515
+
516
+ **Key Points**:
517
+ - All agents share the same LLM instance
518
+ - Tools are stateless (read CSV each time)
519
+ - Vector DB is persistent (indexed once, reused)
520
+ - Tasks can run in parallel if no dependencies
521
+ - Results are aggregated and formatted in app.py
522
+
523
+ ---
524
+
525
+ **Last Updated**: Based on current codebase structure
526
+ **Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py
527
+
README.md CHANGED
@@ -9,46 +9,332 @@ app_file: app.py
9
  pinned: false
10
  ---
11
 
12
- # NBA Data Analysis with CrewAI
13
 
14
- An intelligent NBA data analysis application powered by CrewAI agents. Upload your NBA CSV data and get comprehensive analysis with insights, statistics, and engaging storylines.
15
 
16
- ## Features
17
 
 
18
  - 📊 **Data Engineering**: Automatic data cleaning and preparation
19
  - 🔍 **Intelligent Analysis**: AI-powered insights and pattern detection
20
  - 📈 **Statistical Analysis**: Top performers, trends, and key metrics
 
21
  - 📝 **Storytelling**: Engaging headlines and narratives from data
22
- - 🎯 **Semantic Search**: Natural language queries on your data
 
 
23
 
24
- ## How to Use
25
 
26
- 1. **Upload a CSV file** with NBA data
27
- 2. **Enter your analysis query** (or leave blank for comprehensive analysis)
28
- 3. **Click "Analyze Dataset"** and wait for results
29
- 4. **View insights** from Engineer, Analyst, and Storyteller agents
30
 
31
- ## Example Queries
32
 
33
  - "Who are the top 5 three-point shooters?"
34
  - "Show me the best scoring games this season"
35
  - "Which players have the highest field goal percentage?"
36
  - "Analyze team performance trends"
 
 
37
 
38
- ## Technology Stack
39
 
40
- - **CrewAI**: Multi-agent AI framework
41
- - **Gradio**: Web interface
42
- - **Pandas**: Data analysis
43
- - **ChromaDB**: Vector database for semantic search
44
- - **OpenRouter**: Free open-source LLM access
45
 
46
- ## Free to Use
47
 
48
- This application uses free-tier services:
49
- - OpenRouter for LLM access (free tier)
50
- - Hugging Face Spaces for hosting (free tier)
 
51
 
52
  ---
53
 
54
- Built with ❤️ using CrewAI
 
9
  pinned: false
10
  ---
11
 
12
+ # 🏀 NBA Data Analysis with CrewAI
13
 
14
+ An intelligent NBA data analysis application powered by the CrewAI multi-agent framework. Upload your NBA CSV data and get comprehensive analysis with insights, statistics, and engaging storylines generated by AI agents.
15
 
16
+ ## Features
17
 
18
+ - 🤖 **Multi-Agent AI System**: Three specialized agents (Engineer, Analyst, Storyteller) work together
19
  - 📊 **Data Engineering**: Automatic data cleaning and preparation
20
  - 🔍 **Intelligent Analysis**: AI-powered insights and pattern detection
21
  - 📈 **Statistical Analysis**: Top performers, trends, and key metrics
22
+ - 🔎 **Semantic Search**: Natural language queries on your data using vector embeddings
23
  - 📝 **Storytelling**: Engaging headlines and narratives from data
24
+ - 🎯 **Parallel Processing**: Tasks run in parallel for faster results
25
+ - 🌐 **Web Interface**: Easy-to-use Gradio web app
26
+ - 🆓 **Free & Open Source**: Uses free-tier open-source LLM models
27
 
28
+ ## 🏗️ Architecture
29
 
30
+ The application uses a multi-agent system with the following components:
 
 
 
31
 
32
+ - **Data Engineer Agent**: Processes and validates data
33
+ - **Data Analyst Agent**: Performs statistical analysis and extracts insights
34
+ - **Storyteller Agent**: Creates engaging narratives from analysis results
35
+
36
+ ### Tech Stack
37
+
38
+ - **CrewAI**: Multi-agent AI framework
39
+ - **Gradio**: Web interface
40
+ - **Pandas**: Data analysis
41
+ - **ChromaDB**: Vector database for semantic search
42
+ - **Sentence Transformers**: Embeddings for semantic search
43
+ - **Hugging Face / Ollama**: Open-source LLM providers
44
+
45
+ ## 📋 Prerequisites
46
+
47
+ - Python 3.11 or 3.12
48
+ - pip or uv package manager
49
+ - (Optional) Ollama for local testing
50
+
51
+ ## 🚀 Installation
52
+
53
+ ### 1. Clone the Repository
54
+
55
+ ```bash
56
+ git clone <your-repo-url>
57
+ cd NBA_Analysis
58
+ ```
59
+
60
+ ### 2. Install Dependencies
61
+
62
+ **Using uv (recommended):**
63
+ ```bash
64
+ uv sync
65
+ ```
66
+
67
+ **Using pip:**
68
+ ```bash
69
+ pip install -r requirements.txt
70
+ ```
71
+
72
+ ### 3. Prepare Your Data
73
+
74
+ Place your NBA CSV file in the project directory, or upload it through the web interface.
75
+
76
+ ## ⚙️ Configuration
77
+
78
+ ### LLM Provider Setup
79
+
80
+ The application supports multiple LLM providers. Configure via environment variables:
81
+
82
+ #### Option 1: Hugging Face (Recommended for Deployment)
83
+
84
+ 1. Get a free API token from [Hugging Face](https://huggingface.co/settings/tokens)
85
+ 2. Set environment variables:
86
+ ```bash
87
+ export LLM_PROVIDER=huggingface
88
+ export HF_API_KEY=your-hf-token
89
+ export HF_MODEL=meta-llama/Llama-3.1-8B-Instruct # or any HF model
90
+ ```
91
+
92
+ **Available Models:**
93
+ - `meta-llama/Llama-3.1-8B-Instruct` (default, best quality)
94
+ - `mistralai/Mistral-7B-Instruct-v0.2` (excellent quality)
95
+ - `Qwen/Qwen2.5-7B-Instruct` (multilingual, great quality)
96
+ - `meta-llama/Llama-3.2-3B-Instruct` (faster, smaller)
97
+
98
+ #### Option 2: Ollama (For Local Testing)
99
+
100
+ 1. Install Ollama: https://ollama.ai
101
+ 2. Start Ollama service:
102
+ ```bash
103
+ ollama serve
104
+ ```
105
+ 3. Download a model:
106
+ ```bash
107
+ ollama pull mistral # or llama3.2, qwen2.5:7b, etc.
108
+ ```
109
+ 4. Set environment variables:
110
+ ```bash
111
+ export LLM_PROVIDER=ollama
112
+ export OLLAMA_MODEL=mistral
113
+ export OLLAMA_BASE_URL=http://localhost:11434/v1
114
+ ```
115
+
116
+ #### Option 3: OpenRouter (Alternative Free Option)
117
+
118
+ 1. Get a free API key from [OpenRouter](https://openrouter.ai)
119
+ 2. Set environment variables:
120
+ ```bash
121
+ export LLM_PROVIDER=openrouter
122
+ export OPENROUTER_API_KEY=your-key
123
+ export OPENROUTER_MODEL=google/gemma-2-2b-it:free
124
+ ```
125
+
126
+ ### Default Configuration
127
+
128
+ The application defaults to **Hugging Face** with the **Llama 3.1 8B Instruct** model. No further configuration is needed once `HF_API_KEY` is set.
129
+
130
+ ## 🎮 Usage
131
+
132
+ ### Web Interface (Recommended)
133
+
134
+ ```bash
135
+ python app.py
136
+ ```
137
+
138
+ Then open your browser to the URL shown (usually `http://localhost:7860`).
139
+
140
+ **Features:**
141
+ - Upload CSV file
142
+ - Enter analysis query (or leave blank for comprehensive analysis)
143
+ - Click "Analyze Dataset" for full analysis
144
+ - Click "Analyze with Question" for quick queries
145
+
146
+ ### Command Line
147
+
148
+ ```bash
149
+ python main.py
150
+ ```
151
+
152
+ ## 📖 Example Queries
153
 
154
  - "Who are the top 5 three-point shooters?"
155
  - "Show me the best scoring games this season"
156
  - "Which players have the highest field goal percentage?"
157
  - "Analyze team performance trends"
158
+ - "Find games with triple doubles"
159
+ - "What are the most efficient shooters?"
160
 
161
+ ## 🛠️ Project Structure
162
 
163
+ ```
164
+ NBA_Analysis/
165
+ ├── app.py # Gradio web interface
166
+ ├── main.py # Command-line entry point
167
+ ├── config.py # LLM and configuration settings
168
+ ├── agents.py # AI agent definitions
169
+ ├── crew.py # CrewAI crew orchestration
170
+ ├── tasks.py # Task definitions
171
+ ├── tools.py # Data access tools for agents
172
+ ├── vector_db.py # Vector database for semantic search
173
+ ├── requirements.txt # Python dependencies
174
+ ├── pyproject.toml # Project configuration
175
+ ├── test_local.sh # Script for local testing with Ollama
176
+ ├── EXECUTION_FLOW.md # Detailed execution flow documentation
177
+ └── README.md # This file
178
+ ```
179
+
180
+ ## 🔧 Available Tools
181
+
182
+ The agents have access to 5 data tools:
183
+
184
+ 1. **read_nba_data**: Read sample rows to understand structure
185
+ 2. **search_nba_data**: Filter and search CSV data
186
+ 3. **get_nba_data_summary**: Get comprehensive dataset overview
187
+ 4. **semantic_search_nba_data**: Natural language semantic search
188
+ 5. **analyze_nba_data**: Execute pandas operations for advanced analysis
189
+
190
+ ## 🚀 Deployment
191
+
192
+ ### Hugging Face Spaces (Free)
193
+
194
+ 1. **Get API Keys:**
195
+ - Hugging Face token: https://huggingface.co/settings/tokens
196
+ - (Optional) OpenRouter key: https://openrouter.ai
197
+
198
+ 2. **Create Space:**
199
+ - Go to https://huggingface.co/spaces
200
+ - Create new Space with Gradio SDK
201
+ - Push your code
202
+
203
+ 3. **Set Secrets:**
204
+ - Space Settings → Repository secrets
205
+ - Add `HF_API_KEY` = your Hugging Face token
206
+ - (Optional) Add `LLM_PROVIDER` = `huggingface`
207
+ - (Optional) Add `HF_MODEL` = your preferred model
208
+
209
+ 4. **Deploy:**
210
+ ```bash
211
+ git remote add hf https://huggingface.co/spaces/yourusername/nba-analysis
212
+ git push hf main
213
+ ```
214
+
215
+ See `EXECUTION_FLOW.md` for a detailed description of how a request flows through the deployed application.
216
+
217
+ ## 🧪 Local Testing
218
+
219
+ ### Quick Test with Ollama
220
+
221
+ ```bash
222
+ # Make sure Ollama is running
223
+ ollama serve
224
+
225
+ # Run test script
226
+ ./test_local.sh
227
+ ```
228
+
229
+ Or manually:
230
+ ```bash
231
+ export LLM_PROVIDER=ollama
232
+ export OLLAMA_MODEL=mistral
233
+ export OLLAMA_BASE_URL=http://localhost:11434/v1
234
+ python app.py
235
+ ```
236
+
237
+ ## 📊 How It Works
238
+
239
+ 1. **User Input**: Upload CSV + enter query
240
+ 2. **Crew Creation**: Three agents are initialized with their roles
241
+ 3. **Parallel Execution**:
242
+ - Engineer validates data
243
+ - Analyst performs analysis (runs in parallel)
244
+ - Storyteller creates narrative (waits for Analyst)
245
+ 4. **Tool Execution**: Agents use tools to access and analyze data
246
+ 5. **LLM Processing**: AI generates insights and responses
247
+ 6. **Result Aggregation**: All outputs are combined and formatted
248
+ 7. **Display**: Results shown to user
249
+
250
+ See `EXECUTION_FLOW.md` for detailed flow documentation.
251
+
252
+ ## 🎯 Key Features Explained
253
+
254
+ ### Semantic Search
255
+ Uses vector embeddings to find semantically similar records. The first run indexes the CSV; subsequent runs reuse the persisted embeddings.
256
+
257
+ ### Parallel Processing
258
+ Engineer and Analyst tasks run simultaneously for faster results. Storyteller waits for Analyst to complete.
259
+
260
+ ### Multi-Agent Collaboration
261
+ Each agent has a specialized role:
262
+ - **Engineer**: Data quality and structure
263
+ - **Analyst**: Statistical analysis and insights
264
+ - **Storyteller**: Narrative and presentation
265
+
266
+ ## 🔒 Environment Variables
267
+
268
+ | Variable | Description | Default |
269
+ |----------|-------------|---------|
270
+ | `LLM_PROVIDER` | LLM provider (`huggingface`, `ollama`, `openrouter`) | `huggingface` |
271
+ | `HF_API_KEY` | Hugging Face API token | Required if using HF |
272
+ | `HF_MODEL` | Hugging Face model name | `meta-llama/Llama-3.1-8B-Instruct` |
273
+ | `OLLAMA_MODEL` | Ollama model name | `mistral` |
274
+ | `OLLAMA_BASE_URL` | Ollama server URL | `http://localhost:11434/v1` |
275
+ | `OPENROUTER_API_KEY` | OpenRouter API key | Required if using OpenRouter |
276
+ | `OPENROUTER_MODEL` | OpenRouter model name | `google/gemma-2-2b-it:free` |
277
+
278
+ ## 🐛 Troubleshooting
279
+
280
+ ### "ModuleNotFoundError: No module named 'crewai'"
281
+ - Install dependencies: `pip install -r requirements.txt` or `uv sync`
282
+
283
+ ### "HF_API_KEY not set"
284
+ - Set your Hugging Face token as environment variable or in Space secrets
285
+
286
+ ### "Connection refused" (Ollama)
287
+ - Make sure `ollama serve` is running
288
+ - Check port 11434 is available
289
+
290
+ ### "Model not found" (Ollama)
291
+ - Download the model: `ollama pull mistral`
292
+ - List models: `ollama list`
293
+
294
+ ### Slow responses
295
+ - Use smaller models (Llama 3.2 3B instead of 8B)
296
+ - Check your internet connection for API calls
297
+ - For local: Use faster models like `llama3.2`
298
+
299
+ ## 📝 License
300
+
301
+ This project is open source. Check individual dependencies for their licenses.
302
+
303
+ ## 🤝 Contributing
304
+
305
+ Contributions are welcome! Please feel free to submit a Pull Request.
306
+
307
+ ## 📚 Documentation
308
+
309
+ - **Execution Flow**: See `EXECUTION_FLOW.md` for detailed flow
310
+ - **CrewAI Docs**: https://docs.crewai.com
311
+ - **Gradio Docs**: https://gradio.app/docs
312
+
313
+ ## 🎓 What Was Built
314
+
315
+ This project demonstrates:
316
+ - Multi-agent AI systems with CrewAI
317
+ - Parallel task execution
318
+ - Semantic search with vector databases
319
+ - Integration with multiple LLM providers
320
+ - Web interface with Gradio
321
+ - Free-tier deployment on Hugging Face Spaces
322
+
323
+ ## 💡 Tips
324
+
325
+ - **First Run**: Vector DB indexing takes time on first use
326
+ - **Large Files**: Use semantic search for large datasets
327
+ - **Complex Queries**: Use "Analyze with Question" for specific queries
328
+ - **Model Selection**: Larger models = better quality, slower speed
329
+ - **Local Testing**: Use Ollama for faster iteration
330
 
331
+ ## 🔗 Links
332
 
333
+ - **Hugging Face**: https://huggingface.co
334
+ - **Ollama**: https://ollama.ai
335
+ - **OpenRouter**: https://openrouter.ai
336
+ - **CrewAI**: https://docs.crewai.com
337
 
338
  ---
339
 
340
+ **Built with ❤️ using CrewAI and open-source LLMs**