# Detailed Execution Flow - NBA Analysis Application

This document explains, step by step, how user input flows through the application and how each stage executes.

---

## 🎯 High-Level Flow Overview

```
User Input (CSV + Query)
    ↓
app.py (Gradio Interface)
    ↓
crew.py (CrewAI Orchestration)
    ↓
agents.py (AI Agents)
    ↓
tasks.py (Task Definitions)
    ↓
tools.py (Data Access Tools)
    ↓
vector_db.py / pandas (Data Processing)
    ↓
config.py (LLM Configuration)
    ↓
LLM API (Hugging Face / Ollama / etc.)
    ↓
Results → User
```

---

## 📋 Detailed Step-by-Step Execution

### **Phase 1: User Input & Initialization**

#### Step 1.1: User Interaction (`app.py`)
- **File**: `app.py`
- **Function**: `process_file_and_analyze()` or `process_question_only()`
- **Input**: 
  - CSV file (uploaded via Gradio)
  - User query (optional text)
- **What happens**:
  ```python
  # Line 23-24: Validate file exists
  if file is None:
      return "Please upload a CSV file."
  
  # Line 27-28: Set default query if empty
  if not user_query:
      user_query = "Provide comprehensive analysis..."
  
  # Line 32-33: Extract file path
  file_path = file.name
  csv_path = file_path
  ```

#### Step 1.2: Crew Creation (`crew.py`)
- **File**: `crew.py`
- **Function**: `create_flow_crew(user_query, csv_path)`
- **What happens**:
  ```python
  # Line 82-84: Create all agents
  engineer_agent = create_engineer_agent(csv_path)
  analyst_agent = create_analyst_agent(csv_path)
  storyteller_agent = create_storyteller_agent()
  
  # Line 88-94: Create tasks
  data_engineering_task = create_data_engineering_task(...)
  custom_analysis_task = create_custom_analysis_task(...)
  storyteller_task = create_storyteller_task(...)
  
  # Line 99-104: Create Crew with agents and tasks
  return Crew(agents=[...], tasks=[...], process=Process.sequential)
  ```

---

### **Phase 2: Agent Initialization**

#### Step 2.1: LLM Configuration (`config.py`)
- **File**: `config.py`
- **Function**: `get_llm()`
- **What happens**:
  ```python
  # Line 13: Check provider (default: "huggingface")
  LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
  
  # Line 54-64: Create LLM instance based on provider
  if LLM_PROVIDER == "huggingface":
      return LLM(
          model=f"huggingface/{HF_MODEL}",
          api_key=HF_API_KEY
      )
  # Similar for ollama, openrouter, etc.
  ```
- **Output**: Configured LLM instance (used by all agents)

#### Step 2.2: Agent Creation (`agents.py`)
- **File**: `agents.py`
- **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
- **What happens**:

**Engineer Agent** (Lines 12-36):
  ```python
  # Line 22-23: Get data path and tools
  data_path = csv_path or NBA_DATA_PATH
  agent_tools = get_agent_tools(data_path)
  
  # Line 25-36: Create agent with:
  - role: "Data Engineer"
  - goal: Process and clean data
  - backstory: Expert data engineer description
  - llm: Shared LLM instance
  - tools: Data access tools (read, search, analyze)
  ```

**Analyst Agent** (Lines 39-69):
  ```python
  # Similar structure but with:
  - role: "Data Analyst"
  - goal: Extract insights and patterns
  - backstory: Includes instructions to use analyze_nba_data for aggregations
  - tools: Same data tools
  ```

**Storyteller Agent** (Lines 72-93):
  ```python
  - role: "Sports Storyteller"
  - goal: Create engaging headlines from analysis
  - tools: [] (no data tools, only uses LLM)
  ```

#### Step 2.3: Tools Initialization (`tools.py`)
- **File**: `tools.py`
- **Function**: `get_agent_tools(data_path)`
- **What happens**:
  ```python
  # Returns list of 5 tools:
  1. read_nba_data(limit) - Read sample rows
  2. search_nba_data(query, column, value) - Filter/search CSV
  3. get_nba_data_summary() - Get dataset overview
  4. semantic_search_nba_data(query) - Vector search
  5. analyze_nba_data(pandas_code) - Execute pandas operations
  ```
- **Note**: Each tool is wrapped with the `@tool` decorator so CrewAI can expose it to agents

---

### **Phase 3: Task Execution**

#### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)
- **File**: `app.py` Line 36-37
- **What happens**:
  ```python
  crew = create_flow_crew(user_query.strip(), csv_path)
  result = crew.kickoff()  # This triggers execution
  ```

#### Step 3.2: Task 1 - Data Engineering (`tasks.py`)
- **File**: `tasks.py` Lines 8-40
- **Task**: `create_data_engineering_task()`
- **Agent**: Engineer Agent
- **Execution Flow**:
  ```
  1. Engineer Agent receives task description
  2. LLM processes task: "Examine dataset, get summary..."
  3. Agent decides to use: get_nba_data_summary()
  4. Tool execution (tools.py):
     - Reads CSV with pandas
     - Calculates stats (rows, columns, unique values)
     - Returns formatted summary
  5. LLM receives tool output
  6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
  7. Task complete → Output stored
  ```

#### Step 3.3: Task 2 - Data Analysis (`tasks.py`)
- **File**: `tasks.py` Lines 55-95 (create_custom_analysis_task)
- **Agent**: Analyst Agent
- **Execution Flow**:
  ```
  1. Analyst Agent receives user query + task description
  2. LLM analyzes query: "What does user want?"
  3. Agent decides which tools to use:
     - For aggregations → analyze_nba_data()
     - For searches → search_nba_data() or semantic_search_nba_data()
     - For overview → get_nba_data_summary()
  
  4. Tool Execution Examples:
  
  Example A: "Top 5 three-point shooters"
    - Agent generates pandas code:
      df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
    - analyze_nba_data() executes code
    - Returns DataFrame with results
    - LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
  
  Example B: "Find LeBron James games"
    - Agent uses search_nba_data(query="LeBron James")
    - Tool filters CSV, returns matching rows
    - LLM analyzes results, provides insights
  
  Example C: "High scoring games"
    - Agent uses semantic_search_nba_data("high scoring games")
    - Vector DB finds semantically similar records
    - Returns top matches with similarity scores
    - LLM provides analysis
  
  5. LLM generates final analysis report
  6. Task complete → Output stored
  ```

#### Step 3.4: Task 3 - Storytelling (`tasks.py`)
- **File**: `tasks.py` Lines 98-130 (create_storyteller_task)
- **Agent**: Storyteller Agent
- **Dependency**: Waits for Analyst task to complete
- **Execution Flow**:
  ```
  1. Storyteller Agent receives Analyst's output as context
  2. LLM processes: "Create engaging headline and story"
  3. No tools used (only LLM)
  4. LLM generates:
     - Catchy headline
     - Engaging narrative
     - Context and insights
  5. Task complete → Output stored
  ```

---

### **Phase 4: Tool Execution Details**

#### Tool 1: `read_nba_data(limit)` (`tools.py` Lines 22-30)
```
Input: limit (number of rows)
Execution:
  1. pd.read_csv(data_path)
  2. df.head(limit)
  3. Format as string
Output: Sample rows with column names
```

#### Tool 2: `search_nba_data(query, column, value)` (`tools.py` Lines 32-71)
```
Input: query (text), column (name), value (filter)
Execution:
  1. pd.read_csv(data_path)
  2. Apply filters if provided
  3. Text search across columns
  4. Limit to 50 rows max
Output: Filtered DataFrame as string
```
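The filter-then-text-search behaviour above can be sketched in plain pandas. This is an illustrative stand-in, not the tool's actual code; the function name and sample column values are assumptions:

```python
import pandas as pd

def search_rows(df: pd.DataFrame, query: str = "", column: str = "",
                value=None, max_rows: int = 50) -> pd.DataFrame:
    """Apply an exact column filter first, then a text search over all columns."""
    out = df
    if column and value is not None:
        out = out[out[column] == value]          # structured filter
    if query:
        # Join every row into one string and search it case-insensitively
        text = out.astype(str).agg(" ".join, axis=1)
        out = out[text.str.contains(query, case=False, regex=False)]
    return out.head(max_rows)                    # cap at 50 rows, as described

games = pd.DataFrame({"Player": ["LeBron James", "Stephen Curry"],
                      "Team": ["LAL", "GSW"]})
print(search_rows(games, query="lebron"))
```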

#### Tool 3: `get_nba_data_summary()` (`tools.py` Lines 73-94)
```
Input: None
Execution:
  1. pd.read_csv(data_path)
  2. Calculate: total rows, columns, unique players/teams
  3. Get date range
  4. Identify numeric columns
  5. Show sample rows
Output: Comprehensive dataset summary
```
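The summary steps can be sketched against an in-memory frame standing in for `pd.read_csv(data_path)`; the column names here are assumptions, not necessarily the real dataset's schema:

```python
import pandas as pd

# In-memory stand-in for the CSV read; illustrative columns only.
df = pd.DataFrame({
    "Player": ["LeBron James", "Stephen Curry", "LeBron James"],
    "Team": ["LAL", "GSW", "LAL"],
    "PTS": [25, 31, 28],
})
summary = (
    f"Rows: {len(df)}, Columns: {len(df.columns)}\n"
    f"Unique players: {df['Player'].nunique()}, unique teams: {df['Team'].nunique()}\n"
    f"Numeric columns: {', '.join(df.select_dtypes('number').columns)}\n"
    f"Sample:\n{df.head(2).to_string()}"
)
print(summary)
```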

#### Tool 4: `semantic_search_nba_data(query)` (`tools.py` Lines 135-175)
```
Input: query (natural language)
Execution:
  1. Get vector_db instance (vector_db.py)
  2. Check if indexed (if not, index CSV)
  3. Generate embedding for query
  4. Search in ChromaDB
  5. Return top N similar records
  6. Load original CSV rows
Output: Similar records with metadata
```

**Vector DB Indexing** (`vector_db.py` Lines 94-156):
```
First time only:
  1. Load SentenceTransformer model
  2. Read CSV
  3. For each row:
     - Convert to text: "Player: X, Team: Y, Points: Z..."
     - Generate embedding
     - Store in ChromaDB with metadata
  4. Persist to disk (chroma_db/)
```
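The row-to-text step might look like this minimal sketch (the helper name and field order are assumptions):

```python
def row_to_text(row: dict) -> str:
    """Flatten one CSV row into the "key: value" text that gets embedded."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())

doc = row_to_text({"Player": "LeBron James", "Team": "LAL", "PTS": 28})
# doc would then be embedded with SentenceTransformer and stored in ChromaDB
```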

#### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py` Lines 203-253)
```
Input: pandas_code (string of pandas operations)
Execution:
  1. Load CSV into DataFrame 'df'
  2. Create safe namespace: {'pd': pandas, 'df': df}
  3. Execute: exec(f"result = {pandas_code}", namespace)
  4. Get result from namespace
  5. Format output:
     - DataFrame → to_string()
     - Series → to_string()
     - Limit to 50 rows if large
Output: Analysis results as string
```
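The restricted-`exec` pattern above can be sketched as follows; `analyze` is an illustrative stand-in under the assumptions in the step list, not the tool's actual implementation:

```python
import pandas as pd

def analyze(df: pd.DataFrame, pandas_code: str, max_rows: int = 50) -> str:
    """Evaluate one pandas expression in a restricted namespace (sketch)."""
    namespace = {"pd": pd, "df": df}  # only pd and df are visible to the code
    try:
        exec(f"result = {pandas_code}", namespace)
    except Exception as exc:
        return f"Error executing pandas code: {exc}"
    result = namespace["result"]
    if isinstance(result, (pd.DataFrame, pd.Series)):
        return result.head(max_rows).to_string()  # cap large outputs
    return str(result)

shots = pd.DataFrame({"Player": ["A", "B", "A"], "3P": [3, 5, 4]})
print(analyze(shots, "df.groupby('Player')['3P'].sum().sort_values(ascending=False)"))
```

Note that `exec` on model-generated code is only lightly sandboxed here; the namespace restriction limits what names the code can reach but is not a security boundary.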

---

### **Phase 5: LLM Interaction**

#### LLM Call Flow (`config.py` → LLM API)
```
1. Agent needs to process task
2. Calls llm.call(prompt, ...)
3. config.py routes to provider:
   
   Hugging Face:
   - Format: huggingface/{model_name}
   - API: https://api-inference.huggingface.co
   - Request: POST with prompt
   - Response: Generated text
   
   Ollama:
   - Base URL: http://localhost:11434/v1
   - OpenAI-compatible API
   - Request: POST /chat/completions
   - Response: Generated text
   
   OpenRouter:
   - Base URL: https://openrouter.ai/api/v1
   - Request: POST with model name
   - Response: Generated text

4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)
```
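For the OpenAI-compatible providers (Ollama, OpenRouter), the request body has roughly this shape. This sketches only the payload, not the app's client code; the helper name is illustrative:

```python
import json

def build_chat_request(model: str, prompt: str) -> dict:
    # Minimal OpenAI-compatible /chat/completions payload, e.g. for Ollama
    # at http://localhost:11434/v1 (base URL from the flow above).
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = json.dumps(build_chat_request("mistral", "Summarize the dataset"))
```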

---

### **Phase 6: Result Aggregation**

#### Result Collection (`app.py` Lines 39-80)
```
After crew.kickoff() completes:

1. Extract task outputs:
   - result.tasks_output[0] → Engineer result
   - result.tasks_output[1] → Analyst result
   - result.tasks_output[2] → Storyteller result

2. Format output:
   - Add headers: "## Engineer Agent Results"
   - Add separators: "---"
   - Combine all outputs

3. Store engineer result for reuse

4. Return formatted string to Gradio
```
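The aggregation step might be sketched as follows (header text taken from this document; the function name is illustrative):

```python
def format_results(engineer: str, analyst: str, storyteller: str) -> str:
    """Combine the three task outputs with headers and separators (sketch)."""
    sections = [
        "## Engineer Agent Results\n\n" + engineer,
        "## Analyst Agent Results\n\n" + analyst,
        "## Storyteller Agent Results\n\n" + storyteller,
    ]
    return "\n\n---\n\n".join(sections)

report = format_results("Dataset loaded.", "Top 5 shooters: ...", "Headline: ...")
```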

#### Gradio Display (`app.py` Lines 200-340)
```
1. User sees results in output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions
```

---

## 🔄 Task Execution Order

### How Tasks Are Ordered (`crew.py` Lines 69-104)

```
Time →
│
├─ Task 1: Engineer (first)
│  └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst (second)
│  └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (last, needs Analyst's output)
   └─ Uses: LLM only (no tools)
```

**Key Points**:
- The Crew is created with `process=Process.sequential`, so tasks execute one after another: Engineer → Analyst → Storyteller
- Engineer and Analyst have no data dependency on each other; their order comes only from their position in the task list
- Storyteller receives the Analyst's output as context, so it must run last

---

## 📊 Data Flow Diagram

```
CSV File
    ↓
[pandas.read_csv()]
    ↓
DataFrame
    ↓
    ├─→ Tools (read, search, analyze)
    │       ↓
    │   Results → Agent → LLM → Response
    │
    └─→ Vector DB (semantic search)
            ↓
        [SentenceTransformer]
            ↓
        Embeddings
            ↓
        [ChromaDB]
            ↓
        Similar Records → Agent → LLM → Response
```

---

## 🎯 Example: Complete Execution Trace

### Input:
- CSV: `nba24-25.csv`
- Query: "Who are the top 5 three-point shooters?"

### Execution:

1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
3. **agents.py**: Create Engineer, Analyst, Storyteller agents
4. **config.py**: `get_llm()` → Returns Hugging Face LLM
5. **crew.kickoff()** starts

6. **Task 1 (Engineer)**:
   - Agent: "I need to check the dataset"
   - Tool: `get_nba_data_summary()`
   - Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
   - LLM: "Dataset loaded. 5000 rows, ready for analysis."

7. **Task 2 (Analyst)** - Runs after the Engineer task:
   - Agent: "User wants top 5 three-point shooters"
   - Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
   - Execution:
     ```python
     df = pd.read_csv("nba24-25.csv")
     result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
     # Returns: Player1: 250, Player2: 245, ...
     ```
   - LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."

8. **Task 3 (Storyteller)** - After Analyst:
   - Agent receives Analyst output
   - LLM: "πŸ€ **Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."

9. **app.py**: Combine all outputs
10. **Gradio**: Display to user

---

## 🔧 Key Configuration Points

### LLM Provider Selection (`config.py`)
- Environment variable: `LLM_PROVIDER`
- Options: `huggingface`, `ollama`, `openrouter`, `openai`
- Default: `huggingface`

### Model Selection
- Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
- Ollama: `OLLAMA_MODEL` (default: `mistral`)
- OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)

### Data Path
- Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
- Can be overridden by uploaded file
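Reading these settings reduces to a few `os.getenv` lookups; a sketch using the defaults listed above (the dictionary itself is illustrative, not `config.py`'s actual structure):

```python
import os

# Provider selection mirrors config.py's environment-variable scheme;
# defaults are the ones documented above.
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
MODEL_DEFAULTS = {
    "huggingface": "meta-llama/Llama-3.1-8B-Instruct",
    "ollama": "mistral",
    "openrouter": "google/gemma-2-2b-it:free",
}
model = MODEL_DEFAULTS.get(LLM_PROVIDER)  # None for unlisted providers
```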

---

## πŸ› Error Handling

### At Each Level:

1. **app.py** (Lines 82-86):
   - Try/except around `crew.kickoff()`
   - Returns error message with traceback

2. **Tools** (tools.py):
   - Each tool has try/except
   - Returns error message if fails

3. **Vector DB** (vector_db.py):
   - Handles missing files
   - Creates directory if needed
   - Handles indexing errors

4. **LLM** (config.py):
   - Validates API keys
   - Raises ValueError if missing
   - Handles API errors
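The app-level pattern (a try/except around `crew.kickoff()` that returns an error message with a traceback) can be sketched as:

```python
import traceback

def run_safely(kickoff):
    """Wrap a kickoff callable, returning either its result or a traceback.
    Sketch of the app.py pattern; the function name is illustrative."""
    try:
        return str(kickoff())
    except Exception:
        return "Error during analysis:\n" + traceback.format_exc()
```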

---

## πŸ“ Summary

**Input Flow**:
```
User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
```

**Output Flow**:
```
LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
```

**Key Points**:
- All agents share the same LLM instance
- Tools are stateless (read CSV each time)
- Vector DB is persistent (indexed once, reused)
- Tasks run in a fixed order (`Process.sequential`); Storyteller additionally depends on the Analyst's output
- Results are aggregated and formatted in app.py

---

**Last Updated**: Based on current codebase structure
**Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py