# Detailed Execution Flow - NBA Analysis Application
This document explains, step by step, how user input flows through the application and how each stage processes it.
---
## 🎯 High-Level Flow Overview
```
User Input (CSV + Query)
↓
app.py (Gradio Interface)
↓
crew.py (CrewAI Orchestration)
↓
agents.py (AI Agents)
↓
tasks.py (Task Definitions)
↓
tools.py (Data Access Tools)
↓
vector_db.py / pandas (Data Processing)
↓
config.py (LLM Configuration)
↓
LLM API (Hugging Face / Ollama / etc.)
↓
Results → User
```
---
## 📋 Detailed Step-by-Step Execution
### **Phase 1: User Input & Initialization**
#### Step 1.1: User Interaction (`app.py`)
- **File**: `app.py`
- **Function**: `process_file_and_analyze()` or `process_question_only()`
- **Input**:
- CSV file (uploaded via Gradio)
- User query (optional text)
- **What happens**:
```python
# Line 23-24: Validate file exists
if file is None:
    return "Please upload a CSV file."

# Line 27-28: Set default query if empty
if not user_query:
    user_query = "Provide comprehensive analysis..."

# Line 32-33: Extract file path
file_path = file.name
csv_path = file_path
```
#### Step 1.2: Crew Creation (`crew.py`)
- **File**: `crew.py`
- **Function**: `create_flow_crew(user_query, csv_path)`
- **What happens**:
```python
# Line 82-84: Create all agents
engineer_agent = create_engineer_agent(csv_path)
analyst_agent = create_analyst_agent(csv_path)
storyteller_agent = create_storyteller_agent()
# Line 88-94: Create tasks
data_engineering_task = create_data_engineering_task(...)
custom_analysis_task = create_custom_analysis_task(...)
storyteller_task = create_storyteller_task(...)
# Line 99-104: Create Crew with agents and tasks
return Crew(agents=[...], tasks=[...], process=Process.sequential)
```
---
### **Phase 2: Agent Initialization**
#### Step 2.1: LLM Configuration (`config.py`)
- **File**: `config.py`
- **Function**: `get_llm()`
- **What happens**:
```python
# Line 13: Check provider (default: "huggingface")
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
# Line 54-64: Create LLM instance based on provider
if LLM_PROVIDER == "huggingface":
    return LLM(
        model=f"huggingface/{HF_MODEL}",
        api_key=HF_API_KEY,
    )
# Similar for ollama, openrouter, etc.
```
- **Output**: Configured LLM instance (used by all agents)
#### Step 2.2: Agent Creation (`agents.py`)
- **File**: `agents.py`
- **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
- **What happens**:
**Engineer Agent** (Lines 12-36):
```python
# Line 22-23: Get data path and tools
data_path = csv_path or NBA_DATA_PATH
agent_tools = get_agent_tools(data_path)
# Line 25-36: Create agent with:
#   role: "Data Engineer"
#   goal: process and clean data
#   backstory: expert data engineer description
#   llm: shared LLM instance
#   tools: data access tools (read, search, analyze)
```
**Analyst Agent** (Lines 39-69):
```python
# Similar structure, but with:
#   role: "Data Analyst"
#   goal: extract insights and patterns
#   backstory: includes instructions to use analyze_nba_data for aggregations
#   tools: same data tools
```
**Storyteller Agent** (Lines 72-93):
```python
#   role: "Sports Storyteller"
#   goal: create engaging headlines from analysis
#   tools: [] (no data tools, only uses the LLM)
```
#### Step 2.3: Tools Initialization (`tools.py`)
- **File**: `tools.py`
- **Function**: `get_agent_tools(data_path)`
- **What happens**:
```python
# Returns a list of 5 tools:
#   1. read_nba_data(limit)                  - read sample rows
#   2. search_nba_data(query, column, value) - filter/search the CSV
#   3. get_nba_data_summary()                - dataset overview
#   4. semantic_search_nba_data(query)       - vector search
#   5. analyze_nba_data(pandas_code)         - execute pandas operations
```
- **Note**: Each tool is wrapped with `@tool` decorator for CrewAI
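Conceptually, a `@tool` decorator attaches a name and description that the framework shows to the LLM so it can decide when to call the tool. The following is a simplified stdlib stand-in for that pattern, not CrewAI's actual implementation; the tool body is a placeholder:

```python
import functools

def tool(description):
    """Simplified stand-in for a framework @tool decorator: it attaches
    metadata (name, description) that an agent framework would expose
    to the LLM when listing available tools."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Metadata the agent framework reads when describing the tool
        wrapper.name = func.__name__
        wrapper.description = description
        return wrapper
    return decorator

@tool("Read a sample of rows from the NBA dataset")
def read_nba_data(limit: int = 5) -> str:
    # Placeholder body; the real tool reads the CSV with pandas
    return f"(first {limit} rows of the CSV would be returned here)"

print(read_nba_data.name)         # read_nba_data
print(read_nba_data.description)  # Read a sample of rows from the NBA dataset
```

The real CrewAI decorator also generates an argument schema from the function signature; the sketch only captures the metadata idea.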
---
### **Phase 3: Task Execution**
#### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)
- **File**: `app.py` Line 36-37
- **What happens**:
```python
crew = create_flow_crew(user_query.strip(), csv_path)
result = crew.kickoff() # This triggers execution
```
#### Step 3.2: Task 1 - Data Engineering (`tasks.py`)
- **File**: `tasks.py` Lines 8-40
- **Task**: `create_data_engineering_task()`
- **Agent**: Engineer Agent
- **Execution Flow**:
```
1. Engineer Agent receives task description
2. LLM processes task: "Examine dataset, get summary..."
3. Agent decides to use: get_nba_data_summary()
4. Tool execution (tools.py):
- Reads CSV with pandas
- Calculates stats (rows, columns, unique values)
- Returns formatted summary
5. LLM receives tool output
6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
7. Task complete → Output stored
```
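The summary step above can be sketched with only the stdlib `csv` module (the real tool uses pandas; the function name and sample data below are illustrative):

```python
import csv
import io

def summarize_csv(csv_text: str) -> dict:
    """Minimal sketch of what a summary tool like get_nba_data_summary()
    computes: row count, column names, and unique values for a key column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    players = {row["Player"] for row in rows if "Player" in row}
    return {
        "total_rows": len(rows),
        "columns": columns,
        "unique_players": len(players),
    }

sample = (
    "Player,Team,PTS\n"
    "LeBron James,LAL,25\n"
    "Stephen Curry,GSW,30\n"
    "LeBron James,LAL,28\n"
)
print(summarize_csv(sample))
# {'total_rows': 3, 'columns': ['Player', 'Team', 'PTS'], 'unique_players': 2}
```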
#### Step 3.3: Task 2 - Data Analysis (`tasks.py`)
- **File**: `tasks.py` Lines 55-95 (create_custom_analysis_task)
- **Agent**: Analyst Agent
- **Execution Flow**:
```
1. Analyst Agent receives user query + task description
2. LLM analyzes query: "What does user want?"
3. Agent decides which tools to use:
- For aggregations → analyze_nba_data()
- For searches → search_nba_data() or semantic_search_nba_data()
- For overview → get_nba_data_summary()
4. Tool Execution Examples:
Example A: "Top 5 three-point shooters"
- Agent generates pandas code:
df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
- analyze_nba_data() executes code
- Returns DataFrame with results
- LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
Example B: "Find LeBron James games"
- Agent uses search_nba_data(query="LeBron James")
- Tool filters CSV, returns matching rows
- LLM analyzes results, provides insights
Example C: "High scoring games"
- Agent uses semantic_search_nba_data("high scoring games")
- Vector DB finds semantically similar records
- Returns top matches with similarity scores
- LLM provides analysis
5. LLM generates final analysis report
6. Task complete → Output stored
```
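The aggregation in Example A can be mirrored without pandas. This stdlib sketch does the equivalent of `df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(n)`; the sample rows are made up:

```python
from collections import defaultdict

def top_n_by_sum(rows, group_key, value_key, n=5):
    """Group rows by group_key, sum value_key per group, and return
    the n groups with the largest totals (descending)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += float(row[value_key])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

games = [
    {"Player": "A", "3P": "4"},
    {"Player": "B", "3P": "7"},
    {"Player": "A", "3P": "5"},
]
print(top_n_by_sum(games, "Player", "3P", n=2))  # [('A', 9.0), ('B', 7.0)]
```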
#### Step 3.4: Task 3 - Storytelling (`tasks.py`)
- **File**: `tasks.py` Lines 98-130 (create_storyteller_task)
- **Agent**: Storyteller Agent
- **Dependency**: Waits for Analyst task to complete
- **Execution Flow**:
```
1. Storyteller Agent receives Analyst's output as context
2. LLM processes: "Create engaging headline and story"
3. No tools used (only LLM)
4. LLM generates:
- Catchy headline
- Engaging narrative
- Context and insights
5. Task complete → Output stored
```
---
### **Phase 4: Tool Execution Details**
#### Tool 1: `read_nba_data(limit)` (`tools.py` Lines 22-30)
```
Input: limit (number of rows)
Execution:
1. pd.read_csv(data_path)
2. df.head(limit)
3. Format as string
Output: Sample rows with column names
```
#### Tool 2: `search_nba_data(query, column, value)` (`tools.py` Lines 32-71)
```
Input: query (text), column (name), value (filter)
Execution:
1. pd.read_csv(data_path)
2. Apply filters if provided
3. Text search across columns
4. Limit to 50 rows max
Output: Filtered DataFrame as string
```
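The filter-then-search behavior described above can be sketched in plain Python (the real tool operates on a pandas DataFrame; function name and data here are illustrative):

```python
def search_rows(rows, query=None, column=None, value=None, max_rows=50):
    """Sketch of a search tool: optional exact column filter, plus a
    case-insensitive text search across all fields, capped at max_rows."""
    results = []
    for row in rows:
        # Exact-match column filter, if both column and value were given
        if column and value and str(row.get(column, "")) != str(value):
            continue
        # Free-text search across every field
        if query and not any(query.lower() in str(v).lower() for v in row.values()):
            continue
        results.append(row)
        if len(results) >= max_rows:
            break
    return results

rows = [
    {"Player": "LeBron James", "Team": "LAL"},
    {"Player": "Stephen Curry", "Team": "GSW"},
]
print(search_rows(rows, query="lebron"))
# [{'Player': 'LeBron James', 'Team': 'LAL'}]
```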
#### Tool 3: `get_nba_data_summary()` (`tools.py` Lines 73-94)
```
Input: None
Execution:
1. pd.read_csv(data_path)
2. Calculate: total rows, columns, unique players/teams
3. Get date range
4. Identify numeric columns
5. Show sample rows
Output: Comprehensive dataset summary
```
#### Tool 4: `semantic_search_nba_data(query)` (`tools.py` Lines 135-175)
```
Input: query (natural language)
Execution:
1. Get vector_db instance (vector_db.py)
2. Check if indexed (if not, index CSV)
3. Generate embedding for query
4. Search in ChromaDB
5. Return top N similar records
6. Load original CSV rows
Output: Similar records with metadata
```
**Vector DB Indexing** (`vector_db.py` Lines 94-156):
```
First time only:
1. Load SentenceTransformer model
2. Read CSV
3. For each row:
- Convert to text: "Player: X, Team: Y, Points: Z..."
- Generate embedding
- Store in ChromaDB with metadata
4. Persist to disk (chroma_db/)
```
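The row-to-text step in the indexing flow above can be sketched as a one-line flattening of each CSV row into the descriptive string the embedding model encodes (function name is illustrative):

```python
def row_to_document(row: dict) -> str:
    """Flatten one CSV row into a single "key: value" string, the form
    described above ("Player: X, Team: Y, Points: Z...") that gets embedded
    and stored in the vector DB alongside its metadata."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())

row = {"Player": "Stephen Curry", "Team": "GSW", "PTS": 30}
print(row_to_document(row))  # Player: Stephen Curry, Team: GSW, PTS: 30
```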
#### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py` Lines 203-253)
```
Input: pandas_code (string of pandas operations)
Execution:
1. Load CSV into DataFrame 'df'
2. Create safe namespace: {'pd': pandas, 'df': df}
3. Execute: exec(f"result = {pandas_code}", namespace)
4. Get result from namespace
5. Format output:
- DataFrame → to_string()
- Series → to_string()
- Limit to 50 rows if large
Output: Analysis results as string
```
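The exec-in-a-namespace pattern in step 3 can be sketched as below, using a plain list in place of the DataFrame. Note that executing LLM-generated code this way is inherently risky; restricting `__builtins__` as shown limits, but does not fully sandbox, what the expression can do:

```python
def execute_expression(expression: str, df):
    """Sketch of the analyze_nba_data pattern: evaluate a single expression
    in a restricted namespace that exposes only the dataset ('df') and a
    small whitelist of builtins, then pull 'result' back out."""
    namespace = {
        "df": df,
        # Only these builtins are reachable from the expression
        "__builtins__": {"sum": sum, "sorted": sorted, "len": len},
    }
    exec(f"result = ({expression})", namespace)
    return namespace["result"]

print(execute_expression("sum(df) / len(df)", [10, 20, 30]))  # 20.0
```

In the real tool the namespace would be `{'pd': pandas, 'df': dataframe}` and the expression a pandas snippet such as the groupby example above.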
---
### **Phase 5: LLM Interaction**
#### LLM Call Flow (`config.py` → LLM API)
```
1. Agent needs to process task
2. Calls llm.call(prompt, ...)
3. config.py routes to provider:
Hugging Face:
- Format: huggingface/{model_name}
- API: https://api-inference.huggingface.co
- Request: POST with prompt
- Response: Generated text
Ollama:
- Base URL: http://localhost:11434/v1
- OpenAI-compatible API
- Request: POST /chat/completions
- Response: Generated text
OpenRouter:
- Base URL: https://openrouter.ai/api/v1
- Request: POST with model name
- Response: Generated text
4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)
```
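The provider routing in step 3 can be sketched as environment-variable-driven branching. This is a simplified stand-in for `get_llm()`: it returns a plain config dict rather than an LLM instance, with defaults and URLs taken from this document's configuration section:

```python
import os

def resolve_llm_config() -> dict:
    """Pick provider settings from environment variables, mirroring the
    branching described above. Returns a config dict (sketch only)."""
    provider = os.getenv("LLM_PROVIDER", "huggingface")
    if provider == "huggingface":
        model = os.getenv("HF_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
        return {"model": f"huggingface/{model}"}
    if provider == "ollama":
        return {
            "model": os.getenv("OLLAMA_MODEL", "mistral"),
            "base_url": "http://localhost:11434/v1",  # OpenAI-compatible API
        }
    if provider == "openrouter":
        return {
            "model": os.getenv("OPENROUTER_MODEL", "google/gemma-2-2b-it:free"),
            "base_url": "https://openrouter.ai/api/v1",
        }
    raise ValueError(f"Unknown LLM_PROVIDER: {provider}")

print(resolve_llm_config())
```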
---
### **Phase 6: Result Aggregation**
#### Result Collection (`app.py` Lines 39-80)
```
After crew.kickoff() completes:
1. Extract task outputs:
- result.tasks_output[0] → Engineer result
- result.tasks_output[1] → Analyst result
- result.tasks_output[2] → Storyteller result
2. Format output:
- Add headers: "## Engineer Agent Results"
- Add separators: "---"
- Combine all outputs
3. Store engineer result for reuse
4. Return formatted string to Gradio
```
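The formatting step can be sketched as joining each task's output under its header with `---` separators, as described above (function name is illustrative; the header strings follow the ones this document shows):

```python
def format_crew_output(task_outputs) -> str:
    """Label each task's output with a markdown header and join the
    sections with horizontal rules, mirroring the aggregation in app.py."""
    headers = [
        "Engineer Agent Results",
        "Analyst Agent Results",
        "Storyteller Agent Results",
    ]
    sections = [
        f"## {header}\n\n{output}"
        for header, output in zip(headers, task_outputs)
    ]
    return "\n\n---\n\n".join(sections)

print(format_crew_output(["Dataset loaded.", "Top 5 shooters: ...", "Headline: ..."]))
```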
#### Gradio Display (`app.py` Lines 200-340)
```
1. User sees results in output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions
```
---
## 🔄 Task Execution Order
### How Tasks Run (`crew.py` Lines 69-104)
```
Time →
│
├─ Task 1: Engineer (runs first)
│   └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst (runs second)
│   └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (runs last)
    └─ Uses: LLM only (no tools)
```
**Key Points**:
- The crew is created with `process=Process.sequential`, so tasks run **one after another** in list order
- Engineer and Analyst are independent of each other; their position in the task list alone determines their order
- Storyteller receives the Analyst's output as context, so it must run **after** the Analyst
- CrewAI handles task ordering and context passing automatically
---
## 📊 Data Flow Diagram
```
CSV File
↓
[pandas.read_csv()]
↓
DataFrame
↓
├─→ Tools (read, search, analyze)
│     ↓
│   Results → Agent → LLM → Response
│
└─→ Vector DB (semantic search)
      ↓
[SentenceTransformer]
      ↓
Embeddings
      ↓
[ChromaDB]
      ↓
Similar Records → Agent → LLM → Response
```
---
## 🎯 Example: Complete Execution Trace
### Input:
- CSV: `nba24-25.csv`
- Query: "Who are the top 5 three-point shooters?"
### Execution:
1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
3. **agents.py**: Create Engineer, Analyst, Storyteller agents
4. **config.py**: `get_llm()` → Returns Hugging Face LLM
5. **crew.kickoff()** starts
6. **Task 1 (Engineer)**:
- Agent: "I need to check the dataset"
- Tool: `get_nba_data_summary()`
- Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
- LLM: "Dataset loaded. 5000 rows, ready for analysis."
7. **Task 2 (Analyst)** - Runs after the Engineer task:
- Agent: "User wants top 5 three-point shooters"
- Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
- Execution:
```python
df = pd.read_csv("nba24-25.csv")
result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
# Returns: Player1: 250, Player2: 245, ...
```
- LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
8. **Task 3 (Storyteller)** - After Analyst:
- Agent receives Analyst output
- LLM: "πŸ€ **Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."
9. **app.py**: Combine all outputs
10. **Gradio**: Display to user
---
## 🔧 Key Configuration Points
### LLM Provider Selection (`config.py`)
- Environment variable: `LLM_PROVIDER`
- Options: `huggingface`, `ollama`, `openrouter`, `openai`
- Default: `huggingface`
### Model Selection
- Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
- Ollama: `OLLAMA_MODEL` (default: `mistral`)
- OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)
### Data Path
- Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
- Can be overridden by uploaded file
---
## πŸ› Error Handling
### At Each Level:
1. **app.py** (Lines 82-86):
- Try/except around `crew.kickoff()`
- Returns error message with traceback
2. **Tools** (tools.py):
- Each tool has try/except
- Returns error message if fails
3. **Vector DB** (vector_db.py):
- Handles missing files
- Creates directory if needed
- Handles indexing errors
4. **LLM** (config.py):
- Validates API keys
- Raises ValueError if missing
- Handles API errors
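The per-tool error handling (point 2 above) can be sketched as a decorator that converts exceptions into readable error strings, so a failing tool reports the problem to the agent instead of crashing the run (names below are illustrative, not the actual implementation):

```python
import functools

def returns_error_string(func):
    """Wrap a tool so that any exception becomes an error message string
    the agent (and ultimately the user) can read, instead of a crash."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return f"Error in {func.__name__}: {exc}"
    return wrapper

@returns_error_string
def read_csv_head(path: str, limit: int = 5) -> str:
    # Raises FileNotFoundError if the path is wrong; the decorator
    # turns that into an error string
    with open(path) as f:
        return "".join(f.readlines()[:limit])

print(read_csv_head("does_not_exist.csv"))  # prints an error string, no exception
```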
---
## πŸ“ Summary
**Input Flow**:
```
User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
```
**Output Flow**:
```
LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
```
**Key Points**:
- All agents share the same LLM instance
- Tools are stateless (read CSV each time)
- Vector DB is persistent (indexed once, reused)
- Tasks run sequentially (`Process.sequential`), with later tasks receiving earlier outputs as context
- Results are aggregated and formatted in app.py
---
**Last Updated**: Based on current codebase structure
**Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py