# Detailed Execution Flow - NBA Analysis Application
This document explains, step by step, how user input flows through the application and how each stage processes it.
---
## 🎯 High-Level Flow Overview
```
User Input (CSV + Query)
↓
app.py (Gradio Interface)
↓
crew.py (CrewAI Orchestration)
↓
agents.py (AI Agents)
↓
tasks.py (Task Definitions)
↓
tools.py (Data Access Tools)
↓
vector_db.py / pandas (Data Processing)
↓
config.py (LLM Configuration)
↓
LLM API (Hugging Face / Ollama / etc.)
↓
Results → User
```
---
## 📋 Detailed Step-by-Step Execution
### **Phase 1: User Input & Initialization**
#### Step 1.1: User Interaction (`app.py`)
- **File**: `app.py`
- **Function**: `process_file_and_analyze()` or `process_question_only()`
- **Input**:
- CSV file (uploaded via Gradio)
- User query (optional text)
- **What happens**:
```python
# Line 23-24: Validate file exists
if file is None:
    return "Please upload a CSV file."

# Line 27-28: Set default query if empty
if not user_query:
    user_query = "Provide comprehensive analysis..."

# Line 32-33: Extract file path
file_path = file.name
csv_path = file_path
```
#### Step 1.2: Crew Creation (`crew.py`)
- **File**: `crew.py`
- **Function**: `create_flow_crew(user_query, csv_path)`
- **What happens**:
```python
# Line 82-84: Create all agents
engineer_agent = create_engineer_agent(csv_path)
analyst_agent = create_analyst_agent(csv_path)
storyteller_agent = create_storyteller_agent()
# Line 88-94: Create tasks
data_engineering_task = create_data_engineering_task(...)
custom_analysis_task = create_custom_analysis_task(...)
storyteller_task = create_storyteller_task(...)
# Line 99-104: Create Crew with agents and tasks
return Crew(agents=[...], tasks=[...], process=Process.sequential)
```
---
### **Phase 2: Agent Initialization**
#### Step 2.1: LLM Configuration (`config.py`)
- **File**: `config.py`
- **Function**: `get_llm()`
- **What happens**:
```python
# Line 13: Check provider (default: "huggingface")
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
# Line 54-64: Create LLM instance based on provider
if LLM_PROVIDER == "huggingface":
    return LLM(
        model=f"huggingface/{HF_MODEL}",
        api_key=HF_API_KEY,
    )
# Similar for ollama, openrouter, etc.
```
- **Output**: Configured LLM instance (used by all agents)
#### Step 2.2: Agent Creation (`agents.py`)
- **File**: `agents.py`
- **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
- **What happens**:
**Engineer Agent** (Lines 12-36):
```python
# Line 22-23: Get data path and tools
data_path = csv_path or NBA_DATA_PATH
agent_tools = get_agent_tools(data_path)
# Line 25-36: Create agent with:
#   role: "Data Engineer"
#   goal: process and clean data
#   backstory: expert data engineer description
#   llm: shared LLM instance
#   tools: data access tools (read, search, analyze)
```
**Analyst Agent** (Lines 39-69):
```python
# Similar structure, but with:
#   role: "Data Analyst"
#   goal: extract insights and patterns
#   backstory: includes instructions to use analyze_nba_data for aggregations
#   tools: same data tools
```
**Storyteller Agent** (Lines 72-93):
```python
#   role: "Sports Storyteller"
#   goal: create engaging headlines from analysis
#   tools: [] (no data tools, only uses the LLM)
```
#### Step 2.3: Tools Initialization (`tools.py`)
- **File**: `tools.py`
- **Function**: `get_agent_tools(data_path)`
- **What happens**:
```python
# Returns a list of 5 tools:
#   1. read_nba_data(limit)                  - read sample rows
#   2. search_nba_data(query, column, value) - filter/search the CSV
#   3. get_nba_data_summary()                - dataset overview
#   4. semantic_search_nba_data(query)       - vector search
#   5. analyze_nba_data(pandas_code)         - execute pandas operations
```
- **Note**: Each tool is wrapped with `@tool` decorator for CrewAI
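Conceptually, a `@tool` decorator attaches a name and description that the framework shows to the LLM so it can decide when to call the tool. The following is a simplified stdlib stand-in for that pattern, not CrewAI's actual implementation; the tool body is a placeholder:

```python
import functools

def tool(description):
    """Simplified stand-in for a framework @tool decorator: it attaches
    metadata (name, description) that an agent framework would expose
    to the LLM when listing available tools."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        # Metadata the agent framework reads when describing the tool
        wrapper.name = func.__name__
        wrapper.description = description
        return wrapper
    return decorator

@tool("Read a sample of rows from the NBA dataset")
def read_nba_data(limit: int = 5) -> str:
    # Placeholder body; the real tool reads the CSV with pandas
    return f"(first {limit} rows of the CSV would be returned here)"

print(read_nba_data.name)         # read_nba_data
print(read_nba_data.description)  # Read a sample of rows from the NBA dataset
```

The real CrewAI decorator also generates an argument schema from the function signature; the sketch only captures the metadata idea.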
---
### **Phase 3: Task Execution**
#### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)
- **File**: `app.py` Line 36-37
- **What happens**:
```python
crew = create_flow_crew(user_query.strip(), csv_path)
result = crew.kickoff() # This triggers execution
```
#### Step 3.2: Task 1 - Data Engineering (`tasks.py`)
- **File**: `tasks.py` Lines 8-40
- **Task**: `create_data_engineering_task()`
- **Agent**: Engineer Agent
- **Execution Flow**:
```
1. Engineer Agent receives task description
2. LLM processes task: "Examine dataset, get summary..."
3. Agent decides to use: get_nba_data_summary()
4. Tool execution (tools.py):
- Reads CSV with pandas
- Calculates stats (rows, columns, unique values)
- Returns formatted summary
5. LLM receives tool output
6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
7. Task complete → Output stored
```
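The summary step above can be sketched with only the stdlib `csv` module (the real tool uses pandas; the function name and sample data below are illustrative):

```python
import csv
import io

def summarize_csv(csv_text: str) -> dict:
    """Minimal sketch of what a summary tool like get_nba_data_summary()
    computes: row count, column names, and unique values for a key column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    players = {row["Player"] for row in rows if "Player" in row}
    return {
        "total_rows": len(rows),
        "columns": columns,
        "unique_players": len(players),
    }

sample = (
    "Player,Team,PTS\n"
    "LeBron James,LAL,25\n"
    "Stephen Curry,GSW,30\n"
    "LeBron James,LAL,28\n"
)
print(summarize_csv(sample))
# {'total_rows': 3, 'columns': ['Player', 'Team', 'PTS'], 'unique_players': 2}
```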
#### Step 3.3: Task 2 - Data Analysis (`tasks.py`)
- **File**: `tasks.py` Lines 55-95 (create_custom_analysis_task)
- **Agent**: Analyst Agent
- **Execution Flow**:
```
1. Analyst Agent receives user query + task description
2. LLM analyzes query: "What does user want?"
3. Agent decides which tools to use:
- For aggregations → analyze_nba_data()
- For searches → search_nba_data() or semantic_search_nba_data()
- For overview → get_nba_data_summary()
4. Tool Execution Examples:
Example A: "Top 5 three-point shooters"
- Agent generates pandas code:
df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
- analyze_nba_data() executes code
- Returns DataFrame with results
- LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
Example B: "Find LeBron James games"
- Agent uses search_nba_data(query="LeBron James")
- Tool filters CSV, returns matching rows
- LLM analyzes results, provides insights
Example C: "High scoring games"
- Agent uses semantic_search_nba_data("high scoring games")
- Vector DB finds semantically similar records
- Returns top matches with similarity scores
- LLM provides analysis
5. LLM generates final analysis report
6. Task complete → Output stored
```
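The aggregation in Example A can be mirrored without pandas. This stdlib sketch does the equivalent of `df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(n)`; the sample rows are made up:

```python
from collections import defaultdict

def top_n_by_sum(rows, group_key, value_key, n=5):
    """Group rows by group_key, sum value_key per group, and return
    the n groups with the largest totals (descending)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += float(row[value_key])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

games = [
    {"Player": "A", "3P": "4"},
    {"Player": "B", "3P": "7"},
    {"Player": "A", "3P": "5"},
]
print(top_n_by_sum(games, "Player", "3P", n=2))  # [('A', 9.0), ('B', 7.0)]
```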
#### Step 3.4: Task 3 - Storytelling (`tasks.py`)
- **File**: `tasks.py` Lines 98-130 (create_storyteller_task)
- **Agent**: Storyteller Agent
- **Dependency**: Waits for Analyst task to complete
- **Execution Flow**:
```
1. Storyteller Agent receives Analyst's output as context
2. LLM processes: "Create engaging headline and story"
3. No tools used (only LLM)
4. LLM generates:
- Catchy headline
- Engaging narrative
- Context and insights
5. Task complete → Output stored
```
---
### **Phase 4: Tool Execution Details**
#### Tool 1: `read_nba_data(limit)` (`tools.py` Lines 22-30)
```
Input: limit (number of rows)
Execution:
1. pd.read_csv(data_path)
2. df.head(limit)
3. Format as string
Output: Sample rows with column names
```
#### Tool 2: `search_nba_data(query, column, value)` (`tools.py` Lines 32-71)
```
Input: query (text), column (name), value (filter)
Execution:
1. pd.read_csv(data_path)
2. Apply filters if provided
3. Text search across columns
4. Limit to 50 rows max
Output: Filtered DataFrame as string
```
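The filter-then-search behavior described above can be sketched in plain Python (the real tool operates on a pandas DataFrame; function name and data here are illustrative):

```python
def search_rows(rows, query=None, column=None, value=None, max_rows=50):
    """Sketch of a search tool: optional exact column filter, plus a
    case-insensitive text search across all fields, capped at max_rows."""
    results = []
    for row in rows:
        # Exact-match column filter, if both column and value were given
        if column and value and str(row.get(column, "")) != str(value):
            continue
        # Free-text search across every field
        if query and not any(query.lower() in str(v).lower() for v in row.values()):
            continue
        results.append(row)
        if len(results) >= max_rows:
            break
    return results

rows = [
    {"Player": "LeBron James", "Team": "LAL"},
    {"Player": "Stephen Curry", "Team": "GSW"},
]
print(search_rows(rows, query="lebron"))
# [{'Player': 'LeBron James', 'Team': 'LAL'}]
```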
#### Tool 3: `get_nba_data_summary()` (`tools.py` Lines 73-94)
```
Input: None
Execution:
1. pd.read_csv(data_path)
2. Calculate: total rows, columns, unique players/teams
3. Get date range
4. Identify numeric columns
5. Show sample rows
Output: Comprehensive dataset summary
```
#### Tool 4: `semantic_search_nba_data(query)` (`tools.py` Lines 135-175)
```
Input: query (natural language)
Execution:
1. Get vector_db instance (vector_db.py)
2. Check if indexed (if not, index CSV)
3. Generate embedding for query
4. Search in ChromaDB
5. Return top N similar records
6. Load original CSV rows
Output: Similar records with metadata
```
**Vector DB Indexing** (`vector_db.py` Lines 94-156):
```
First time only:
1. Load SentenceTransformer model
2. Read CSV
3. For each row:
- Convert to text: "Player: X, Team: Y, Points: Z..."
- Generate embedding
- Store in ChromaDB with metadata
4. Persist to disk (chroma_db/)
```
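The row-to-text step in the indexing flow above can be sketched as a one-line flattening of each CSV row into the descriptive string the embedding model encodes (function name is illustrative):

```python
def row_to_document(row: dict) -> str:
    """Flatten one CSV row into a single "key: value" string, the form
    described above ("Player: X, Team: Y, Points: Z...") that gets embedded
    and stored in the vector DB alongside its metadata."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())

row = {"Player": "Stephen Curry", "Team": "GSW", "PTS": 30}
print(row_to_document(row))  # Player: Stephen Curry, Team: GSW, PTS: 30
```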
#### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py` Lines 203-253)
```
Input: pandas_code (string of pandas operations)
Execution:
1. Load CSV into DataFrame 'df'
2. Create safe namespace: {'pd': pandas, 'df': df}
3. Execute: exec(f"result = {pandas_code}", namespace)
4. Get result from namespace
5. Format output:
- DataFrame → to_string()
- Series → to_string()
- Limit to 50 rows if large
Output: Analysis results as string
```
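The exec-in-a-namespace pattern in step 3 can be sketched as below, using a plain list in place of the DataFrame. Note that executing LLM-generated code this way is inherently risky; restricting `__builtins__` as shown limits, but does not fully sandbox, what the expression can do:

```python
def execute_expression(expression: str, df):
    """Sketch of the analyze_nba_data pattern: evaluate a single expression
    in a restricted namespace that exposes only the dataset ('df') and a
    small whitelist of builtins, then pull 'result' back out."""
    namespace = {
        "df": df,
        # Only these builtins are reachable from the expression
        "__builtins__": {"sum": sum, "sorted": sorted, "len": len},
    }
    exec(f"result = ({expression})", namespace)
    return namespace["result"]

print(execute_expression("sum(df) / len(df)", [10, 20, 30]))  # 20.0
```

In the real tool the namespace would be `{'pd': pandas, 'df': dataframe}` and the expression a pandas snippet such as the groupby example above.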
---
### **Phase 5: LLM Interaction**
#### LLM Call Flow (`config.py` → LLM API)
```
1. Agent needs to process task
2. Calls llm.call(prompt, ...)
3. config.py routes to provider:
Hugging Face:
- Format: huggingface/{model_name}
- API: https://api-inference.huggingface.co
- Request: POST with prompt
- Response: Generated text
Ollama:
- Base URL: http://localhost:11434/v1
- OpenAI-compatible API
- Request: POST /chat/completions
- Response: Generated text
OpenRouter:
- Base URL: https://openrouter.ai/api/v1
- Request: POST with model name
- Response: Generated text
4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)
```
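The provider routing in step 3 can be sketched as environment-variable-driven branching. This is a simplified stand-in for `get_llm()`: it returns a plain config dict rather than an LLM instance, with defaults and URLs taken from this document's configuration section:

```python
import os

def resolve_llm_config() -> dict:
    """Pick provider settings from environment variables, mirroring the
    branching described above. Returns a config dict (sketch only)."""
    provider = os.getenv("LLM_PROVIDER", "huggingface")
    if provider == "huggingface":
        model = os.getenv("HF_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
        return {"model": f"huggingface/{model}"}
    if provider == "ollama":
        return {
            "model": os.getenv("OLLAMA_MODEL", "mistral"),
            "base_url": "http://localhost:11434/v1",  # OpenAI-compatible API
        }
    if provider == "openrouter":
        return {
            "model": os.getenv("OPENROUTER_MODEL", "google/gemma-2-2b-it:free"),
            "base_url": "https://openrouter.ai/api/v1",
        }
    raise ValueError(f"Unknown LLM_PROVIDER: {provider}")

print(resolve_llm_config())
```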
---
### **Phase 6: Result Aggregation**
#### Result Collection (`app.py` Lines 39-80)
```
After crew.kickoff() completes:
1. Extract task outputs:
- result.tasks_output[0] → Engineer result
- result.tasks_output[1] → Analyst result
- result.tasks_output[2] → Storyteller result
2. Format output:
- Add headers: "## Engineer Agent Results"
- Add separators: "---"
- Combine all outputs
3. Store engineer result for reuse
4. Return formatted string to Gradio
```
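The formatting step can be sketched as joining each task's output under its header with `---` separators, as described above (function name is illustrative; the header strings follow the ones this document shows):

```python
def format_crew_output(task_outputs) -> str:
    """Label each task's output with a markdown header and join the
    sections with horizontal rules, mirroring the aggregation in app.py."""
    headers = [
        "Engineer Agent Results",
        "Analyst Agent Results",
        "Storyteller Agent Results",
    ]
    sections = [
        f"## {header}\n\n{output}"
        for header, output in zip(headers, task_outputs)
    ]
    return "\n\n---\n\n".join(sections)

print(format_crew_output(["Dataset loaded.", "Top 5 shooters: ...", "Headline: ..."]))
```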
#### Gradio Display (`app.py` Lines 200-340)
```
1. User sees results in output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions
```
---
## 🔄 Task Execution Order
### How Tasks Run (`crew.py` Lines 69-104)
```
Time →
│
├─ Task 1: Engineer (runs first)
│   └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst (runs second)
│   └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (runs last)
    └─ Uses: LLM only (no tools)
```
**Key Points**:
- The crew is created with `process=Process.sequential`, so tasks run **one after another** in list order
- Engineer and Analyst are independent of each other; their position in the task list alone determines their order
- Storyteller receives the Analyst's output as context, so it must run **after** the Analyst
- CrewAI handles task ordering and context passing automatically
---
## 📊 Data Flow Diagram
```
CSV File
↓
[pandas.read_csv()]
↓
DataFrame
↓
├─→ Tools (read, search, analyze)
│     ↓
│   Results → Agent → LLM → Response
│
└─→ Vector DB (semantic search)
      ↓
[SentenceTransformer]
      ↓
Embeddings
      ↓
[ChromaDB]
      ↓
Similar Records → Agent → LLM → Response
```
---
## 🎯 Example: Complete Execution Trace
### Input:
- CSV: `nba24-25.csv`
- Query: "Who are the top 5 three-point shooters?"
### Execution:
1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
3. **agents.py**: Create Engineer, Analyst, Storyteller agents
4. **config.py**: `get_llm()` → Returns Hugging Face LLM
5. **crew.kickoff()** starts
6. **Task 1 (Engineer)**:
- Agent: "I need to check the dataset"
- Tool: `get_nba_data_summary()`
- Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
- LLM: "Dataset loaded. 5000 rows, ready for analysis."
7. **Task 2 (Analyst)** - Runs after the Engineer task:
- Agent: "User wants top 5 three-point shooters"
- Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
- Execution:
```python
df = pd.read_csv("nba24-25.csv")
result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
# Returns: Player1: 250, Player2: 245, ...
```
- LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
8. **Task 3 (Storyteller)** - After Analyst:
- Agent receives Analyst output
- LLM: "πŸ€ **Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."
9. **app.py**: Combine all outputs
10. **Gradio**: Display to user
---
## 🔧 Key Configuration Points
### LLM Provider Selection (`config.py`)
- Environment variable: `LLM_PROVIDER`
- Options: `huggingface`, `ollama`, `openrouter`, `openai`
- Default: `huggingface`
### Model Selection
- Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
- Ollama: `OLLAMA_MODEL` (default: `mistral`)
- OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)
### Data Path
- Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
- Can be overridden by uploaded file
---
## πŸ› Error Handling
### At Each Level:
1. **app.py** (Lines 82-86):
- Try/except around `crew.kickoff()`
- Returns error message with traceback
2. **Tools** (tools.py):
- Each tool has try/except
- Returns error message if fails
3. **Vector DB** (vector_db.py):
- Handles missing files
- Creates directory if needed
- Handles indexing errors
4. **LLM** (config.py):
- Validates API keys
- Raises ValueError if missing
- Handles API errors
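The per-tool error handling (point 2 above) can be sketched as a decorator that converts exceptions into readable error strings, so a failing tool reports the problem to the agent instead of crashing the run (names below are illustrative, not the actual implementation):

```python
import functools

def returns_error_string(func):
    """Wrap a tool so that any exception becomes an error message string
    the agent (and ultimately the user) can read, instead of a crash."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            return f"Error in {func.__name__}: {exc}"
    return wrapper

@returns_error_string
def read_csv_head(path: str, limit: int = 5) -> str:
    # Raises FileNotFoundError if the path is wrong; the decorator
    # turns that into an error string
    with open(path) as f:
        return "".join(f.readlines()[:limit])

print(read_csv_head("does_not_exist.csv"))  # prints an error string, no exception
```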
---
## πŸ“ Summary
**Input Flow**:
```
User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
```
**Output Flow**:
```
LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
```
**Key Points**:
- All agents share the same LLM instance
- Tools are stateless (read CSV each time)
- Vector DB is persistent (indexed once, reused)
- Tasks run sequentially (`Process.sequential`), with later tasks receiving earlier outputs as context
- Results are aggregated and formatted in app.py
---
**Last Updated**: Based on current codebase structure
**Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py