NBA_Analysis / EXECUTION_FLOW.md
Detailed Execution Flow - NBA Analysis Application

This document explains, step by step, how user input flows through the application and how each stage executes.


🎯 High-Level Flow Overview

User Input (CSV + Query)
    ↓
app.py (Gradio Interface)
    ↓
crew.py (CrewAI Orchestration)
    ↓
agents.py (AI Agents)
    ↓
tasks.py (Task Definitions)
    ↓
tools.py (Data Access Tools)
    ↓
vector_db.py / pandas (Data Processing)
    ↓
config.py (LLM Configuration)
    ↓
LLM API (Hugging Face / Ollama / etc.)
    ↓
Results → User

📋 Detailed Step-by-Step Execution

Phase 1: User Input & Initialization

Step 1.1: User Interaction (app.py)

  • File: app.py
  • Function: process_file_and_analyze() or process_question_only()
  • Input:
    • CSV file (uploaded via Gradio)
    • User query (optional text)
  • What happens:
    # Line 23-24: Validate file exists
    if file is None:
        return "Please upload a CSV file."
    
    # Line 27-28: Set default query if empty
    if not user_query:
        user_query = "Provide comprehensive analysis..."
    
    # Line 32-33: Extract file path
    file_path = file.name
    csv_path = file_path
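The validation above can be sketched as a plain function. The function name and the default-query wording here are illustrative, not the exact strings in app.py:

```python
def validate_inputs(file_path, user_query):
    """Stand-in for the checks in process_file_and_analyze():
    reject a missing file, fall back to a default query."""
    if file_path is None:
        return None, "Please upload a CSV file."
    if not user_query or not user_query.strip():
        # Placeholder default; the real string in app.py is longer
        user_query = "Provide comprehensive analysis of the dataset."
    return (file_path, user_query.strip()), None

inputs, error = validate_inputs("nba24-25.csv", "  top 5 scorers  ")
print(inputs, error)
```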
    

Step 1.2: Crew Creation (crew.py)

  • File: crew.py
  • Function: create_flow_crew(user_query, csv_path)
  • What happens:
    # Line 82-84: Create all agents
    engineer_agent = create_engineer_agent(csv_path)
    analyst_agent = create_analyst_agent(csv_path)
    storyteller_agent = create_storyteller_agent()
    
    # Line 88-94: Create tasks
    data_engineering_task = create_data_engineering_task(...)
    custom_analysis_task = create_custom_analysis_task(...)
    storyteller_task = create_storyteller_task(...)
    
    # Line 99-104: Create Crew with agents and tasks
    return Crew(agents=[...], tasks=[...], process=Process.sequential)
    

Phase 2: Agent Initialization

Step 2.1: LLM Configuration (config.py)

  • File: config.py
  • Function: get_llm()
  • What happens:
    # Line 13: Check provider (default: "huggingface")
    LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
    
    # Line 54-64: Create LLM instance based on provider
    if LLM_PROVIDER == "huggingface":
        return LLM(
            model=f"huggingface/{HF_MODEL}",
            api_key=HF_API_KEY
        )
    # Similar for ollama, openrouter, etc.
    
  • Output: Configured LLM instance (used by all agents)
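A minimal sketch of this provider dispatch, returning a plain dict instead of a CrewAI LLM instance. Env-var names and defaults mirror the ones listed under "Key Configuration Points"; the openrouter/openai branches are omitted:

```python
import os

def select_llm_config():
    """Sketch of get_llm()'s provider dispatch (dict instead of an
    actual CrewAI LLM object)."""
    provider = os.getenv("LLM_PROVIDER", "huggingface")
    if provider == "huggingface":
        model = os.getenv("HF_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
        return {"model": f"huggingface/{model}", "api_key": os.getenv("HF_API_KEY")}
    if provider == "ollama":
        return {"model": os.getenv("OLLAMA_MODEL", "mistral"),
                "base_url": "http://localhost:11434/v1"}
    raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")
```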

Step 2.2: Agent Creation (agents.py)

  • File: agents.py
  • Functions: create_engineer_agent(), create_analyst_agent(), create_storyteller_agent()
  • What happens:

Engineer Agent (Lines 12-36):

# Line 22-23: Get data path and tools
data_path = csv_path or NBA_DATA_PATH
agent_tools = get_agent_tools(data_path)

# Line 25-36: Create agent with:
- role: "Data Engineer"
- goal: Process and clean data
- backstory: Expert data engineer description
- llm: Shared LLM instance
- tools: Data access tools (read, search, analyze)

Analyst Agent (Lines 39-69):

# Similar structure but with:
- role: "Data Analyst"
- goal: Extract insights and patterns
- backstory: Includes instructions to use analyze_nba_data for aggregations
- tools: Same data tools

Storyteller Agent (Lines 72-93):

- role: "Sports Storyteller"
- goal: Create engaging headlines from analysis
- tools: [] (no data tools, only uses LLM)

Step 2.3: Tools Initialization (tools.py)

  • File: tools.py
  • Function: get_agent_tools(data_path)
  • What happens:
    # Returns list of 5 tools:
    1. read_nba_data(limit) - Read sample rows
    2. search_nba_data(query, column, value) - Filter/search CSV
    3. get_nba_data_summary() - Get dataset overview
    4. semantic_search_nba_data(query) - Vector search
    5. analyze_nba_data(pandas_code) - Execute pandas operations
    
  • Note: Each tool is wrapped with CrewAI's @tool decorator
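CrewAI's real @tool decorator does more than this, but conceptually the wrapping exposes a name and description that the agent reasons over when picking a tool. A toy stand-in:

```python
def tool(fn):
    # Toy stand-in for CrewAI's @tool decorator: expose the function's
    # name and docstring as the metadata the agent uses to pick a tool.
    fn.tool_name = fn.__name__
    fn.tool_description = (fn.__doc__ or "").strip()
    return fn

@tool
def get_nba_data_summary() -> str:
    """Return an overview of the NBA dataset: rows, columns, unique values."""
    return "Dataset summary..."
```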

Phase 3: Task Execution

Step 3.1: Crew Kickoff (app.py β†’ crew.py)

  • File: app.py Line 36-37
  • What happens:
    crew = create_flow_crew(user_query.strip(), csv_path)
    result = crew.kickoff()  # This triggers execution
    

Step 3.2: Task 1 - Data Engineering (tasks.py)

  • File: tasks.py Lines 8-40
  • Task: create_data_engineering_task()
  • Agent: Engineer Agent
  • Execution Flow:
    1. Engineer Agent receives task description
    2. LLM processes task: "Examine dataset, get summary..."
    3. Agent decides to use: get_nba_data_summary()
    4. Tool execution (tools.py):
       - Reads CSV with pandas
       - Calculates stats (rows, columns, unique values)
       - Returns formatted summary
    5. LLM receives tool output
    6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
    7. Task complete → Output stored
    

Step 3.3: Task 2 - Data Analysis (tasks.py)

  • File: tasks.py Lines 55-95 (create_custom_analysis_task)
  • Agent: Analyst Agent
  • Execution Flow:
    1. Analyst Agent receives user query + task description
    2. LLM analyzes query: "What does user want?"
    3. Agent decides which tools to use:
       - For aggregations → analyze_nba_data()
       - For searches → search_nba_data() or semantic_search_nba_data()
       - For overview → get_nba_data_summary()
    
    4. Tool Execution Examples:
    
    Example A: "Top 5 three-point shooters"
      - Agent generates pandas code:
        df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
      - analyze_nba_data() executes code
      - Returns the resulting Series, formatted as a string
      - LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
    
    Example B: "Find LeBron James games"
      - Agent uses search_nba_data(query="LeBron James")
      - Tool filters CSV, returns matching rows
      - LLM analyzes results, provides insights
    
    Example C: "High scoring games"
      - Agent uses semantic_search_nba_data("high scoring games")
      - Vector DB finds semantically similar records
      - Returns top matches with similarity scores
      - LLM provides analysis
    
    5. LLM generates final analysis report
    6. Task complete → Output stored
    

Step 3.4: Task 3 - Storytelling (tasks.py)

  • File: tasks.py Lines 98-130 (create_storyteller_task)
  • Agent: Storyteller Agent
  • Dependency: Waits for Analyst task to complete
  • Execution Flow:
    1. Storyteller Agent receives Analyst's output as context
    2. LLM processes: "Create engaging headline and story"
    3. No tools used (only LLM)
    4. LLM generates:
       - Catchy headline
       - Engaging narrative
       - Context and insights
    5. Task complete → Output stored
    

Phase 4: Tool Execution Details

Tool 1: read_nba_data(limit) (tools.py Lines 22-30)

Input: limit (number of rows)
Execution:
  1. pd.read_csv(data_path)
  2. df.head(limit)
  3. Format as string
Output: Sample rows with column names

Tool 2: search_nba_data(query, column, value) (tools.py Lines 32-71)

Input: query (text), column (name), value (filter)
Execution:
  1. pd.read_csv(data_path)
  2. Apply filters if provided
  3. Text search across columns
  4. Limit to 50 rows max
Output: Filtered DataFrame as string
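A hedged sketch of that filter-then-search logic with pandas (function and parameter names are illustrative, not the exact tools.py code):

```python
import pandas as pd

def search_rows(df, query="", column=None, value=None, max_rows=50):
    """Sketch of search_nba_data's logic: exact column filter first,
    then a case-insensitive text search across all columns."""
    out = df
    if column and value is not None:
        out = out[out[column].astype(str) == str(value)]
    if query:
        mask = out.apply(
            lambda row: row.astype(str).str.contains(query, case=False).any(),
            axis=1,
        )
        out = out[mask]
    return out.head(max_rows)  # cap output at 50 rows, as the tool does
```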

Tool 3: get_nba_data_summary() (tools.py Lines 73-94)

Input: None
Execution:
  1. pd.read_csv(data_path)
  2. Calculate: total rows, columns, unique players/teams
  3. Get date range
  4. Identify numeric columns
  5. Show sample rows
Output: Comprehensive dataset summary

Tool 4: semantic_search_nba_data(query) (tools.py Lines 135-175)

Input: query (natural language)
Execution:
  1. Get vector_db instance (vector_db.py)
  2. Check if indexed (if not, index CSV)
  3. Generate embedding for query
  4. Search in ChromaDB
  5. Return top N similar records
  6. Load original CSV rows
Output: Similar records with metadata

Vector DB Indexing (vector_db.py Lines 94-156):

First time only:
  1. Load SentenceTransformer model
  2. Read CSV
  3. For each row:
     - Convert to text: "Player: X, Team: Y, Points: Z..."
     - Generate embedding
     - Store in ChromaDB with metadata
  4. Persist to disk (chroma_db/)
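The row-to-text step might look like this; the exact field formatting in vector_db.py may differ:

```python
def row_to_text(row: dict) -> str:
    # One CSV row becomes one sentence-like string, which the
    # SentenceTransformer model then embeds and ChromaDB stores.
    return ", ".join(f"{key}: {value}" for key, value in row.items())

row = {"Player": "LeBron James", "Team": "LAL", "PTS": 30}
print(row_to_text(row))  # Player: LeBron James, Team: LAL, PTS: 30
```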

Tool 5: analyze_nba_data(pandas_code) (tools.py Lines 203-253)

Input: pandas_code (string of pandas operations)
Execution:
  1. Load CSV into DataFrame 'df'
  2. Create safe namespace: {'pd': pandas, 'df': df}
  3. Execute: exec(f"result = {pandas_code}", namespace)
  4. Get result from namespace
  5. Format output:
     - DataFrame → to_string()
     - Series → to_string()
     - Limit to 50 rows if large
Output: Analysis results as string
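A runnable sketch of this pattern. Note that exec on LLM-generated code is inherently risky; a production version should also restrict builtins and sandbox the call:

```python
import pandas as pd

def analyze(df, pandas_code, max_rows=50):
    """Sketch of analyze_nba_data: evaluate a pandas expression against
    `df` in a small namespace and return the result as a string."""
    namespace = {"pd": pd, "df": df}
    exec(f"result = {pandas_code}", namespace)  # same pattern as tools.py
    result = namespace["result"]
    if isinstance(result, (pd.DataFrame, pd.Series)):
        return result.head(max_rows).to_string()
    return str(result)

df = pd.DataFrame({"Player": ["A", "A", "B"], "3P": [2, 3, 6]})
print(analyze(df, "df.groupby('Player')['3P'].sum().sort_values(ascending=False)"))
```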

Phase 5: LLM Interaction

LLM Call Flow (config.py β†’ LLM API)

1. Agent needs to process task
2. Calls llm.call(prompt, ...)
3. config.py routes to provider:
   
   Hugging Face:
   - Format: huggingface/{model_name}
   - API: https://api-inference.huggingface.co
   - Request: POST with prompt
   - Response: Generated text
   
   Ollama:
   - Base URL: http://localhost:11434/v1
   - OpenAI-compatible API
   - Request: POST /chat/completions
   - Response: Generated text
   
   OpenRouter:
   - Base URL: https://openrouter.ai/api/v1
   - Request: POST with model name
   - Response: Generated text

4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)

Phase 6: Result Aggregation

Result Collection (app.py Lines 39-80)

After crew.kickoff() completes:

1. Extract task outputs:
   - result.tasks_output[0] → Engineer result
   - result.tasks_output[1] → Analyst result
   - result.tasks_output[2] → Storyteller result

2. Format output:
   - Add headers: "## Engineer Agent Results"
   - Add separators: "---"
   - Combine all outputs

3. Store engineer result for reuse

4. Return formatted string to Gradio
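The aggregation step can be sketched as a small formatting function (the header wording is illustrative, not the exact app.py strings):

```python
def format_results(engineer: str, analyst: str, storyteller: str) -> str:
    """Sketch of the aggregation step: one markdown section per agent,
    joined with horizontal-rule separators."""
    sections = [
        ("## Engineer Agent Results", engineer),
        ("## Analyst Agent Results", analyst),
        ("## Storyteller Agent Results", storyteller),
    ]
    return "\n\n---\n\n".join(f"{header}\n\n{body}" for header, body in sections)
```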

Gradio Display (app.py Lines 200-340)

1. User sees results in output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions

🔄 Task Ordering & Dependencies

How Tasks Are Ordered (crew.py Lines 69-104)

Time →
│
├─ Task 1: Engineer (no dependencies)
│  └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst (no dependency on Engineer's output)
│  └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (waits for Analyst)
   └─ Uses: LLM only (no tools)

Key Points:

  • The crew is created with process=Process.sequential (crew.py), so tasks execute one after another: Engineer → Analyst → Storyteller
  • Engineer and Analyst are logically independent: neither consumes the other's output
  • Only the Storyteller has a hard dependency: it receives the Analyst's output as context

📊 Data Flow Diagram

CSV File
    ↓
[pandas.read_csv()]
    ↓
DataFrame
    ↓
    ├─→ Tools (read, search, analyze)
    │       ↓
    │   Results → Agent → LLM → Response
    │
    └─→ Vector DB (semantic search)
            ↓
        [SentenceTransformer]
            ↓
        Embeddings
            ↓
        [ChromaDB]
            ↓
        Similar Records → Agent → LLM → Response

🎯 Example: Complete Execution Trace

Input:

  • CSV: nba24-25.csv
  • Query: "Who are the top 5 three-point shooters?"

Execution:

  1. app.py: process_file_and_analyze(file, "top 5 three-point shooters")

  2. crew.py: create_flow_crew("top 5...", "nba24-25.csv")

  3. agents.py: Create Engineer, Analyst, Storyteller agents

  4. config.py: get_llm() → Returns Hugging Face LLM

  5. crew.kickoff() starts

  6. Task 1 (Engineer):

    • Agent: "I need to check the dataset"
    • Tool: get_nba_data_summary()
    • Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
    • LLM: "Dataset loaded. 5000 rows, ready for analysis."
  7. Task 2 (Analyst) - Runs after the Engineer task:

    • Agent: "User wants top 5 three-point shooters"
    • Tool: analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")
    • Execution:
      df = pd.read_csv("nba24-25.csv")
      result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
      # Returns: Player1: 250, Player2: 245, ...
      
    • LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
  8. Task 3 (Storyteller) - After Analyst:

    • Agent receives Analyst output
    • LLM: "🏀 Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed ..."
  9. app.py: Combine all outputs

  10. Gradio: Display to user


🔧 Key Configuration Points

LLM Provider Selection (config.py)

  • Environment variable: LLM_PROVIDER
  • Options: huggingface, ollama, openrouter, openai
  • Default: huggingface

Model Selection

  • Hugging Face: HF_MODEL (default: meta-llama/Llama-3.1-8B-Instruct)
  • Ollama: OLLAMA_MODEL (default: mistral)
  • OpenRouter: OPENROUTER_MODEL (default: google/gemma-2-2b-it:free)

Data Path

  • Default: NBA_DATA_PATH = "nba24-25.csv" (config.py)
  • Can be overridden by uploaded file

πŸ› Error Handling

At Each Level:

  1. app.py (Lines 82-86):

    • Try/except around crew.kickoff()
    • Returns error message with traceback
  2. Tools (tools.py):

    • Each tool has try/except
    • Returns error message if fails
  3. Vector DB (vector_db.py):

    • Handles missing files
    • Creates directory if needed
    • Handles indexing errors
  4. LLM (config.py):

    • Validates API keys
    • Raises ValueError if missing
    • Handles API errors
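The per-tool try/except pattern can be factored into a decorator, sketched here (not the actual tools.py code):

```python
import functools

def safe_tool(fn):
    """Sketch of the per-tool try/except pattern: return an error string
    instead of raising, so the agent sees what went wrong and can retry."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            return f"Error in {fn.__name__}: {exc}"
    return wrapper

@safe_tool
def read_csv_head(path: str) -> str:
    raise FileNotFoundError(path)  # simulate a missing data file

print(read_csv_head("missing.csv"))  # Error in read_csv_head: missing.csv
```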

πŸ“ Summary

Input Flow:

User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM

Output Flow:

LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User

Key Points:

  • All agents share the same LLM instance
  • Tools are stateless (read CSV each time)
  • Vector DB is persistent (indexed once, reused)
  • Tasks execute sequentially (Process.sequential); only the Storyteller depends on another task's output
  • Results are aggregated and formatted in app.py

Last Updated: Based on current codebase structure
Files Involved: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py