NBA_Analysis / EXECUTION_FLOW.md
Detailed Execution Flow - NBA Analysis Application

This document explains, step by step, how user input flows through the application and how each stage executes.


🎯 High-Level Flow Overview

User Input (CSV + Query)
    ↓
app.py (Gradio Interface)
    ↓
crew.py (CrewAI Orchestration)
    ↓
agents.py (AI Agents)
    ↓
tasks.py (Task Definitions)
    ↓
tools.py (Data Access Tools)
    ↓
vector_db.py / pandas (Data Processing)
    ↓
config.py (LLM Configuration)
    ↓
LLM API (Hugging Face / Ollama / etc.)
    ↓
Results → User

📋 Detailed Step-by-Step Execution

Phase 1: User Input & Initialization

Step 1.1: User Interaction (app.py)

  • File: app.py
  • Function: process_file_and_analyze() or process_question_only()
  • Input:
    • CSV file (uploaded via Gradio)
    • User query (optional text)
  • What happens:
    # Line 23-24: Validate file exists
    if file is None:
        return "Please upload a CSV file."
    
    # Line 27-28: Set default query if empty
    if not user_query:
        user_query = "Provide comprehensive analysis..."
    
    # Line 32-33: Extract file path
    file_path = file.name
    csv_path = file_path
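The validation above can be sketched as a plain function. The function name and the default-query wording here are illustrative, not the exact strings in app.py:

```python
def validate_inputs(file_path, user_query):
    """Stand-in for the checks in process_file_and_analyze():
    reject a missing file, fall back to a default query."""
    if file_path is None:
        return None, "Please upload a CSV file."
    if not user_query or not user_query.strip():
        # Placeholder default; the real string in app.py is longer
        user_query = "Provide comprehensive analysis of the dataset."
    return (file_path, user_query.strip()), None

inputs, error = validate_inputs("nba24-25.csv", "  top 5 scorers  ")
print(inputs, error)
```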
    

Step 1.2: Crew Creation (crew.py)

  • File: crew.py
  • Function: create_flow_crew(user_query, csv_path)
  • What happens:
    # Line 82-84: Create all agents
    engineer_agent = create_engineer_agent(csv_path)
    analyst_agent = create_analyst_agent(csv_path)
    storyteller_agent = create_storyteller_agent()
    
    # Line 88-94: Create tasks
    data_engineering_task = create_data_engineering_task(...)
    custom_analysis_task = create_custom_analysis_task(...)
    storyteller_task = create_storyteller_task(...)
    
    # Line 99-104: Create Crew with agents and tasks
    return Crew(agents=[...], tasks=[...], process=Process.sequential)
    

Phase 2: Agent Initialization

Step 2.1: LLM Configuration (config.py)

  • File: config.py
  • Function: get_llm()
  • What happens:
    # Line 13: Check provider (default: "huggingface")
    LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
    
    # Line 54-64: Create LLM instance based on provider
    if LLM_PROVIDER == "huggingface":
        return LLM(
            model=f"huggingface/{HF_MODEL}",
            api_key=HF_API_KEY
        )
    # Similar for ollama, openrouter, etc.
    
  • Output: Configured LLM instance (used by all agents)
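A minimal sketch of this provider dispatch, returning a plain dict instead of a CrewAI LLM instance. Env-var names and defaults mirror the ones listed under "Key Configuration Points"; the openrouter/openai branches are omitted:

```python
import os

def select_llm_config():
    """Sketch of get_llm()'s provider dispatch (dict instead of an
    actual CrewAI LLM object)."""
    provider = os.getenv("LLM_PROVIDER", "huggingface")
    if provider == "huggingface":
        model = os.getenv("HF_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
        return {"model": f"huggingface/{model}", "api_key": os.getenv("HF_API_KEY")}
    if provider == "ollama":
        return {"model": os.getenv("OLLAMA_MODEL", "mistral"),
                "base_url": "http://localhost:11434/v1"}
    raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")
```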

Step 2.2: Agent Creation (agents.py)

  • File: agents.py
  • Functions: create_engineer_agent(), create_analyst_agent(), create_storyteller_agent()
  • What happens:

Engineer Agent (Lines 12-36):

# Line 22-23: Get data path and tools
data_path = csv_path or NBA_DATA_PATH
agent_tools = get_agent_tools(data_path)

# Line 25-36: Create agent with:
- role: "Data Engineer"
- goal: Process and clean data
- backstory: Expert data engineer description
- llm: Shared LLM instance
- tools: Data access tools (read, search, analyze)

Analyst Agent (Lines 39-69):

# Similar structure but with:
- role: "Data Analyst"
- goal: Extract insights and patterns
- backstory: Includes instructions to use analyze_nba_data for aggregations
- tools: Same data tools

Storyteller Agent (Lines 72-93):

- role: "Sports Storyteller"
- goal: Create engaging headlines from analysis
- tools: [] (no data tools, only uses LLM)

Step 2.3: Tools Initialization (tools.py)

  • File: tools.py
  • Function: get_agent_tools(data_path)
  • What happens:
    # Returns list of 5 tools:
    1. read_nba_data(limit) - Read sample rows
    2. search_nba_data(query, column, value) - Filter/search CSV
    3. get_nba_data_summary() - Get dataset overview
    4. semantic_search_nba_data(query) - Vector search
    5. analyze_nba_data(pandas_code) - Execute pandas operations
    
  • Note: Each tool is wrapped with CrewAI's @tool decorator
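CrewAI's real @tool decorator does more than this, but conceptually the wrapping exposes a name and description that the agent reasons over when picking a tool. A toy stand-in:

```python
def tool(fn):
    # Toy stand-in for CrewAI's @tool decorator: expose the function's
    # name and docstring as the metadata the agent uses to pick a tool.
    fn.tool_name = fn.__name__
    fn.tool_description = (fn.__doc__ or "").strip()
    return fn

@tool
def get_nba_data_summary() -> str:
    """Return an overview of the NBA dataset: rows, columns, unique values."""
    return "Dataset summary..."
```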

Phase 3: Task Execution

Step 3.1: Crew Kickoff (app.py β†’ crew.py)

  • File: app.py Line 36-37
  • What happens:
    crew = create_flow_crew(user_query.strip(), csv_path)
    result = crew.kickoff()  # This triggers execution
    

Step 3.2: Task 1 - Data Engineering (tasks.py)

  • File: tasks.py Lines 8-40
  • Task: create_data_engineering_task()
  • Agent: Engineer Agent
  • Execution Flow:
    1. Engineer Agent receives task description
    2. LLM processes task: "Examine dataset, get summary..."
    3. Agent decides to use: get_nba_data_summary()
    4. Tool execution (tools.py):
       - Reads CSV with pandas
       - Calculates stats (rows, columns, unique values)
       - Returns formatted summary
    5. LLM receives tool output
    6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
    7. Task complete → Output stored
    

Step 3.3: Task 2 - Data Analysis (tasks.py)

  • File: tasks.py Lines 55-95 (create_custom_analysis_task)
  • Agent: Analyst Agent
  • Execution Flow:
    1. Analyst Agent receives user query + task description
    2. LLM analyzes query: "What does user want?"
    3. Agent decides which tools to use:
       - For aggregations → analyze_nba_data()
       - For searches → search_nba_data() or semantic_search_nba_data()
       - For overview → get_nba_data_summary()
    
    4. Tool Execution Examples:
    
    Example A: "Top 5 three-point shooters"
      - Agent generates pandas code:
        df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
      - analyze_nba_data() executes code
      - Returns the resulting Series, formatted as a string
      - LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
    
    Example B: "Find LeBron James games"
      - Agent uses search_nba_data(query="LeBron James")
      - Tool filters CSV, returns matching rows
      - LLM analyzes results, provides insights
    
    Example C: "High scoring games"
      - Agent uses semantic_search_nba_data("high scoring games")
      - Vector DB finds semantically similar records
      - Returns top matches with similarity scores
      - LLM provides analysis
    
    5. LLM generates final analysis report
    6. Task complete → Output stored
    

Step 3.4: Task 3 - Storytelling (tasks.py)

  • File: tasks.py Lines 98-130 (create_storyteller_task)
  • Agent: Storyteller Agent
  • Dependency: Waits for Analyst task to complete
  • Execution Flow:
    1. Storyteller Agent receives Analyst's output as context
    2. LLM processes: "Create engaging headline and story"
    3. No tools used (only LLM)
    4. LLM generates:
       - Catchy headline
       - Engaging narrative
       - Context and insights
    5. Task complete → Output stored
    

Phase 4: Tool Execution Details

Tool 1: read_nba_data(limit) (tools.py Lines 22-30)

Input: limit (number of rows)
Execution:
  1. pd.read_csv(data_path)
  2. df.head(limit)
  3. Format as string
Output: Sample rows with column names

Tool 2: search_nba_data(query, column, value) (tools.py Lines 32-71)

Input: query (text), column (name), value (filter)
Execution:
  1. pd.read_csv(data_path)
  2. Apply filters if provided
  3. Text search across columns
  4. Limit to 50 rows max
Output: Filtered DataFrame as string
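A hedged sketch of that filter-then-search logic with pandas (function and parameter names are illustrative, not the exact tools.py code):

```python
import pandas as pd

def search_rows(df, query="", column=None, value=None, max_rows=50):
    """Sketch of search_nba_data's logic: exact column filter first,
    then a case-insensitive text search across all columns."""
    out = df
    if column and value is not None:
        out = out[out[column].astype(str) == str(value)]
    if query:
        mask = out.apply(
            lambda row: row.astype(str).str.contains(query, case=False).any(),
            axis=1,
        )
        out = out[mask]
    return out.head(max_rows)  # cap output at 50 rows, as the tool does
```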

Tool 3: get_nba_data_summary() (tools.py Lines 73-94)

Input: None
Execution:
  1. pd.read_csv(data_path)
  2. Calculate: total rows, columns, unique players/teams
  3. Get date range
  4. Identify numeric columns
  5. Show sample rows
Output: Comprehensive dataset summary

Tool 4: semantic_search_nba_data(query) (tools.py Lines 135-175)

Input: query (natural language)
Execution:
  1. Get vector_db instance (vector_db.py)
  2. Check if indexed (if not, index CSV)
  3. Generate embedding for query
  4. Search in ChromaDB
  5. Return top N similar records
  6. Load original CSV rows
Output: Similar records with metadata

Vector DB Indexing (vector_db.py Lines 94-156):

First time only:
  1. Load SentenceTransformer model
  2. Read CSV
  3. For each row:
     - Convert to text: "Player: X, Team: Y, Points: Z..."
     - Generate embedding
     - Store in ChromaDB with metadata
  4. Persist to disk (chroma_db/)
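The row-to-text step might look like this; the exact field formatting in vector_db.py may differ:

```python
def row_to_text(row: dict) -> str:
    # One CSV row becomes one sentence-like string, which the
    # SentenceTransformer model then embeds and ChromaDB stores.
    return ", ".join(f"{key}: {value}" for key, value in row.items())

row = {"Player": "LeBron James", "Team": "LAL", "PTS": 30}
print(row_to_text(row))  # Player: LeBron James, Team: LAL, PTS: 30
```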

Tool 5: analyze_nba_data(pandas_code) (tools.py Lines 203-253)

Input: pandas_code (string of pandas operations)
Execution:
  1. Load CSV into DataFrame 'df'
  2. Create safe namespace: {'pd': pandas, 'df': df}
  3. Execute: exec(f"result = {pandas_code}", namespace)
  4. Get result from namespace
  5. Format output:
     - DataFrame → to_string()
     - Series → to_string()
     - Limit to 50 rows if large
Output: Analysis results as string
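A runnable sketch of this pattern. Note that exec on LLM-generated code is inherently risky; a production version should also restrict builtins and sandbox the call:

```python
import pandas as pd

def analyze(df, pandas_code, max_rows=50):
    """Sketch of analyze_nba_data: evaluate a pandas expression against
    `df` in a small namespace and return the result as a string."""
    namespace = {"pd": pd, "df": df}
    exec(f"result = {pandas_code}", namespace)  # same pattern as tools.py
    result = namespace["result"]
    if isinstance(result, (pd.DataFrame, pd.Series)):
        return result.head(max_rows).to_string()
    return str(result)

df = pd.DataFrame({"Player": ["A", "A", "B"], "3P": [2, 3, 6]})
print(analyze(df, "df.groupby('Player')['3P'].sum().sort_values(ascending=False)"))
```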

Phase 5: LLM Interaction

LLM Call Flow (config.py β†’ LLM API)

1. Agent needs to process task
2. Calls llm.call(prompt, ...)
3. config.py routes to provider:
   
   Hugging Face:
   - Format: huggingface/{model_name}
   - API: https://api-inference.huggingface.co
   - Request: POST with prompt
   - Response: Generated text
   
   Ollama:
   - Base URL: http://localhost:11434/v1
   - OpenAI-compatible API
   - Request: POST /chat/completions
   - Response: Generated text
   
   OpenRouter:
   - Base URL: https://openrouter.ai/api/v1
   - Request: POST with model name
   - Response: Generated text

4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)

Phase 6: Result Aggregation

Result Collection (app.py Lines 39-80)

After crew.kickoff() completes:

1. Extract task outputs:
   - result.tasks_output[0] → Engineer result
   - result.tasks_output[1] → Analyst result
   - result.tasks_output[2] → Storyteller result

2. Format output:
   - Add headers: "## Engineer Agent Results"
   - Add separators: "---"
   - Combine all outputs

3. Store engineer result for reuse

4. Return formatted string to Gradio
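The aggregation step can be sketched as a small formatting function (the header wording is illustrative, not the exact app.py strings):

```python
def format_results(engineer: str, analyst: str, storyteller: str) -> str:
    """Sketch of the aggregation step: one markdown section per agent,
    joined with horizontal-rule separators."""
    sections = [
        ("## Engineer Agent Results", engineer),
        ("## Analyst Agent Results", analyst),
        ("## Storyteller Agent Results", storyteller),
    ]
    return "\n\n---\n\n".join(f"{header}\n\n{body}" for header, body in sections)
```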

Gradio Display (app.py Lines 200-340)

1. User sees results in output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions

🔄 Task Ordering & Dependencies

How Tasks Are Ordered (crew.py Lines 69-104)

Time →
│
├─ Task 1: Engineer (no dependencies)
│  └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst (no dependency on Engineer's output)
│  └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (waits for Analyst)
   └─ Uses: LLM only (no tools)

Key Points:

  • The crew is created with process=Process.sequential (crew.py), so tasks execute one after another: Engineer → Analyst → Storyteller
  • Engineer and Analyst are logically independent: neither consumes the other's output
  • Only the Storyteller has a hard dependency: it receives the Analyst's output as context

📊 Data Flow Diagram

CSV File
    ↓
[pandas.read_csv()]
    ↓
DataFrame
    ↓
    ├─→ Tools (read, search, analyze)
    │       ↓
    │   Results → Agent → LLM → Response
    │
    └─→ Vector DB (semantic search)
            ↓
        [SentenceTransformer]
            ↓
        Embeddings
            ↓
        [ChromaDB]
            ↓
        Similar Records → Agent → LLM → Response

🎯 Example: Complete Execution Trace

Input:

  • CSV: nba24-25.csv
  • Query: "Who are the top 5 three-point shooters?"

Execution:

  1. app.py: process_file_and_analyze(file, "top 5 three-point shooters")

  2. crew.py: create_flow_crew("top 5...", "nba24-25.csv")

  3. agents.py: Create Engineer, Analyst, Storyteller agents

  4. config.py: get_llm() → Returns Hugging Face LLM

  5. crew.kickoff() starts

  6. Task 1 (Engineer):

    • Agent: "I need to check the dataset"
    • Tool: get_nba_data_summary()
    • Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
    • LLM: "Dataset loaded. 5000 rows, ready for analysis."
  7. Task 2 (Analyst) - Runs after the Engineer task:

    • Agent: "User wants top 5 three-point shooters"
    • Tool: analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")
    • Execution:
      df = pd.read_csv("nba24-25.csv")
      result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
      # Returns: Player1: 250, Player2: 245, ...
      
    • LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
  8. Task 3 (Storyteller) - After Analyst:

    • Agent receives Analyst output
    • LLM: "🏀 Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed ..."
  9. app.py: Combine all outputs

  10. Gradio: Display to user


🔧 Key Configuration Points

LLM Provider Selection (config.py)

  • Environment variable: LLM_PROVIDER
  • Options: huggingface, ollama, openrouter, openai
  • Default: huggingface

Model Selection

  • Hugging Face: HF_MODEL (default: meta-llama/Llama-3.1-8B-Instruct)
  • Ollama: OLLAMA_MODEL (default: mistral)
  • OpenRouter: OPENROUTER_MODEL (default: google/gemma-2-2b-it:free)

Data Path

  • Default: NBA_DATA_PATH = "nba24-25.csv" (config.py)
  • Can be overridden by uploaded file

πŸ› Error Handling

At Each Level:

  1. app.py (Lines 82-86):

    • Try/except around crew.kickoff()
    • Returns error message with traceback
  2. Tools (tools.py):

    • Each tool has try/except
    • Returns error message if fails
  3. Vector DB (vector_db.py):

    • Handles missing files
    • Creates directory if needed
    • Handles indexing errors
  4. LLM (config.py):

    • Validates API keys
    • Raises ValueError if missing
    • Handles API errors
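The per-tool try/except pattern can be factored into a decorator, sketched here (not the actual tools.py code):

```python
import functools

def safe_tool(fn):
    """Sketch of the per-tool try/except pattern: return an error string
    instead of raising, so the agent sees what went wrong and can retry."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            return f"Error in {fn.__name__}: {exc}"
    return wrapper

@safe_tool
def read_csv_head(path: str) -> str:
    raise FileNotFoundError(path)  # simulate a missing data file

print(read_csv_head("missing.csv"))  # Error in read_csv_head: missing.csv
```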

πŸ“ Summary

Input Flow:

User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM

Output Flow:

LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User

Key Points:

  • All agents share the same LLM instance
  • Tools are stateless (read CSV each time)
  • Vector DB is persistent (indexed once, reused)
  • Tasks execute sequentially (Process.sequential); only the Storyteller depends on another task's output
  • Results are aggregated and formatted in app.py

Last Updated: Based on current codebase structure
Files Involved: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py