# Detailed Execution Flow - NBA Analysis Application

This document explains step-by-step how user input flows through the application and gets executed.

---

## 🎯 High-Level Flow Overview

```
User Input (CSV + Query)
        ↓
app.py (Gradio Interface)
        ↓
crew.py (CrewAI Orchestration)
        ↓
agents.py (AI Agents)
        ↓
tasks.py (Task Definitions)
        ↓
tools.py (Data Access Tools)
        ↓
vector_db.py / pandas (Data Processing)
        ↓
config.py (LLM Configuration)
        ↓
LLM API (Hugging Face / Ollama / etc.)
        ↓
Results → User
```

---

## 📋 Detailed Step-by-Step Execution

### **Phase 1: User Input & Initialization**

#### Step 1.1: User Interaction (`app.py`)

- **File**: `app.py`
- **Function**: `process_file_and_analyze()` or `process_question_only()`
- **Input**:
  - CSV file (uploaded via Gradio)
  - User query (optional text)
- **What happens**:

```python
# Line 23-24: Validate file exists
if file is None:
    return "Please upload a CSV file."

# Line 27-28: Set default query if empty
if not user_query:
    user_query = "Provide comprehensive analysis..."

# Line 32-33: Extract file path
file_path = file.name
csv_path = file_path
```

#### Step 1.2: Crew Creation (`crew.py`)

- **File**: `crew.py`
- **Function**: `create_flow_crew(user_query, csv_path)`
- **What happens**:

```python
# Line 82-84: Create all agents
engineer_agent = create_engineer_agent(csv_path)
analyst_agent = create_analyst_agent(csv_path)
storyteller_agent = create_storyteller_agent()

# Line 88-94: Create tasks
data_engineering_task = create_data_engineering_task(...)
custom_analysis_task = create_custom_analysis_task(...)
storyteller_task = create_storyteller_task(...)
# Line 99-104: Create Crew with agents and tasks
return Crew(agents=[...], tasks=[...], process=Process.sequential)
```

---

### **Phase 2: Agent Initialization**

#### Step 2.1: LLM Configuration (`config.py`)

- **File**: `config.py`
- **Function**: `get_llm()`
- **What happens**:

```python
# Line 13: Check provider (default: "huggingface")
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")

# Line 54-64: Create LLM instance based on provider
if LLM_PROVIDER == "huggingface":
    return LLM(
        model=f"huggingface/{HF_MODEL}",
        api_key=HF_API_KEY
    )
# Similar for ollama, openrouter, etc.
```

- **Output**: Configured LLM instance (used by all agents)

#### Step 2.2: Agent Creation (`agents.py`)

- **File**: `agents.py`
- **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
- **What happens**:

**Engineer Agent** (Lines 12-36):

```python
# Line 22-23: Get data path and tools
data_path = csv_path or NBA_DATA_PATH
agent_tools = get_agent_tools(data_path)

# Line 25-36: Create agent with:
# - role: "Data Engineer"
# - goal: Process and clean data
# - backstory: Expert data engineer description
# - llm: Shared LLM instance
# - tools: Data access tools (read, search, analyze)
```

**Analyst Agent** (Lines 39-69):

```python
# Similar structure but with:
# - role: "Data Analyst"
# - goal: Extract insights and patterns
# - backstory: Includes instructions to use analyze_nba_data for aggregations
# - tools: Same data tools
```

**Storyteller Agent** (Lines 72-93):

```python
# - role: "Sports Storyteller"
# - goal: Create engaging headlines from analysis
# - tools: [] (no data tools, only uses the LLM)
```

#### Step 2.3: Tools Initialization (`tools.py`)

- **File**: `tools.py`
- **Function**: `get_agent_tools(data_path)`
- **What happens**:

```python
# Returns a list of 5 tools:
# 1. read_nba_data(limit) - Read sample rows
# 2. search_nba_data(query, column, value) - Filter/search CSV
# 3. get_nba_data_summary() - Get dataset overview
# 4.
#    semantic_search_nba_data(query) - Vector search
# 5. analyze_nba_data(pandas_code) - Execute pandas operations
```

- **Note**: Each tool is wrapped with the `@tool` decorator for CrewAI

---

### **Phase 3: Task Execution**

#### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)

- **File**: `app.py` Lines 36-37
- **What happens**:

```python
crew = create_flow_crew(user_query.strip(), csv_path)
result = crew.kickoff()  # This triggers execution
```

#### Step 3.2: Task 1 - Data Engineering (`tasks.py`)

- **File**: `tasks.py` Lines 8-40
- **Task**: `create_data_engineering_task()`
- **Agent**: Engineer Agent
- **Execution Flow**:

```
1. Engineer Agent receives task description
2. LLM processes task: "Examine dataset, get summary..."
3. Agent decides to use: get_nba_data_summary()
4. Tool execution (tools.py):
   - Reads CSV with pandas
   - Calculates stats (rows, columns, unique values)
   - Returns formatted summary
5. LLM receives tool output
6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
7. Task complete → Output stored
```

#### Step 3.3: Task 2 - Data Analysis (`tasks.py`)

- **File**: `tasks.py` Lines 55-95 (`create_custom_analysis_task`)
- **Agent**: Analyst Agent
- **Execution Flow**:

```
1. Analyst Agent receives user query + task description
2. LLM analyzes the query: "What does the user want?"
3. Agent decides which tools to use:
   - For aggregations → analyze_nba_data()
   - For searches → search_nba_data() or semantic_search_nba_data()
   - For overview → get_nba_data_summary()
4. Tool Execution Examples:

   Example A: "Top 5 three-point shooters"
   - Agent generates pandas code:
     df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
   - analyze_nba_data() executes the code
   - Returns DataFrame with results
   - LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
   Example B: "Find LeBron James games"
   - Agent uses search_nba_data(query="LeBron James")
   - Tool filters CSV, returns matching rows
   - LLM analyzes results, provides insights

   Example C: "High scoring games"
   - Agent uses semantic_search_nba_data("high scoring games")
   - Vector DB finds semantically similar records
   - Returns top matches with similarity scores
   - LLM provides analysis

5. LLM generates final analysis report
6. Task complete → Output stored
```

#### Step 3.4: Task 3 - Storytelling (`tasks.py`)

- **File**: `tasks.py` Lines 98-130 (`create_storyteller_task`)
- **Agent**: Storyteller Agent
- **Dependency**: Waits for the Analyst task to complete
- **Execution Flow**:

```
1. Storyteller Agent receives the Analyst's output as context
2. LLM processes: "Create engaging headline and story"
3. No tools used (LLM only)
4. LLM generates:
   - Catchy headline
   - Engaging narrative
   - Context and insights
5. Task complete → Output stored
```

---

### **Phase 4: Tool Execution Details**

#### Tool 1: `read_nba_data(limit)` (`tools.py` Lines 22-30)

```
Input: limit (number of rows)
Execution:
1. pd.read_csv(data_path)
2. df.head(limit)
3. Format as string
Output: Sample rows with column names
```

#### Tool 2: `search_nba_data(query, column, value)` (`tools.py` Lines 32-71)

```
Input: query (text), column (name), value (filter)
Execution:
1. pd.read_csv(data_path)
2. Apply filters if provided
3. Text search across columns
4. Limit to 50 rows max
Output: Filtered DataFrame as string
```

#### Tool 3: `get_nba_data_summary()` (`tools.py` Lines 73-94)

```
Input: None
Execution:
1. pd.read_csv(data_path)
2. Calculate: total rows, columns, unique players/teams
3. Get date range
4. Identify numeric columns
5. Show sample rows
Output: Comprehensive dataset summary
```

#### Tool 4: `semantic_search_nba_data(query)` (`tools.py` Lines 135-175)

```
Input: query (natural language)
Execution:
1. Get vector_db instance (vector_db.py)
2. Check if indexed (if not, index the CSV)
3.
   Generate embedding for query
4. Search in ChromaDB
5. Return top N similar records
6. Load original CSV rows
Output: Similar records with metadata
```

**Vector DB Indexing** (`vector_db.py` Lines 94-156):

```
First time only:
1. Load SentenceTransformer model
2. Read CSV
3. For each row:
   - Convert to text: "Player: X, Team: Y, Points: Z..."
   - Generate embedding
   - Store in ChromaDB with metadata
4. Persist to disk (chroma_db/)
```

#### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py` Lines 203-253)

```
Input: pandas_code (string of pandas operations)
Execution:
1. Load CSV into DataFrame 'df'
2. Create safe namespace: {'pd': pandas, 'df': df}
3. Execute: exec(f"result = {pandas_code}", namespace)
4. Get result from namespace
5. Format output:
   - DataFrame → to_string()
   - Series → to_string()
   - Limit to 50 rows if large
Output: Analysis results as string
```

---

### **Phase 5: LLM Interaction**

#### LLM Call Flow (`config.py` → LLM API)

```
1. Agent needs to process a task
2. Calls llm.call(prompt, ...)
3. config.py routes to the provider:

   Hugging Face:
   - Format: huggingface/{model_name}
   - API: https://api-inference.huggingface.co
   - Request: POST with prompt
   - Response: Generated text

   Ollama:
   - Base URL: http://localhost:11434/v1
   - OpenAI-compatible API
   - Request: POST /chat/completions
   - Response: Generated text

   OpenRouter:
   - Base URL: https://openrouter.ai/api/v1
   - Request: POST with model name
   - Response: Generated text

4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)
```

---

### **Phase 6: Result Aggregation**

#### Result Collection (`app.py` Lines 39-80)

```
After crew.kickoff() completes:
1. Extract task outputs:
   - result.tasks_output[0] → Engineer result
   - result.tasks_output[1] → Analyst result
   - result.tasks_output[2] → Storyteller result
2. Format output:
   - Add headers: "## Engineer Agent Results"
   - Add separators: "---"
   - Combine all outputs
3.
   Store engineer result for reuse
4. Return formatted string to Gradio
```

#### Gradio Display (`app.py` Lines 200-340)

```
1. User sees results in the output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions
```

---

## 🔄 Task Ordering

### How Tasks Are Scheduled (`crew.py` Lines 69-104)

The crew is created with `process=Process.sequential`, so tasks execute one after another in the order they are listed: Engineer → Analyst → Storyteller.

```
Time →
│
├─ Task 1: Engineer
│  └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst
│  └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (consumes the Analyst's output)
   └─ Uses: LLM only (no tools)
```

**Key Points**:
- The Engineer and Analyst tasks have no data dependency on each other, but under `Process.sequential` they still run one at a time
- The Storyteller runs **after** the Analyst completes (it needs the Analyst's output as context)
- CrewAI handles task ordering and context passing automatically

---

## 📊 Data Flow Diagram

```
CSV File
   ↓
[pandas.read_csv()]
   ↓
DataFrame
   ↓
├─→ Tools (read, search, analyze)
│      ↓
│   Results → Agent → LLM → Response
│
└─→ Vector DB (semantic search)
       ↓
   [SentenceTransformer]
       ↓
   Embeddings
       ↓
   [ChromaDB]
       ↓
   Similar Records → Agent → LLM → Response
```

---

## 🎯 Example: Complete Execution Trace

### Input:
- CSV: `nba24-25.csv`
- Query: "Who are the top 5 three-point shooters?"

### Execution:
1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
3. **agents.py**: Create Engineer, Analyst, Storyteller agents
4. **config.py**: `get_llm()` → Returns Hugging Face LLM
5. **crew.kickoff()** starts
6. **Task 1 (Engineer)**:
   - Agent: "I need to check the dataset"
   - Tool: `get_nba_data_summary()`
   - Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
   - LLM: "Dataset loaded. 5000 rows, ready for analysis."
7.
   **Task 2 (Analyst)** - runs after the Engineer task:
   - Agent: "User wants top 5 three-point shooters"
   - Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
   - Execution:
     ```python
     df = pd.read_csv("nba24-25.csv")
     result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
     # Returns: Player1: 250, Player2: 245, ...
     ```
   - LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
8. **Task 3 (Storyteller)** - after the Analyst:
   - Agent receives the Analyst's output
   - LLM: "🏀 **Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."
9. **app.py**: Combine all outputs
10. **Gradio**: Display to user

---

## 🔧 Key Configuration Points

### LLM Provider Selection (`config.py`)
- Environment variable: `LLM_PROVIDER`
- Options: `huggingface`, `ollama`, `openrouter`, `openai`
- Default: `huggingface`

### Model Selection
- Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
- Ollama: `OLLAMA_MODEL` (default: `mistral`)
- OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)

### Data Path
- Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
- Can be overridden by an uploaded file

---

## 🐛 Error Handling

### At Each Level:
1. **app.py** (Lines 82-86):
   - Try/except around `crew.kickoff()`
   - Returns error message with traceback
2. **Tools** (tools.py):
   - Each tool has try/except
   - Returns an error message on failure
3. **Vector DB** (vector_db.py):
   - Handles missing files
   - Creates directory if needed
   - Handles indexing errors
4.
   **LLM** (config.py):
   - Validates API keys
   - Raises ValueError if missing
   - Handles API errors

---

## 📝 Summary

**Input Flow**:
```
User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
```

**Output Flow**:
```
LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
```

**Key Points**:
- All agents share the same LLM instance
- Tools are stateless (they re-read the CSV on each call)
- The vector DB is persistent (indexed once, reused)
- Tasks run sequentially (`Process.sequential`), with each task's output available as context to later tasks
- Results are aggregated and formatted in app.py

---

**Last Updated**: Based on current codebase structure
**Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py
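The core of the `analyze_nba_data` tool described in Phase 4 is a single pattern: evaluate a caller-supplied expression inside a restricted namespace. Below is a minimal, stdlib-only sketch of that pattern; `run_analysis` and its whitelist are illustrative names (not the real tool), and a plain dict stands in for the pandas DataFrame that the actual tool binds to `df`.

```python
def run_analysis(expression: str, data):
    """Evaluate `expression` with only `data` (bound to `df`) and a small
    whitelist of callables visible; builtins are stripped so the expression
    cannot reach arbitrary interpreter state."""
    namespace = {"df": data, "sorted": sorted, "sum": sum, "__builtins__": {}}
    exec(f"result = {expression}", namespace)
    return namespace["result"]

# A plain dict stands in for the DataFrame in this sketch.
threes = {"Player1": 250, "Player2": 245, "Player3": 200}
top2 = run_analysis("sorted(df.items(), key=lambda kv: -kv[1])[:2]", threes)
# → [('Player1', 250), ('Player2', 245)]
```

Note that stripping `__builtins__` narrows the attack surface but is not a true sandbox; `exec`-based tools like this should only ever receive expressions from a trusted source (here, the agent's LLM).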
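Similarly, the row-to-text conversion that `vector_db.py` performs before embedding (Phase 4, Tool 4) can be sketched with the stdlib alone. `row_to_text` is an illustrative name; in the real pipeline the resulting strings are fed to the SentenceTransformer model and stored in ChromaDB.

```python
import csv
import io

def row_to_text(row: dict) -> str:
    """Flatten one CSV row into a single descriptive string suitable for
    embedding, e.g. "Player: X, Team: Y, PTS: Z"."""
    return ", ".join(f"{col}: {val}" for col, val in row.items())

# The real code reads the uploaded CSV from disk; a small in-memory CSV
# keeps this sketch self-contained.
sample = io.StringIO("Player,Team,PTS\nLeBron James,LAL,28\n")
texts = [row_to_text(row) for row in csv.DictReader(sample)]
# → ["Player: LeBron James, Team: LAL, PTS: 28"]
```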