# Detailed Execution Flow - NBA Analysis Application

This document explains step by step how user input flows through the application and gets executed.

---

## High-Level Flow Overview

```
User Input (CSV + Query)
        ↓
app.py (Gradio Interface)
        ↓
crew.py (CrewAI Orchestration)
        ↓
agents.py (AI Agents)
        ↓
tasks.py (Task Definitions)
        ↓
tools.py (Data Access Tools)
        ↓
vector_db.py / pandas (Data Processing)
        ↓
config.py (LLM Configuration)
        ↓
LLM API (Hugging Face / Ollama / etc.)
        ↓
Results → User
```
---

## Detailed Step-by-Step Execution

### **Phase 1: User Input & Initialization**

#### Step 1.1: User Interaction (`app.py`)

- **File**: `app.py`
- **Function**: `process_file_and_analyze()` or `process_question_only()`
- **Input**:
  - CSV file (uploaded via Gradio)
  - User query (optional text)
- **What happens**:

```python
# Lines 23-24: Validate that a file was uploaded
if file is None:
    return "Please upload a CSV file."

# Lines 27-28: Fall back to a default query if none was given
if not user_query:
    user_query = "Provide comprehensive analysis..."

# Lines 32-33: Extract the file path
file_path = file.name
csv_path = file_path
```
#### Step 1.2: Crew Creation (`crew.py`)

- **File**: `crew.py`
- **Function**: `create_flow_crew(user_query, csv_path)`
- **What happens**:

```python
# Lines 82-84: Create all agents
engineer_agent = create_engineer_agent(csv_path)
analyst_agent = create_analyst_agent(csv_path)
storyteller_agent = create_storyteller_agent()

# Lines 88-94: Create tasks
data_engineering_task = create_data_engineering_task(...)
custom_analysis_task = create_custom_analysis_task(...)
storyteller_task = create_storyteller_task(...)

# Lines 99-104: Create the Crew with agents and tasks
return Crew(agents=[...], tasks=[...], process=Process.sequential)
```
---

### **Phase 2: Agent Initialization**

#### Step 2.1: LLM Configuration (`config.py`)

- **File**: `config.py`
- **Function**: `get_llm()`
- **What happens**:

```python
# Line 13: Check the provider (default: "huggingface")
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")

# Lines 54-64: Create an LLM instance based on the provider
if LLM_PROVIDER == "huggingface":
    return LLM(
        model=f"huggingface/{HF_MODEL}",
        api_key=HF_API_KEY
    )
# Similar branches for ollama, openrouter, etc.
```

- **Output**: A configured LLM instance (shared by all agents)
#### Step 2.2: Agent Creation (`agents.py`)

- **File**: `agents.py`
- **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
- **What happens**:

**Engineer Agent** (Lines 12-36):

```python
# Lines 22-23: Get the data path and tools
data_path = csv_path or NBA_DATA_PATH
agent_tools = get_agent_tools(data_path)

# Lines 25-36: Create the agent with:
#   role: "Data Engineer"
#   goal: Process and clean the data
#   backstory: Expert data engineer description
#   llm: Shared LLM instance
#   tools: Data access tools (read, search, analyze)
```

**Analyst Agent** (Lines 39-69):

```
Same structure, but with:
- role: "Data Analyst"
- goal: Extract insights and patterns
- backstory: Includes instructions to use analyze_nba_data for aggregations
- tools: Same data tools
```

**Storyteller Agent** (Lines 72-93):

```
- role: "Sports Storyteller"
- goal: Create engaging headlines from the analysis
- tools: [] (no data tools; uses only the LLM)
```
#### Step 2.3: Tools Initialization (`tools.py`)

- **File**: `tools.py`
- **Function**: `get_agent_tools(data_path)`
- **What happens**: returns a list of 5 tools:

```
1. read_nba_data(limit)                  - Read sample rows
2. search_nba_data(query, column, value) - Filter/search the CSV
3. get_nba_data_summary()                - Get a dataset overview
4. semantic_search_nba_data(query)       - Vector search
5. analyze_nba_data(pandas_code)         - Execute pandas operations
```

- **Note**: Each tool is wrapped with the `@tool` decorator for CrewAI
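The `@tool` wrapping can be pictured with a minimal stand-in decorator. This is a hypothetical sketch of the pattern, not the real `crewai.tools` implementation:

```python
# Minimal stand-in for a CrewAI-style @tool decorator (illustrative only).
def tool(name):
    def decorator(fn):
        fn.tool_name = name           # metadata the framework would read
        fn.description = fn.__doc__   # the docstring doubles as the tool description
        return fn
    return decorator

@tool("read_nba_data")
def read_nba_data(limit: int = 5) -> str:
    """Read the first `limit` rows of the NBA CSV."""
    # Placeholder body; the real tool calls pd.read_csv(data_path).head(limit)
    return "\n".join(f"row {i}" for i in range(limit))
```

The decorator attaches metadata without changing the function's behavior, which is why each tool still reads like a plain Python function.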
---

### **Phase 3: Task Execution**

#### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)

- **File**: `app.py`, Lines 36-37
- **What happens**:

```python
crew = create_flow_crew(user_query.strip(), csv_path)
result = crew.kickoff()  # This triggers execution
```
#### Step 3.2: Task 1 - Data Engineering (`tasks.py`)

- **File**: `tasks.py`, Lines 8-40
- **Task**: `create_data_engineering_task()`
- **Agent**: Engineer Agent
- **Execution Flow**:

```
1. Engineer Agent receives the task description
2. LLM processes the task: "Examine dataset, get summary..."
3. Agent decides to use: get_nba_data_summary()
4. Tool execution (tools.py):
   - Reads the CSV with pandas
   - Calculates stats (rows, columns, unique values)
   - Returns a formatted summary
5. LLM receives the tool output
6. LLM generates a confirmation: "Dataset loaded, X rows, Y columns..."
7. Task complete → Output stored
```
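The shape of such a task definition can be sketched with a simplified stand-in for `crewai.Task`. The field names mirror CrewAI's `description`, `expected_output`, and `agent`, but the task wording below is an assumption, not the project's actual text:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Simplified stand-in for crewai.Task."""
    description: str
    expected_output: str
    agent: str

def create_data_engineering_task(agent: str) -> Task:
    # Hypothetical task text paraphrasing the flow above
    return Task(
        description=(
            "Examine the dataset: load the CSV, report row and column "
            "counts, and confirm the data is ready for analysis."
        ),
        expected_output="A short confirmation summary of the dataset.",
        agent=agent,
    )
```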
#### Step 3.3: Task 2 - Data Analysis (`tasks.py`)

- **File**: `tasks.py`, Lines 55-95 (`create_custom_analysis_task`)
- **Agent**: Analyst Agent
- **Execution Flow**:

```
1. Analyst Agent receives the user query + task description
2. LLM analyzes the query: "What does the user want?"
3. Agent decides which tools to use:
   - For aggregations → analyze_nba_data()
   - For searches → search_nba_data() or semantic_search_nba_data()
   - For an overview → get_nba_data_summary()
4. Tool execution examples:

   Example A: "Top 5 three-point shooters"
   - Agent generates pandas code:
     df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
   - analyze_nba_data() executes the code
   - Returns a DataFrame with the results
   - LLM formats the output: "Top 5: Player1 (X), Player2 (Y)..."

   Example B: "Find LeBron James games"
   - Agent uses search_nba_data(query="LeBron James")
   - Tool filters the CSV, returns matching rows
   - LLM analyzes the results, provides insights

   Example C: "High scoring games"
   - Agent uses semantic_search_nba_data("high scoring games")
   - Vector DB finds semantically similar records
   - Returns top matches with similarity scores
   - LLM provides analysis

5. LLM generates the final analysis report
6. Task complete → Output stored
```
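Example A's aggregation can be tried directly on a toy frame; the `Player` and `3P` column names follow the examples above, while the data itself is made up:

```python
import pandas as pd

# Toy game-log rows standing in for the uploaded CSV
df = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "B"],
    "3P":     [5,   3,   4,   2,   1],
})

# The exact expression the Analyst agent would pass to analyze_nba_data()
top = df.groupby("Player")["3P"].sum().sort_values(ascending=False).head(5)
# Player A totals 9 threes (5 + 4), so A ranks first
```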
#### Step 3.4: Task 3 - Storytelling (`tasks.py`)

- **File**: `tasks.py`, Lines 98-130 (`create_storyteller_task`)
- **Agent**: Storyteller Agent
- **Dependency**: Waits for the Analyst task to complete
- **Execution Flow**:

```
1. Storyteller Agent receives the Analyst's output as context
2. LLM processes: "Create an engaging headline and story"
3. No tools used (LLM only)
4. LLM generates:
   - A catchy headline
   - An engaging narrative
   - Context and insights
5. Task complete → Output stored
```
---

### **Phase 4: Tool Execution Details**

#### Tool 1: `read_nba_data(limit)` (`tools.py`, Lines 22-30)

```
Input: limit (number of rows)
Execution:
  1. pd.read_csv(data_path)
  2. df.head(limit)
  3. Format as a string
Output: Sample rows with column names
```
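Those three steps amount to something like the following self-contained sketch; the inline CSV string stands in for the file at `data_path`:

```python
import io
import pandas as pd

# Inline sample standing in for the CSV at data_path
CSV = (
    "Player,Team,PTS\n"
    "LeBron James,LAL,28\n"
    "Stephen Curry,GSW,31\n"
    "Luka Doncic,DAL,33\n"
)

def read_nba_data(limit: int = 5) -> str:
    df = pd.read_csv(io.StringIO(CSV))  # real tool: pd.read_csv(data_path)
    return df.head(limit).to_string(index=False)
```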
#### Tool 2: `search_nba_data(query, column, value)` (`tools.py`, Lines 32-71)

```
Input: query (text), column (name), value (filter)
Execution:
  1. pd.read_csv(data_path)
  2. Apply filters if provided
  3. Text search across columns
  4. Limit to 50 rows max
Output: Filtered DataFrame as a string
```
#### Tool 3: `get_nba_data_summary()` (`tools.py`, Lines 73-94)

```
Input: None
Execution:
  1. pd.read_csv(data_path)
  2. Calculate: total rows, columns, unique players/teams
  3. Get the date range
  4. Identify numeric columns
  5. Show sample rows
Output: Comprehensive dataset summary
```
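A condensed sketch of the same summary logic; inline data again stands in for the real CSV, and the real tool's output formatting may differ:

```python
import io
import pandas as pd

CSV = (
    "Player,Team,PTS\n"
    "LeBron James,LAL,28\n"
    "Stephen Curry,GSW,31\n"
    "LeBron James,LAL,30\n"
)

def get_nba_data_summary() -> str:
    df = pd.read_csv(io.StringIO(CSV))
    numeric = df.select_dtypes("number").columns.tolist()
    return (
        f"Rows: {len(df)}\n"
        f"Columns: {', '.join(df.columns)}\n"
        f"Unique players: {df['Player'].nunique()}\n"
        f"Numeric columns: {', '.join(numeric)}"
    )
```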
#### Tool 4: `semantic_search_nba_data(query)` (`tools.py`, Lines 135-175)

```
Input: query (natural language)
Execution:
  1. Get the vector_db instance (vector_db.py)
  2. Check if indexed (if not, index the CSV)
  3. Generate an embedding for the query
  4. Search in ChromaDB
  5. Return the top N similar records
  6. Load the original CSV rows
Output: Similar records with metadata
```
**Vector DB Indexing** (`vector_db.py`, Lines 94-156):

```
First time only:
  1. Load the SentenceTransformer model
  2. Read the CSV
  3. For each row:
     - Convert to text: "Player: X, Team: Y, Points: Z..."
     - Generate an embedding
     - Store in ChromaDB with metadata
  4. Persist to disk (chroma_db/)
```
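The row-to-text step can be sketched on its own; the real pipeline then feeds each string to `SentenceTransformer.encode()` and stores the vectors in ChromaDB, which this stdlib-only sketch omits:

```python
# Convert one CSV row (as a dict) into the text that gets embedded
def row_to_text(row: dict) -> str:
    return ", ".join(f"{key}: {value}" for key, value in row.items())

rows = [
    {"Player": "LeBron James", "Team": "LAL", "PTS": 28},
    {"Player": "Stephen Curry", "Team": "GSW", "PTS": 31},
]
documents = [row_to_text(r) for r in rows]
```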
#### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py`, Lines 203-253)

```
Input: pandas_code (a string of pandas operations)
Execution:
  1. Load the CSV into DataFrame 'df'
  2. Create a restricted namespace: {'pd': pandas, 'df': df}
  3. Execute: exec(f"result = {pandas_code}", namespace)
  4. Get 'result' from the namespace
  5. Format the output:
     - DataFrame → to_string()
     - Series → to_string()
     - Limit to 50 rows if large
Output: Analysis results as a string
```
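A runnable sketch of this exec-based approach; note that `exec` is only lightly sandboxed here, as in the description above, so treat it as an illustration rather than a hardened implementation:

```python
import io
import pandas as pd

CSV = "Player,3P\nA,5\nB,3\nA,4\n"

def analyze_nba_data(pandas_code: str) -> str:
    df = pd.read_csv(io.StringIO(CSV))  # real tool: pd.read_csv(data_path)
    namespace = {"pd": pd, "df": df}    # restricted execution namespace
    exec(f"result = {pandas_code}", namespace)
    result = namespace["result"]
    if isinstance(result, (pd.DataFrame, pd.Series)):
        return result.head(50).to_string()  # cap large outputs
    return str(result)
```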
---

### **Phase 5: LLM Interaction**

#### LLM Call Flow (`config.py` → LLM API)

```
1. Agent needs to process a task
2. Calls llm.call(prompt, ...)
3. config.py routes to the provider:

   Hugging Face:
   - Format: huggingface/{model_name}
   - API: https://api-inference.huggingface.co
   - Request: POST with the prompt
   - Response: Generated text

   Ollama:
   - Base URL: http://localhost:11434/v1
   - OpenAI-compatible API
   - Request: POST /chat/completions
   - Response: Generated text

   OpenRouter:
   - Base URL: https://openrouter.ai/api/v1
   - Request: POST with the model name
   - Response: Generated text

4. LLM generates a response
5. Response is returned to the agent
6. Agent processes the response
7. Agent decides the next action (use a tool? finish? ask for clarification?)
```
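The routing in step 3 can be sketched as follows; the real `get_llm()` returns a CrewAI `LLM` instance rather than a plain dict, so the return values here are placeholders:

```python
import os

def get_llm() -> dict:
    provider = os.getenv("LLM_PROVIDER", "huggingface")
    if provider == "huggingface":
        model = os.getenv("HF_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
        return {"model": f"huggingface/{model}"}
    if provider == "ollama":
        return {
            "model": os.getenv("OLLAMA_MODEL", "mistral"),
            "base_url": "http://localhost:11434/v1",  # OpenAI-compatible endpoint
        }
    raise ValueError(f"Unknown LLM_PROVIDER: {provider}")
```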
---

### **Phase 6: Result Aggregation**

#### Result Collection (`app.py`, Lines 39-80)

```
After crew.kickoff() completes:
1. Extract task outputs:
   - result.tasks_output[0] → Engineer result
   - result.tasks_output[1] → Analyst result
   - result.tasks_output[2] → Storyteller result
2. Format the output:
   - Add headers: "## Engineer Agent Results"
   - Add separators: "---"
   - Combine all outputs
3. Store the engineer result for reuse
4. Return the formatted string to Gradio
```
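The formatting step can be sketched like this (the function name and exact header strings are assumptions based on the description above):

```python
def format_results(tasks_output: list) -> str:
    """Combine the three task outputs into one Markdown string."""
    headers = [
        "Engineer Agent Results",
        "Analyst Agent Results",
        "Storyteller Agent Results",
    ]
    sections = [f"## {h}\n\n{out}" for h, out in zip(headers, tasks_output)]
    return "\n\n---\n\n".join(sections)
```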
#### Gradio Display (`app.py`, Lines 200-340)

```
1. User sees the results in the output textbox
2. Engineer result is stored in hidden state
3. It can be reused for follow-up questions
```
---

## Task Ordering

### How Tasks Are Ordered (`crew.py`, Lines 69-104)

```
Time →
│
├─ Task 1: Engineer
│   └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst
│   └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (consumes the Analyst's output)
    └─ Uses: LLM only (no tools)
```

**Key Points**:
- The crew is created with `process=Process.sequential`, so tasks run one after another: Engineer → Analyst → Storyteller
- The Storyteller depends on the Analyst's output, which is passed to it as context
- CrewAI handles the ordering and context passing automatically
---

## Data Flow Diagram

```
CSV File
   ↓
[pandas.read_csv()]
   ↓
DataFrame
   ↓
├── Tools (read, search, analyze)
│      ↓
│   Results → Agent → LLM → Response
│
└── Vector DB (semantic search)
       ↓
   [SentenceTransformer]
       ↓
   Embeddings
       ↓
   [ChromaDB]
       ↓
   Similar Records → Agent → LLM → Response
```
---

## Example: Complete Execution Trace

### Input:
- CSV: `nba24-25.csv`
- Query: "Who are the top 5 three-point shooters?"

### Execution:
1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
3. **agents.py**: Create the Engineer, Analyst, and Storyteller agents
4. **config.py**: `get_llm()` → Returns the Hugging Face LLM
5. **crew.kickoff()** starts
6. **Task 1 (Engineer)**:
   - Agent: "I need to check the dataset"
   - Tool: `get_nba_data_summary()`
   - Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
   - LLM: "Dataset loaded. 5000 rows, ready for analysis."
7. **Task 2 (Analyst)** - Runs next:
   - Agent: "The user wants the top 5 three-point shooters"
   - Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
   - Execution:
     ```python
     df = pd.read_csv("nba24-25.csv")
     result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
     # Returns: Player1: 250, Player2: 245, ...
     ```
   - LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."
8. **Task 3 (Storyteller)** - After the Analyst:
   - Agent receives the Analyst's output
   - LLM: "**Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."
9. **app.py**: Combine all outputs
10. **Gradio**: Display to the user
---

## Key Configuration Points

### LLM Provider Selection (`config.py`)
- Environment variable: `LLM_PROVIDER`
- Options: `huggingface`, `ollama`, `openrouter`, `openai`
- Default: `huggingface`

### Model Selection
- Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
- Ollama: `OLLAMA_MODEL` (default: `mistral`)
- OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)

### Data Path
- Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
- Can be overridden by the uploaded file
| ## π Error Handling | |
| ### At Each Level: | |
| 1. **app.py** (Lines 82-86): | |
| - Try/except around `crew.kickoff()` | |
| - Returns error message with traceback | |
| 2. **Tools** (tools.py): | |
| - Each tool has try/except | |
| - Returns error message if fails | |
| 3. **Vector DB** (vector_db.py): | |
| - Handles missing files | |
| - Creates directory if needed | |
| - Handles indexing errors | |
| 4. **LLM** (config.py): | |
| - Validates API keys | |
| - Raises ValueError if missing | |
| - Handles API errors | |
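The top-level pattern in app.py can be sketched as a small wrapper (the function name is an assumption):

```python
import traceback

def run_crew_safely(kickoff) -> str:
    """Run a crew kickoff callable, returning either the result or a traceback."""
    try:
        return str(kickoff())
    except Exception:
        return "Analysis failed:\n" + traceback.format_exc()
```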
---

## Summary

**Input Flow**:
```
User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
```

**Output Flow**:
```
LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
```

**Key Points**:
- All agents share the same LLM instance
- Tools are stateless (they re-read the CSV on each call)
- The vector DB is persistent (indexed once, reused)
- Tasks run sequentially (`Process.sequential`): Engineer → Analyst → Storyteller
- Results are aggregated and formatted in app.py

---

**Last Updated**: Based on the current codebase structure
**Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py