# Detailed Execution Flow - NBA Analysis Application

This document explains, step by step, how user input flows through the application and how each stage executes.

---

## 🎯 High-Level Flow Overview

```
User Input (CSV + Query)
    ↓
app.py (Gradio Interface)
    ↓
crew.py (CrewAI Orchestration)
    ↓
agents.py (AI Agents)
    ↓
tasks.py (Task Definitions)
    ↓
tools.py (Data Access Tools)
    ↓
vector_db.py / pandas (Data Processing)
    ↓
config.py (LLM Configuration)
    ↓
LLM API (Hugging Face / Ollama / etc.)
    ↓
Results → User
```

---

## 📋 Detailed Step-by-Step Execution

### **Phase 1: User Input & Initialization**

#### Step 1.1: User Interaction (`app.py`)
- **File**: `app.py`
- **Function**: `process_file_and_analyze()` or `process_question_only()`
- **Input**: 
  - CSV file (uploaded via Gradio)
  - User query (optional text)
- **What happens**:
  ```python
  # Line 23-24: Validate file exists
  if file is None:
      return "Please upload a CSV file."
  
  # Line 27-28: Set default query if empty
  if not user_query:
      user_query = "Provide comprehensive analysis..."
  
  # Line 32-33: Extract file path
  file_path = file.name
  csv_path = file_path
  ```

#### Step 1.2: Crew Creation (`crew.py`)
- **File**: `crew.py`
- **Function**: `create_flow_crew(user_query, csv_path)`
- **What happens**:
  ```python
  # Line 82-84: Create all agents
  engineer_agent = create_engineer_agent(csv_path)
  analyst_agent = create_analyst_agent(csv_path)
  storyteller_agent = create_storyteller_agent()
  
  # Line 88-94: Create tasks
  data_engineering_task = create_data_engineering_task(...)
  custom_analysis_task = create_custom_analysis_task(...)
  storyteller_task = create_storyteller_task(...)
  
  # Line 99-104: Create Crew with agents and tasks
  return Crew(agents=[...], tasks=[...], process=Process.sequential)
  ```

---

### **Phase 2: Agent Initialization**

#### Step 2.1: LLM Configuration (`config.py`)
- **File**: `config.py`
- **Function**: `get_llm()`
- **What happens**:
  ```python
  # Line 13: Check provider (default: "huggingface")
  LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
  
  # Line 54-64: Create LLM instance based on provider
  if LLM_PROVIDER == "huggingface":
      return LLM(
          model=f"huggingface/{HF_MODEL}",
          api_key=HF_API_KEY
      )
  # Similar for ollama, openrouter, etc.
  ```
- **Output**: Configured LLM instance (used by all agents)

#### Step 2.2: Agent Creation (`agents.py`)
- **File**: `agents.py`
- **Functions**: `create_engineer_agent()`, `create_analyst_agent()`, `create_storyteller_agent()`
- **What happens**:

**Engineer Agent** (Lines 12-36):
  ```python
  # Line 22-23: Get data path and tools
  data_path = csv_path or NBA_DATA_PATH
  agent_tools = get_agent_tools(data_path)
  
  # Line 25-36: Create agent with:
  - role: "Data Engineer"
  - goal: Process and clean data
  - backstory: Expert data engineer description
  - llm: Shared LLM instance
  - tools: Data access tools (read, search, analyze)
  ```

**Analyst Agent** (Lines 39-69):
  ```python
  # Similar structure but with:
  - role: "Data Analyst"
  - goal: Extract insights and patterns
  - backstory: Includes instructions to use analyze_nba_data for aggregations
  - tools: Same data tools
  ```

**Storyteller Agent** (Lines 72-93):
  ```python
  - role: "Sports Storyteller"
  - goal: Create engaging headlines from analysis
  - tools: [] (no data tools, only uses LLM)
  ```

#### Step 2.3: Tools Initialization (`tools.py`)
- **File**: `tools.py`
- **Function**: `get_agent_tools(data_path)`
- **What happens**:
  ```python
  # Returns list of 5 tools:
  1. read_nba_data(limit) - Read sample rows
  2. search_nba_data(query, column, value) - Filter/search CSV
  3. get_nba_data_summary() - Get dataset overview
  4. semantic_search_nba_data(query) - Vector search
  5. analyze_nba_data(pandas_code) - Execute pandas operations
  ```
- **Note**: Each tool is wrapped with the `@tool` decorator so CrewAI can expose it to agents

---

### **Phase 3: Task Execution**

#### Step 3.1: Crew Kickoff (`app.py` → `crew.py`)
- **File**: `app.py` Line 36-37
- **What happens**:
  ```python
  crew = create_flow_crew(user_query.strip(), csv_path)
  result = crew.kickoff()  # This triggers execution
  ```

#### Step 3.2: Task 1 - Data Engineering (`tasks.py`)
- **File**: `tasks.py` Lines 8-40
- **Task**: `create_data_engineering_task()`
- **Agent**: Engineer Agent
- **Execution Flow**:
  ```
  1. Engineer Agent receives task description
  2. LLM processes task: "Examine dataset, get summary..."
  3. Agent decides to use: get_nba_data_summary()
  4. Tool execution (tools.py):
     - Reads CSV with pandas
     - Calculates stats (rows, columns, unique values)
     - Returns formatted summary
  5. LLM receives tool output
  6. LLM generates confirmation: "Dataset loaded, X rows, Y columns..."
  7. Task complete → Output stored
  ```

#### Step 3.3: Task 2 - Data Analysis (`tasks.py`)
- **File**: `tasks.py` Lines 55-95 (create_custom_analysis_task)
- **Agent**: Analyst Agent
- **Execution Flow**:
  ```
  1. Analyst Agent receives user query + task description
  2. LLM analyzes query: "What does user want?"
  3. Agent decides which tools to use:
     - For aggregations → analyze_nba_data()
     - For searches → search_nba_data() or semantic_search_nba_data()
     - For overview → get_nba_data_summary()
  
  4. Tool Execution Examples:
  
  Example A: "Top 5 three-point shooters"
    - Agent generates pandas code:
      df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
    - analyze_nba_data() executes code
    - Returns DataFrame with results
    - LLM formats output: "Top 5: Player1 (X), Player2 (Y)..."
  
  Example B: "Find LeBron James games"
    - Agent uses search_nba_data(query="LeBron James")
    - Tool filters CSV, returns matching rows
    - LLM analyzes results, provides insights
  
  Example C: "High scoring games"
    - Agent uses semantic_search_nba_data("high scoring games")
    - Vector DB finds semantically similar records
    - Returns top matches with similarity scores
    - LLM provides analysis
  
  5. LLM generates final analysis report
  6. Task complete → Output stored
  ```

#### Step 3.4: Task 3 - Storytelling (`tasks.py`)
- **File**: `tasks.py` Lines 98-130 (create_storyteller_task)
- **Agent**: Storyteller Agent
- **Dependency**: Waits for Analyst task to complete
- **Execution Flow**:
  ```
  1. Storyteller Agent receives Analyst's output as context
  2. LLM processes: "Create engaging headline and story"
  3. No tools used (only LLM)
  4. LLM generates:
     - Catchy headline
     - Engaging narrative
     - Context and insights
  5. Task complete → Output stored
  ```

---

### **Phase 4: Tool Execution Details**

#### Tool 1: `read_nba_data(limit)` (`tools.py` Lines 22-30)
```
Input: limit (number of rows)
Execution:
  1. pd.read_csv(data_path)
  2. df.head(limit)
  3. Format as string
Output: Sample rows with column names
```

#### Tool 2: `search_nba_data(query, column, value)` (`tools.py` Lines 32-71)
```
Input: query (text), column (name), value (filter)
Execution:
  1. pd.read_csv(data_path)
  2. Apply filters if provided
  3. Text search across columns
  4. Limit to 50 rows max
Output: Filtered DataFrame as string
```
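The filter-then-text-search behaviour above can be sketched in plain pandas. This is an illustrative stand-in, not the tool's actual code; the function name and sample column values are assumptions:

```python
import pandas as pd

def search_rows(df: pd.DataFrame, query: str = "", column: str = "",
                value=None, max_rows: int = 50) -> pd.DataFrame:
    """Apply an exact column filter first, then a text search over all columns."""
    out = df
    if column and value is not None:
        out = out[out[column] == value]          # structured filter
    if query:
        # Join every row into one string and search it case-insensitively
        text = out.astype(str).agg(" ".join, axis=1)
        out = out[text.str.contains(query, case=False, regex=False)]
    return out.head(max_rows)                    # cap at 50 rows, as described

games = pd.DataFrame({"Player": ["LeBron James", "Stephen Curry"],
                      "Team": ["LAL", "GSW"]})
print(search_rows(games, query="lebron"))
```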

#### Tool 3: `get_nba_data_summary()` (`tools.py` Lines 73-94)
```
Input: None
Execution:
  1. pd.read_csv(data_path)
  2. Calculate: total rows, columns, unique players/teams
  3. Get date range
  4. Identify numeric columns
  5. Show sample rows
Output: Comprehensive dataset summary
```
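The summary steps can be sketched against an in-memory frame standing in for `pd.read_csv(data_path)`; the column names here are assumptions, not necessarily the real dataset's schema:

```python
import pandas as pd

# In-memory stand-in for the CSV read; illustrative columns only.
df = pd.DataFrame({
    "Player": ["LeBron James", "Stephen Curry", "LeBron James"],
    "Team": ["LAL", "GSW", "LAL"],
    "PTS": [25, 31, 28],
})
summary = (
    f"Rows: {len(df)}, Columns: {len(df.columns)}\n"
    f"Unique players: {df['Player'].nunique()}, unique teams: {df['Team'].nunique()}\n"
    f"Numeric columns: {', '.join(df.select_dtypes('number').columns)}\n"
    f"Sample:\n{df.head(2).to_string()}"
)
print(summary)
```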

#### Tool 4: `semantic_search_nba_data(query)` (`tools.py` Lines 135-175)
```
Input: query (natural language)
Execution:
  1. Get vector_db instance (vector_db.py)
  2. Check if indexed (if not, index CSV)
  3. Generate embedding for query
  4. Search in ChromaDB
  5. Return top N similar records
  6. Load original CSV rows
Output: Similar records with metadata
```

**Vector DB Indexing** (`vector_db.py` Lines 94-156):
```
First time only:
  1. Load SentenceTransformer model
  2. Read CSV
  3. For each row:
     - Convert to text: "Player: X, Team: Y, Points: Z..."
     - Generate embedding
     - Store in ChromaDB with metadata
  4. Persist to disk (chroma_db/)
```
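The row-to-text step might look like this minimal sketch (the helper name and field order are assumptions):

```python
def row_to_text(row: dict) -> str:
    """Flatten one CSV row into the "key: value" text that gets embedded."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())

doc = row_to_text({"Player": "LeBron James", "Team": "LAL", "PTS": 28})
# doc would then be embedded with SentenceTransformer and stored in ChromaDB
```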

#### Tool 5: `analyze_nba_data(pandas_code)` (`tools.py` Lines 203-253)
```
Input: pandas_code (string of pandas operations)
Execution:
  1. Load CSV into DataFrame 'df'
  2. Create safe namespace: {'pd': pandas, 'df': df}
  3. Execute: exec(f"result = {pandas_code}", namespace)
  4. Get result from namespace
  5. Format output:
     - DataFrame → to_string()
     - Series → to_string()
     - Limit to 50 rows if large
Output: Analysis results as string
```
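The restricted-`exec` pattern above can be sketched as follows; `analyze` is an illustrative stand-in under the assumptions in the step list, not the tool's actual implementation:

```python
import pandas as pd

def analyze(df: pd.DataFrame, pandas_code: str, max_rows: int = 50) -> str:
    """Evaluate one pandas expression in a restricted namespace (sketch)."""
    namespace = {"pd": pd, "df": df}  # only pd and df are visible to the code
    try:
        exec(f"result = {pandas_code}", namespace)
    except Exception as exc:
        return f"Error executing pandas code: {exc}"
    result = namespace["result"]
    if isinstance(result, (pd.DataFrame, pd.Series)):
        return result.head(max_rows).to_string()  # cap large outputs
    return str(result)

shots = pd.DataFrame({"Player": ["A", "B", "A"], "3P": [3, 5, 4]})
print(analyze(shots, "df.groupby('Player')['3P'].sum().sort_values(ascending=False)"))
```

Note that `exec` on model-generated code is only lightly sandboxed here; the namespace restriction limits what names the code can reach but is not a security boundary.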

---

### **Phase 5: LLM Interaction**

#### LLM Call Flow (`config.py` → LLM API)
```
1. Agent needs to process task
2. Calls llm.call(prompt, ...)
3. config.py routes to provider:
   
   Hugging Face:
   - Format: huggingface/{model_name}
   - API: https://api-inference.huggingface.co
   - Request: POST with prompt
   - Response: Generated text
   
   Ollama:
   - Base URL: http://localhost:11434/v1
   - OpenAI-compatible API
   - Request: POST /chat/completions
   - Response: Generated text
   
   OpenRouter:
   - Base URL: https://openrouter.ai/api/v1
   - Request: POST with model name
   - Response: Generated text

4. LLM generates response
5. Response returned to agent
6. Agent processes response
7. Agent decides next action (use tool? finish? ask for clarification?)
```
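For the OpenAI-compatible providers (Ollama, OpenRouter), the request body has roughly this shape. This sketches only the payload, not the app's client code; the helper name is illustrative:

```python
import json

def build_chat_request(model: str, prompt: str) -> dict:
    # Minimal OpenAI-compatible /chat/completions payload, e.g. for Ollama
    # at http://localhost:11434/v1 (base URL from the flow above).
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = json.dumps(build_chat_request("mistral", "Summarize the dataset"))
```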

---

### **Phase 6: Result Aggregation**

#### Result Collection (`app.py` Lines 39-80)
```
After crew.kickoff() completes:

1. Extract task outputs:
   - result.tasks_output[0] → Engineer result
   - result.tasks_output[1] → Analyst result
   - result.tasks_output[2] → Storyteller result

2. Format output:
   - Add headers: "## Engineer Agent Results"
   - Add separators: "---"
   - Combine all outputs

3. Store engineer result for reuse

4. Return formatted string to Gradio
```
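The aggregation step might be sketched as follows (header text taken from this document; the function name is illustrative):

```python
def format_results(engineer: str, analyst: str, storyteller: str) -> str:
    """Combine the three task outputs with headers and separators (sketch)."""
    sections = [
        "## Engineer Agent Results\n\n" + engineer,
        "## Analyst Agent Results\n\n" + analyst,
        "## Storyteller Agent Results\n\n" + storyteller,
    ]
    return "\n\n---\n\n".join(sections)

report = format_results("Dataset loaded.", "Top 5 shooters: ...", "Headline: ...")
```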

#### Gradio Display (`app.py` Lines 200-340)
```
1. User sees results in output textbox
2. Engineer result stored in hidden state
3. Can be reused for follow-up questions
```

---

## 🔄 Task Execution Order

### How Tasks Are Ordered (`crew.py` Lines 69-104)

```
Time →
│
├─ Task 1: Engineer (first)
│  └─ Uses: get_nba_data_summary()
│
├─ Task 2: Analyst (second)
│  └─ Uses: analyze_nba_data() or search_nba_data()
│
└─ Task 3: Storyteller (last, needs Analyst's output)
   └─ Uses: LLM only (no tools)
```

**Key Points**:
- The Crew is created with `process=Process.sequential`, so tasks execute one after another: Engineer → Analyst → Storyteller
- Engineer and Analyst have no data dependency on each other; their order comes only from their position in the task list
- Storyteller receives the Analyst's output as context, so it must run last

---

## 📊 Data Flow Diagram

```
CSV File
    ↓
[pandas.read_csv()]
    ↓
DataFrame
    ↓
    ├─→ Tools (read, search, analyze)
    │       ↓
    │   Results → Agent → LLM → Response
    │
    └─→ Vector DB (semantic search)
            ↓
        [SentenceTransformer]
            ↓
        Embeddings
            ↓
        [ChromaDB]
            ↓
        Similar Records → Agent → LLM → Response
```

---

## 🎯 Example: Complete Execution Trace

### Input:
- CSV: `nba24-25.csv`
- Query: "Who are the top 5 three-point shooters?"

### Execution:

1. **app.py**: `process_file_and_analyze(file, "top 5 three-point shooters")`
2. **crew.py**: `create_flow_crew("top 5...", "nba24-25.csv")`
3. **agents.py**: Create Engineer, Analyst, Storyteller agents
4. **config.py**: `get_llm()` → Returns Hugging Face LLM
5. **crew.kickoff()** starts

6. **Task 1 (Engineer)**:
   - Agent: "I need to check the dataset"
   - Tool: `get_nba_data_summary()`
   - Result: "Dataset has 5000 rows, columns: Player, Team, 3P, ..."
   - LLM: "Dataset loaded. 5000 rows, ready for analysis."

7. **Task 2 (Analyst)** - Runs after the Engineer task:
   - Agent: "User wants top 5 three-point shooters"
   - Tool: `analyze_nba_data("df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)")`
   - Execution:
     ```python
     df = pd.read_csv("nba24-25.csv")
     result = df.groupby('Player')['3P'].sum().sort_values(ascending=False).head(5)
     # Returns: Player1: 250, Player2: 245, ...
     ```
   - LLM: "Top 5 three-point shooters: 1. Player1 (250), 2. Player2 (245)..."

8. **Task 3 (Storyteller)** - After Analyst:
   - Agent receives Analyst output
   - LLM: "πŸ€ **Splash Brothers Dominate: Top 5 Three-Point Sharpshooters Revealed** ..."

9. **app.py**: Combine all outputs
10. **Gradio**: Display to user

---

## 🔧 Key Configuration Points

### LLM Provider Selection (`config.py`)
- Environment variable: `LLM_PROVIDER`
- Options: `huggingface`, `ollama`, `openrouter`, `openai`
- Default: `huggingface`

### Model Selection
- Hugging Face: `HF_MODEL` (default: `meta-llama/Llama-3.1-8B-Instruct`)
- Ollama: `OLLAMA_MODEL` (default: `mistral`)
- OpenRouter: `OPENROUTER_MODEL` (default: `google/gemma-2-2b-it:free`)

### Data Path
- Default: `NBA_DATA_PATH = "nba24-25.csv"` (config.py)
- Can be overridden by uploaded file
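Reading these settings reduces to a few `os.getenv` lookups; a sketch using the defaults listed above (the dictionary itself is illustrative, not `config.py`'s actual structure):

```python
import os

# Provider selection mirrors config.py's environment-variable scheme;
# defaults are the ones documented above.
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "huggingface")
MODEL_DEFAULTS = {
    "huggingface": "meta-llama/Llama-3.1-8B-Instruct",
    "ollama": "mistral",
    "openrouter": "google/gemma-2-2b-it:free",
}
model = MODEL_DEFAULTS.get(LLM_PROVIDER)  # None for unlisted providers
```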

---

## πŸ› Error Handling

### At Each Level:

1. **app.py** (Lines 82-86):
   - Try/except around `crew.kickoff()`
   - Returns error message with traceback

2. **Tools** (tools.py):
   - Each tool has try/except
   - Returns error message if fails

3. **Vector DB** (vector_db.py):
   - Handles missing files
   - Creates directory if needed
   - Handles indexing errors

4. **LLM** (config.py):
   - Validates API keys
   - Raises ValueError if missing
   - Handles API errors
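The app-level pattern (a try/except around `crew.kickoff()` that returns an error message with a traceback) can be sketched as:

```python
import traceback

def run_safely(kickoff):
    """Wrap a kickoff callable, returning either its result or a traceback.
    Sketch of the app.py pattern; the function name is illustrative."""
    try:
        return str(kickoff())
    except Exception:
        return "Error during analysis:\n" + traceback.format_exc()
```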

---

## πŸ“ Summary

**Input Flow**:
```
User → Gradio → app.py → crew.py → agents.py → tasks.py → tools.py → data/LLM
```

**Output Flow**:
```
LLM/data → tools.py → agents.py → tasks.py → crew.py → app.py → Gradio → User
```

**Key Points**:
- All agents share the same LLM instance
- Tools are stateless (read CSV each time)
- Vector DB is persistent (indexed once, reused)
- Tasks run in a fixed order (`Process.sequential`); Storyteller additionally depends on the Analyst's output
- Results are aggregated and formatted in app.py

---

**Last Updated**: Based on current codebase structure
**Files Involved**: app.py, crew.py, agents.py, tasks.py, tools.py, vector_db.py, config.py