Stage 3: Core Logic Implementation - LLM Integration
Implemented Stage 3 core agent logic with full LLM integration:
**New Features:**
- LLM-based planning: Analyzes questions and generates execution plans
- Dynamic tool selection: Claude function calling for tool selection
- Parameter extraction: LLM extracts tool parameters from questions
- Answer synthesis: LLM generates factoid answers from evidence
- Conflict resolution: LLM evaluates contradictory information
**New Files:**
- src/agent/llm_client.py - Centralized LLM client
- test/test_llm_integration.py - 8 new LLM integration tests
- test/test_stage3_e2e.py - Manual E2E test script
**Modified Files:**
- src/agent/graph.py - Implemented all node logic (plan/execute/answer)
- AgentState schema: Added file_paths, tool_results, evidence fields
**Framework Updates:**
- Updated dev records (04-07) with new framework parameters
- Added missing parameters from framework v2026-01-02
**Test Results:**
- All 99 tests passing (6 Stage 1 + 85 Stage 2 + 8 Stage 3)
- No regressions from previous stages
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- CHANGELOG.md +35 -6
- PLAN.md +178 -8
- dev/dev_260101_04_level3_task_workflow_design.md +14 -0
- dev/dev_260101_05_level4_agent_level_design.md +7 -0
- dev/dev_260101_06_level5_component_selection.md +14 -0
- dev/dev_260101_07_level6_implementation_framework.md +14 -0
- src/agent/graph.py +128 -36
- src/agent/llm_client.py +350 -0
- test/test_llm_integration.py +287 -0
- test/test_stage3_e2e.py +77 -0
@@ -1,22 +1,51 @@
 # Session Changelog
 
-**Session Date:**
+**Session Date:** 2026-01-02
-**Dev Record:**
+**Dev Record:** dev/dev_260102_14_stage3_core_logic.md
 
 ## Changes Made
 
 ### Created Files
 
-
-
+- `src/agent/llm_client.py` - Centralized LLM client for planning, tool selection, and answer synthesis
+- `test/test_llm_integration.py` - 8 tests for LLM integration (planning, tool selection, answer synthesis)
+- `test/test_stage3_e2e.py` - Manual E2E test script for real API testing
 
 ### Modified Files
 
-
-
+- `src/agent/graph.py` - Updated AgentState schema, implemented Stage 3 logic in all nodes (plan/execute/answer)
+- `PLAN.md` - Created implementation plan for Stage 3
+- `TODO.md` - Created task tracking list for Stage 3
+- `requirements.txt` - Already includes anthropic>=0.39.0
 
 ### Deleted Files
 
-
-
+- None
 
 ## Notes
 
-
+Stage 3 Core Logic Implementation:
+
+**State Schema Updates:**
+
+- Added new state fields: file_paths, tool_results, evidence
+
+**Node Implementations:**
+
+- plan_node: LLM-based planning with dynamic tool selection
+- execute_node: LLM function calling for tool selection and parameter extraction
+- answer_node: LLM-based answer synthesis with conflict resolution
+
+**LLM Integration:**
+
+- All three nodes now use Claude Sonnet 4.5 for dynamic decision-making
+- Centralized LLM client in src/agent/llm_client.py
+- Functions: plan_question, select_tools_with_function_calling, synthesize_answer
+
+**Testing:**
+
+- Added 8 new Stage 3 tests (test_llm_integration.py)
+- All 99 tests passing (Stage 1: 6, Stage 2: 85, Stage 3: 8)
+- Created manual E2E test script for real API testing
+
+Next steps: Deploy to HuggingFace Spaces and verify with actual GAIA questions
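The llm_client functions named above (plan_question, select_tools_with_function_calling, synthesize_answer) could be sketched roughly as follows for the answer-synthesis path. This is a hedged sketch, not the project's actual code: the prompt wording, the `MODEL` id, and the function signatures are assumptions; only the Anthropic Messages API calls themselves are standard.

```python
# Hypothetical sketch of the answer-synthesis path in src/agent/llm_client.py.
# MODEL id and prompt wording are assumptions, not the project's actual values.
import os
from typing import List

MODEL = "claude-sonnet-4-5"  # assumed model id for "Claude Sonnet 4.5"

def build_synthesis_prompt(question: str, evidence: List[str]) -> str:
    """Assemble the user prompt: the question plus numbered evidence items."""
    lines = [f"Question: {question}", "", "Evidence:"]
    lines += [f"{i + 1}. {item}" for i, item in enumerate(evidence)]
    lines.append("")
    lines.append("Answer with a single factoid (a number, a few words, or a comma-separated list).")
    return "\n".join(lines)

def synthesize_answer(question: str, evidence: List[str]) -> str:
    """Call Claude to distill collected evidence into a GAIA-style factoid answer."""
    import anthropic  # requires ANTHROPIC_API_KEY in the environment
    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,      # room for reasoning, per the plan
        temperature=0,        # deterministic answers, per the plan
        system="You are an answer synthesizer. Extract a factoid answer from evidence.",
        messages=[{"role": "user", "content": build_synthesis_prompt(question, evidence)}],
    )
    return response.content[0].text.strip()
```

Keeping the prompt assembly in a separate pure function makes it testable without network access.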
@@ -1,21 +1,191 @@
-# Implementation Plan
+# Implementation Plan - Stage 3: Core Logic Implementation
 
-**Date:**
+**Date:** 2026-01-02
-**Dev Record:**
+**Dev Record:** dev/dev_260102_14_stage3_core_logic.md
-**Status:**
+**Status:** Planning
 
 ## Objective
 
-
+Implement Stage 3 core agent logic: LLM-based tool selection, parameter extraction, answer synthesis, and conflict resolution to complete the GAIA benchmark agent MVP.
 
 ## Steps
 
-
+### 1. Update Agent State Schema
+
+**File:** `src/agent/state.py`
+
+**Changes:**
+
+- Add `plan: str` field for execution plan from planning node
+- Add `tool_calls: List[Dict]` field for tracking tool invocations
+- Add `tool_results: List[Dict]` field for storing tool outputs
+- Add `evidence: List[str]` field for collecting information from tools
+- Add `conflicts: List[Dict]` field for tracking conflicting information (optional)
+
+### 2. Implement Planning Node Logic
+
+**File:** `src/agent/graph.py` - Update `plan_node` function
+
+**Current:** Placeholder that sets plan to "Stage 1 complete"
+
+**New logic:**
+
+- Accept `question` and `file_paths` from state
+- Use LLM to analyze question and determine required tools
+- Generate step-by-step execution plan
+- Identify which tools to use and what parameters to extract
+- Update state with execution plan
+- Return updated state
+
+**LLM Prompt Strategy:**
+
+- System: "You are a planning agent. Analyze the question and create an execution plan."
+- User: Provide question, available tools (from TOOLS registry), file information
+- Expected output: Structured plan with tool selection reasoning
+
+### 3. Implement Execute Node Logic
+
+**File:** `src/agent/graph.py` - Update `execute_node` function
+
+**Current:** Reports "Stage 2 complete: 4 tools ready"
+
+**New logic:**
+
+- Use LLM function calling to dynamically select tools
+- Extract parameters from question using LLM
+- Execute selected tools sequentially based on plan
+- Collect results in `tool_results` field
+- Extract evidence from each tool result
+- Handle tool failures with retry logic (already in tools)
+- Update state with tool results and evidence
+- Return updated state
+
+**LLM Function Calling Strategy:**
+
+- Define tool schemas for Claude function calling
+- Let LLM decide which tools to invoke based on question
+- LLM extracts parameters from question automatically
+- Execute tool calls and collect results
+
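The function-calling strategy described in step 3 could be sketched as follows. The tool names match the project's registry, but the descriptions and schema fields here are illustrative assumptions; only the `input_schema` shape follows the Anthropic Messages API tool format.

```python
# Hypothetical sketch of Claude function-calling tool schemas (step 3).
# Descriptions and parameter names are assumptions, not the project's actual schemas.
from typing import Any, Dict, List

def build_tool_schemas() -> List[Dict[str, Any]]:
    """Anthropic-style tool definitions the LLM can choose from."""
    return [
        {
            "name": "search",
            "description": "Search the web for factual information.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
        {
            "name": "safe_eval",
            "description": "Evaluate a Python expression for calculations.",
            "input_schema": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    ]
```

These schemas would be passed as the `tools` parameter of `client.messages.create(...)`, letting the model decide which tools to invoke and with what arguments.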
+### 4. Implement Answer Node Logic
+
+**File:** `src/agent/graph.py` - Update `answer_node` function
+
+**Current:** Placeholder that returns "This is a placeholder answer"
+
+**New logic:**
+
+- Accept evidence from execute node
+- Use LLM to synthesize factoid answer from evidence
+- Detect and resolve conflicts in evidence (LLM-based reasoning)
+- Format answer according to GAIA requirements (factoid: number/few words/comma-separated)
+- Update state with final answer
+- Return updated state
+
+**LLM Answer Synthesis Strategy:**
+
+- System: "You are an answer synthesizer. Extract factoid answer from evidence."
+- User: Provide all evidence, specify factoid format requirements
+- Conflict resolution: If evidence conflicts, LLM evaluates source credibility/recency
+- Expected output: Concise factoid answer
+
+### 5. Configure LLM Client
+
+**File:** `src/agent/llm_client.py` (NEW)
+
+**Purpose:** Centralized LLM interaction for all nodes
+
+**Functions:**
+
+- `create_client()` - Initialize Anthropic client
+- `plan_question(question, tools, files)` - Call LLM for planning
+- `select_and_execute_tools(question, plan, tools)` - Function calling for tool selection
+- `synthesize_answer(question, evidence)` - Call LLM for answer synthesis
+- `resolve_conflicts(evidence)` - Call LLM for conflict resolution (optional)
+
+**Configuration:**
+
+- Use Claude Sonnet 4.5 (as per Level 5 decision)
+- API key from environment variable
+- Temperature: 0 for deterministic answers
+- Max tokens: 4096 for reasoning
+
+### 6. Update Test Suite
+
+**Files:**
+
+- `test/test_agent.py` - Update agent tests
+- `test/test_llm_integration.py` (NEW) - Test LLM interactions with mocks
+
+**Test cases:**
+
+- Test planning node generates valid execution plan
+- Test execute node calls correct tools with correct parameters
+- Test answer node synthesizes factoid answer
+- Test conflict resolution logic
+- Test end-to-end agent workflow with mock LLM responses
+- Test error handling (tool failures, LLM timeouts)
+
+### 7. Update Requirements
+
+**File:** `requirements.txt`
+
+**Add:**
+
+- `anthropic>=0.40.0` - Claude API client
+
+### 8. Deploy and Verify
+
+**Actions:**
+
+- Commit and push to HuggingFace Spaces
+- Verify build succeeds
+- Test agent with sample GAIA questions
+- Verify output format matches GAIA requirements
 
 ## Files to Modify
 
-
+1. `src/agent/state.py` - Expand state schema for Stage 3
+2. `src/agent/graph.py` - Implement plan/execute/answer node logic
+3. `src/agent/llm_client.py` - NEW - Centralized LLM client
+4. `test/test_agent.py` - Update tests for Stage 3
+5. `test/test_llm_integration.py` - NEW - LLM integration tests
+6. `requirements.txt` - Add anthropic library
+7. `pyproject.toml` - Install anthropic via uv
 
 ## Success Criteria
 
-[
+- [ ] Planning node analyzes question and generates execution plan using LLM
+- [ ] Execute node dynamically selects tools using LLM function calling
+- [ ] Execute node extracts parameters from questions automatically
+- [ ] Execute node executes tools and collects evidence
+- [ ] Answer node synthesizes factoid answer from evidence
+- [ ] Conflict resolution handles contradictory information
+- [ ] All Stage 1 + Stage 2 tests still pass (97 tests)
+- [ ] New Stage 3 tests pass (minimum 10 new tests)
+- [ ] Agent successfully answers sample GAIA questions end-to-end
+- [ ] Output format matches GAIA factoid requirements
+- [ ] Deployment to HuggingFace Spaces succeeds
+
+## Design Alignment
+
+**Level 3:** Dynamic planning with sequential execution ✓
+**Level 4:** Goal-based reasoning, termination after answer_node ✓
+**Level 5:** LLM-generated answer synthesis, LLM-based conflict resolution ✓
+**Level 6:** LLM function calling for tool selection, LLM-based parameter extraction ✓
+
+## Stage 3 Scope
+
+**In scope:**
+
+- LLM-based planning, tool selection, parameter extraction
+- Answer synthesis and conflict resolution
+- End-to-end question answering workflow
+- GAIA factoid format compliance
+
+**Out of scope (future enhancements):**
+
+- Reflection/ReAct patterns (mentioned in Level 3 dev record)
+- Multi-turn refinement
+- Self-critique loops
+- Advanced optimization (caching, streaming)
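The mock-based testing approach from the plan's step 6 (testing node logic without real API calls) could look like this sketch. The `plan_node` here is a simplified stand-in with an injectable LLM callable, not the project's actual implementation, which the real tests would patch via `unittest.mock.patch` on `src.agent.llm_client`.

```python
# Minimal sketch of mock-based LLM integration testing (plan step 6).
# plan_node is a simplified stand-in; the real one lives in src/agent/graph.py.
from unittest.mock import MagicMock

def plan_node(state, llm):
    """Stand-in planning node: delegates plan generation to an injected LLM callable."""
    try:
        state["plan"] = llm(state["question"])
    except Exception as e:
        state["errors"].append(f"Planning error: {e}")
        state["plan"] = "Error: Unable to create plan"
    return state

# Mocked LLM returns a canned plan; no API call is made.
mock_llm = MagicMock(return_value="1. search('capital of France') 2. synthesize")
state = {"question": "What is the capital of France?", "plan": None, "errors": []}
result = plan_node(state, llm=mock_llm)

assert result["plan"].startswith("1. search")
assert result["errors"] == []
mock_llm.assert_called_once_with("What is the capital of France?")
```

The same pattern covers the error path: a mock whose `side_effect` raises lets the test assert that the node appends to `errors` instead of crashing.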
@@ -14,23 +14,34 @@ Applied Level 3 Task & Workflow Design parameters from AI Agent System Design Fr
 ## Key Decisions
 
 **Parameter 1: Task Decomposition → Dynamic planning**
+
 - **Reasoning:** GAIA questions vary widely in complexity and required tool combinations
 - **Evidence:** Cannot use static pipeline - each question requires analyzing intent, then planning multi-step approach dynamically
 - **Implication:** Agent must generate execution plan per question based on question analysis
 
 **Parameter 2: Workflow Pattern → Sequential**
+
 - **Reasoning:** Agent follows linear reasoning chain with dependencies between steps
 - **Execution flow:** (1) Parse question → (2) Plan approach → (3) Execute tool calls → (4) Synthesize factoid answer
 - **Evidence:** Each step depends on previous step's output - no parallel execution needed
 - **Implication:** Sequential workflow pattern fits question-answering nature (vs routing/orchestrator-worker for multi-agent)
 
+**Parameter 3: Task Prioritization → N/A (single task processing)**
+
+- **Reasoning:** GAIA benchmark processes one question at a time in zero-shot evaluation
+- **Evidence:** No multi-task scheduling required - agent answers one question per invocation
+- **Implication:** No task queue, priority system, or LLM-based scheduling needed
+- **Alignment:** Matches zero-shot stateless design (Level 1, Level 5)
+
 **Rejected alternatives:**
+
 - Static pipeline: Cannot handle diverse GAIA question types requiring different tool combinations
 - Reactive decomposition: Less efficient than planning upfront for factoid question-answering
 - Parallel workflow: GAIA reasoning chains have linear dependencies
 - Routing pattern: Inappropriate for single-agent architecture (Level 2 decision)
 
 **Future experimentation:**
+
 - **Reflection pattern:** Self-critique and refinement loops for improved answer quality
 - **ReAct pattern:** Reasoning-Action interleaving for more adaptive execution
 - **Current MVP:** Sequential + Dynamic planning for baseline performance
@@ -40,9 +51,11 @@ Applied Level 3 Task & Workflow Design parameters from AI Agent System Design Fr
 Established MVP workflow architecture: Dynamic planning with sequential execution. Agent analyzes each question, generates step-by-step plan, executes tools sequentially, synthesizes factoid answer.
 
 **Deliverables:**
+
 - `dev/dev_260101_04_level3_task_workflow_design.md` - Level 3 workflow design decisions
 
 **Workflow Specifications:**
+
 - **Task Decomposition:** Dynamic planning per question
 - **Execution Pattern:** Sequential reasoning chain
 - **Future Enhancement:** Reflection/ReAct patterns for advanced iterations
@@ -58,6 +71,7 @@ Established MVP workflow architecture: Dynamic planning with sequential executio
 ## Changelog
 
 **What was changed:**
+
 - Created `dev/dev_260101_04_level3_task_workflow_design.md` - Level 3 task & workflow design decisions
 - Referenced AI Agent System Design Framework (2026-01-01).pdf Level 3 parameters
 - Documented future experimentation plans (Reflection/ReAct patterns)
@@ -35,6 +35,13 @@ Applied Level 4 Agent-Level Design parameters from AI Agent System Design Framew
 - **Reasoning:** Single-agent architecture eliminates need for inter-agent communication
 - **Implication:** No message passing, shared state, or event-driven protocols required
 
+**Parameter 5: Termination Logic → Fixed steps (3-node workflow)**
+
+- **Reasoning:** Sequential workflow (Level 3) defines clear termination point after answer_node
+- **Execution flow:** plan_node → execute_node → answer_node → END
+- **Evidence:** 3-node LangGraph workflow terminates after final answer synthesis
+- **Implication:** No LLM-based completion detection needed - workflow structure defines termination
+- **Alignment:** Matches sequential workflow pattern (Level 3)
+
 **Rejected alternatives:**
 - Fine-grained specialists: Would require multi-agent architecture, rejected in Level 2
 - Simple Reflex agent: Insufficient reasoning capability for multi-step GAIA questions
@@ -57,6 +57,20 @@ Applied Level 5 Component Selection parameters from AI Agent System Design Frame
 - **Minimal constraints:** No heavy content filtering for MVP (learning context)
 - **Safety focus:** Format compliance and execution safety, not content policy enforcement
 
+**Parameter 5: Answer Synthesis → LLM-generated (Stage 3 implementation)**
+
+- **Reasoning:** GAIA requires extracting factoid answers from multi-source evidence
+- **Evidence:** Answers must synthesize information from web searches, code outputs, file contents
+- **Implication:** LLM must reason about evidence and generate final answer (not template-based)
+- **Stage alignment:** Core logic implementation in Stage 3 (beyond MVP tool integration)
+- **Capability requirement:** LLM must distill complex evidence into concise factoid format
+
+**Parameter 6: Conflict Resolution → LLM-based reasoning (Stage 3 implementation)**
+
+- **Reasoning:** Multi-source evidence may contain conflicting information requiring judgment
+- **Example scenarios:** Conflicting search results, outdated vs current information, contradictory sources
+- **Implication:** LLM must evaluate source credibility and recency to resolve conflicts
+- **Stage alignment:** Decision logic in Stage 3 (not needed for Stage 2 tool integration)
+- **Alternative rejected:** Latest wins / Source priority too simplistic for GAIA evidence evaluation
+
 **Rejected alternatives:**
 
 - Vector stores/RAG: Unnecessary for stateless question-answering
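The LLM-based conflict resolution of Parameter 6 above boils down to a prompt that asks the model to weigh credibility and recency before answering. A minimal sketch, assuming a simple prompt-builder shape (the wording is illustrative, not the project's actual prompt):

```python
# Hypothetical sketch of a conflict-resolution prompt builder (Parameter 6).
# Prompt wording is an assumption; the project's real prompt may differ.
from typing import List

def build_conflict_prompt(question: str, evidence: List[str]) -> str:
    """Ask the LLM to weigh credibility/recency of disagreeing evidence items."""
    numbered = "\n".join(f"[{i + 1}] {item}" for i, item in enumerate(evidence))
    return (
        f"Question: {question}\n\n"
        f"The following evidence items disagree:\n{numbered}\n\n"
        "Weigh source credibility and recency, state which item you trust and why, "
        "then give a single factoid answer."
    )
```

The numbered `[N]` labels let the model cite which evidence item it trusted, which also makes the reasoning auditable in logs.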
@@ -52,6 +52,20 @@ Applied Level 6 Implementation Framework parameters from AI Agent System Design
 - Easy testing and tool swapping
 - **Implementation:** MCP server for tools (web/code/file/vision) + function calling interface
 
+**Parameter 5: Tool Selection Mechanism → LLM function calling (Stage 3 implementation)**
+
+- **Reasoning:** Dynamic tool selection required for diverse GAIA question types
+- **Evidence:** Questions require different tool combinations - LLM must reason about which tools to invoke
+- **Implementation:** Claude function calling enables LLM to select appropriate tools based on question analysis
+- **Stage alignment:** Core decision logic in Stage 3 (beyond MVP tool integration)
+- **Alternative rejected:** Static routing insufficient - cannot predetermine tool sequences for all GAIA questions
+
+**Parameter 6: Parameter Extraction → LLM-based parsing (Stage 3 implementation)**
+
+- **Reasoning:** Tool parameters must be extracted from natural language questions
+- **Example:** Question "What's the population of Tokyo?" → extract "Tokyo" as location parameter for search tool
+- **Implementation:** LLM interprets question and generates appropriate tool parameters
+- **Stage alignment:** Decision logic in Stage 3 (LLM reasoning about parameter values)
+- **Alternative rejected:** Structured input not applicable - GAIA provides natural language questions, not structured data
+
 **Rejected alternatives:**
 - Database-backed state: Violates stateless design, adds complexity
 - Distributed cache: Unnecessary for single-instance deployment
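Concretely, LLM-based parameter extraction via function calling means reading the `tool_use` blocks Claude returns. A hedged sketch using the Tokyo example from Parameter 6: the block shapes follow the Anthropic Messages API, but the sample values are made up for illustration.

```python
# Hypothetical illustration of LLM-based parameter extraction (Parameter 6).
# response_content mimics Anthropic tool-use content blocks; values are made up.
from typing import Any, Dict, List

def extract_tool_calls(response_content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only tool_use blocks and map each one to a {tool, params} pair."""
    return [
        {"tool": block["name"], "params": block["input"]}
        for block in response_content
        if block.get("type") == "tool_use"
    ]

# For "What's the population of Tokyo?", the model might emit:
response_content = [
    {"type": "text", "text": "I will search for this."},
    {"type": "tool_use", "id": "toolu_01", "name": "search",
     "input": {"query": "population of Tokyo"}},
]
calls = extract_tool_calls(response_content)
# calls == [{"tool": "search", "params": {"query": "population of Tokyo"}}]
```

The model, not a hand-written parser, decided that "Tokyo" belongs in the search query; the agent only validates and dispatches the resulting `{tool, params}` pairs.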
|
@@ -17,7 +17,8 @@ import logging
|
|
| 17 |
from typing import TypedDict, List, Optional
|
| 18 |
from langgraph.graph import StateGraph, END
|
| 19 |
from src.config import Settings
|
| 20 |
-
from src.tools import TOOLS
|
|
|
|
| 21 |
|
| 22 |
# ============================================================================
|
| 23 |
# Logging Setup
|
|
@@ -35,8 +36,11 @@ class AgentState(TypedDict):
|
|
| 35 |
Tracks question processing from input through planning, execution, to final answer.
|
| 36 |
"""
|
| 37 |
question: str # Input question from GAIA
|
|
|
|
| 38 |
plan: Optional[str] # Generated execution plan (Stage 3)
|
| 39 |
-
tool_calls: List[dict] # Tool
|
|
|
|
|
|
|
| 40 |
answer: Optional[str] # Final factoid answer
|
| 41 |
errors: List[str] # Error messages from failures
|
| 42 |
|
|
@@ -49,8 +53,10 @@ def plan_node(state: AgentState) -> AgentState:
|
|
| 49 |
"""
|
| 50 |
Planning node: Analyze question and generate execution plan.
|
| 51 |
|
| 52 |
-
Stage 2: Basic tool listing
|
| 53 |
Stage 3: Dynamic planning with LLM
|
|
|
|
|
|
|
|
|
|
| 54 |
|
| 55 |
Args:
|
| 56 |
state: Current agent state with question
|
|
@@ -60,11 +66,21 @@ def plan_node(state: AgentState) -> AgentState:
|
|
| 60 |
"""
|
| 61 |
logger.info(f"[plan_node] Question received: {state['question'][:100]}...")
|
| 62 |
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
|
| 69 |
return state
|
| 70 |
|
|
@@ -73,34 +89,90 @@ def execute_node(state: AgentState) -> AgentState:
|
|
| 73 |
"""
|
| 74 |
Execution node: Execute tools based on plan.
|
| 75 |
|
| 76 |
-
Stage
|
| 77 |
-
|
|
|
|
|
|
|
|
|
|
| 78 |
|
| 79 |
Args:
|
| 80 |
state: Current agent state with plan
|
| 81 |
|
| 82 |
Returns:
|
| 83 |
-
Updated state with tool execution results
|
| 84 |
"""
|
| 85 |
logger.info(f"[execute_node] Executing tools - Plan: {state['plan'][:100]}...")
|
| 86 |
|
| 87 |
-
#
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 104 |
|
| 105 |
return state
|
| 106 |
|
|
@@ -109,22 +181,39 @@ def answer_node(state: AgentState) -> AgentState:
|
|
| 109 |
"""
|
| 110 |
Answer synthesis node: Generate final factoid answer.
|
| 111 |
|
| 112 |
-
Stage
|
| 113 |
-
|
|
|
|
|
|
|
| 114 |
|
| 115 |
Args:
|
| 116 |
-
state: Current agent state with
|
| 117 |
|
| 118 |
Returns:
|
| 119 |
-
Updated state with final answer
|
| 120 |
"""
|
| 121 |
-
logger.info(f"[answer_node] Processing {len(state['
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 122 |
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
state["answer"] = f"Stage 2 complete: {len(ready_tools)} tools ready for execution in Stage 3"
|
| 126 |
|
| 127 |
-
|
|
|
|
|
|
|
|
|
|
| 128 |
|
| 129 |
return state
|
| 130 |
|
|
@@ -199,8 +288,11 @@ class GAIAAgent:
|
|
| 199 |
# Initialize state
|
| 200 |
initial_state: AgentState = {
|
| 201 |
"question": question,
|
|
|
|
| 202 |
"plan": None,
|
| 203 |
"tool_calls": [],
|
|
|
|
|
|
|
| 204 |
"answer": None,
|
| 205 |
"errors": []
|
| 206 |
}
|
|
|
|
| 17 |
from typing import TypedDict, List, Optional
|
| 18 |
from langgraph.graph import StateGraph, END
|
| 19 |
from src.config import Settings
|
| 20 |
+
from src.tools import TOOLS, search, parse_file, safe_eval, analyze_image
|
| 21 |
+
src/agent/graph.py

+from src.agent.llm_client import plan_question, select_tools_with_function_calling, synthesize_answer

 # ============================================================================
 # Logging Setup
...
     Tracks question processing from input through planning, execution, to final answer.
     """
     question: str  # Input question from GAIA
+    file_paths: Optional[List[str]]  # Optional file paths for file-based questions
     plan: Optional[str]  # Generated execution plan (Stage 3)
+    tool_calls: List[dict]  # Tool invocation tracking (Stage 3)
+    tool_results: List[dict]  # Tool execution results (Stage 3)
+    evidence: List[str]  # Evidence collected from tools (Stage 3)
     answer: Optional[str]  # Final factoid answer
     errors: List[str]  # Error messages from failures
...
     """
     Planning node: Analyze question and generate execution plan.

     Stage 3: Dynamic planning with LLM
+    - LLM analyzes question and available tools
+    - Generates step-by-step execution plan
+    - Identifies which tools to use and in what order

     Args:
         state: Current agent state with question
...
     """
     logger.info(f"[plan_node] Question received: {state['question'][:100]}...")

+    try:
+        # Stage 3: Use LLM to generate dynamic execution plan
+        plan = plan_question(
+            question=state["question"],
+            available_tools=TOOLS,
+            file_paths=state.get("file_paths")
+        )

+        state["plan"] = plan
+        logger.info(f"[plan_node] Plan created ({len(plan)} chars)")
+
+    except Exception as e:
+        logger.error(f"[plan_node] Planning failed: {e}")
+        state["errors"].append(f"Planning error: {str(e)}")
+        state["plan"] = "Error: Unable to create plan"

     return state
...
     """
     Execution node: Execute tools based on plan.

+    Stage 3: Dynamic tool selection and execution
+    - LLM selects tools via function calling
+    - Extracts parameters from question
+    - Executes tools and collects results
+    - Handles errors with retry logic (in tools)

     Args:
         state: Current agent state with plan

     Returns:
+        Updated state with tool execution results and evidence
     """
     logger.info(f"[execute_node] Executing tools - Plan: {state['plan'][:100]}...")

+    # Map tool names to actual functions
+    TOOL_FUNCTIONS = {
+        "search": search,
+        "parse_file": parse_file,
+        "safe_eval": safe_eval,
+        "analyze_image": analyze_image
+    }
+
+    try:
+        # Stage 3: Use LLM function calling to select tools and extract parameters
+        tool_calls = select_tools_with_function_calling(
+            question=state["question"],
+            plan=state["plan"],
+            available_tools=TOOLS
+        )
+
+        logger.info(f"[execute_node] LLM selected {len(tool_calls)} tool(s) to execute")
+
+        # Execute each tool call
+        tool_results = []
+        evidence = []
+
+        for tool_call in tool_calls:
+            tool_name = tool_call["tool"]
+            params = tool_call["params"]
+
+            logger.info(f"[execute_node] Executing {tool_name} with params: {params}")
+
+            try:
+                # Get tool function
+                tool_func = TOOL_FUNCTIONS.get(tool_name)
+                if not tool_func:
+                    raise ValueError(f"Tool '{tool_name}' not found in TOOL_FUNCTIONS")
+
+                # Execute tool
+                result = tool_func(**params)
+
+                # Store result
+                tool_results.append({
+                    "tool": tool_name,
+                    "params": params,
+                    "result": result,
+                    "status": "success"
+                })
+
+                # Extract evidence
+                evidence.append(f"[{tool_name}] {result}")
+
+                logger.info(f"[execute_node] {tool_name} executed successfully")
+
+            except Exception as tool_error:
+                logger.error(f"[execute_node] Tool {tool_name} failed: {tool_error}")
+                tool_results.append({
+                    "tool": tool_name,
+                    "params": params,
+                    "error": str(tool_error),
+                    "status": "failed"
+                })
+                state["errors"].append(f"Tool {tool_name} failed: {str(tool_error)}")
+
+        # Update state
+        state["tool_calls"] = tool_calls
+        state["tool_results"] = tool_results
+        state["evidence"] = evidence
+
+        logger.info(f"[execute_node] Executed {len(tool_results)} tool(s), collected {len(evidence)} evidence items")
+
+    except Exception as e:
+        logger.error(f"[execute_node] Execution failed: {e}")
+        state["errors"].append(f"Execution error: {str(e)}")

     return state
...
     """
     Answer synthesis node: Generate final factoid answer.

+    Stage 3: Synthesize answer from evidence
+    - LLM analyzes collected evidence
+    - Resolves conflicts if present
+    - Generates factoid answer in GAIA format

     Args:
+        state: Current agent state with evidence from tools

     Returns:
+        Updated state with final factoid answer
     """
+    logger.info(f"[answer_node] Processing {len(state['evidence'])} evidence items")
+
+    try:
+        # Check if we have evidence
+        if not state["evidence"]:
+            logger.warning("[answer_node] No evidence collected, cannot generate answer")
+            state["answer"] = "Unable to answer: No evidence collected"
+            return state
+
+        # Stage 3: Use LLM to synthesize factoid answer from evidence
+        answer = synthesize_answer(
+            question=state["question"],
+            evidence=state["evidence"]
+        )

+        state["answer"] = answer
+        logger.info(f"[answer_node] Answer generated: {answer}")

+    except Exception as e:
+        logger.error(f"[answer_node] Answer synthesis failed: {e}")
+        state["errors"].append(f"Answer synthesis error: {str(e)}")
+        state["answer"] = "Error: Unable to generate answer"

     return state
...
     # Initialize state
     initial_state: AgentState = {
         "question": question,
+        "file_paths": None,
         "plan": None,
         "tool_calls": [],
+        "tool_results": [],
+        "evidence": [],
         "answer": None,
         "errors": []
     }
src/agent/llm_client.py (new file)
@@ -0,0 +1,350 @@
"""
LLM Client Module - Centralized Claude API Interactions
Author: @mangobee
Date: 2026-01-02

Handles all LLM calls for:
- Planning (question analysis and execution plan generation)
- Tool selection (function calling)
- Answer synthesis (factoid answer generation from evidence)
- Conflict resolution (evaluating contradictory information)

Based on Level 5 decision: Claude Sonnet 4.5 as primary LLM
Based on Level 6 decision: LLM function calling for tool selection
"""

import os
import logging
from typing import List, Dict, Optional, Any
from anthropic import Anthropic

# ============================================================================
# CONFIG
# ============================================================================

# LLM Configuration
MODEL_NAME = "claude-sonnet-4-5-20250929"
TEMPERATURE = 0  # Deterministic for factoid answers
MAX_TOKENS = 4096

# ============================================================================
# Logging Setup
# ============================================================================
logger = logging.getLogger(__name__)

# ============================================================================
# Client Initialization
# ============================================================================

def create_client() -> Anthropic:
    """
    Initialize Anthropic client with API key from environment.

    Returns:
        Anthropic client instance

    Raises:
        ValueError: If ANTHROPIC_API_KEY not set
    """
    api_key = os.getenv("ANTHROPIC_API_KEY")
    if not api_key:
        raise ValueError("ANTHROPIC_API_KEY environment variable not set")

    logger.info(f"Initializing Anthropic client with model: {MODEL_NAME}")
    return Anthropic(api_key=api_key)


# ============================================================================
# Planning Functions
# ============================================================================

def plan_question(
    question: str,
    available_tools: Dict[str, Dict],
    file_paths: Optional[List[str]] = None
) -> str:
    """
    Analyze question and generate execution plan using LLM.

    LLM determines:
    - Which tools are needed
    - What order to execute them
    - What parameters to extract from question
    - Expected reasoning steps

    Args:
        question: GAIA question text
        available_tools: Tool registry (name -> {description, category, parameters})
        file_paths: Optional list of file paths for file-based questions

    Returns:
        Execution plan as structured text
    """
    client = create_client()

    # Format tool information
    tool_descriptions = []
    for name, info in available_tools.items():
        tool_descriptions.append(
            f"- {name}: {info['description']} (Category: {info['category']})"
        )
    tools_text = "\n".join(tool_descriptions)

    # File context
    file_context = ""
    if file_paths:
        file_context = "\n\nAvailable files:\n" + "\n".join([f"- {fp}" for fp in file_paths])

    # Prompt for planning
    system_prompt = """You are a planning agent for answering complex questions.

Your task is to analyze the question and create a step-by-step execution plan.

Consider:
1. What information is needed to answer the question?
2. Which tools can provide that information?
3. In what order should tools be executed?
4. What parameters need to be extracted from the question?

Generate a concise plan with numbered steps."""

    user_prompt = f"""Question: {question}{file_context}

Available tools:
{tools_text}

Create an execution plan to answer this question. Format as numbered steps."""

    logger.info("[plan_question] Calling LLM for planning")

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

    plan = response.content[0].text
    logger.info(f"[plan_question] Generated plan ({len(plan)} chars)")

    return plan


# ============================================================================
# Tool Selection and Execution Functions
# ============================================================================

def select_tools_with_function_calling(
    question: str,
    plan: str,
    available_tools: Dict[str, Dict]
) -> List[Dict[str, Any]]:
    """
    Use Claude function calling to dynamically select tools and extract parameters.

    LLM decides:
    - Which tools to call
    - What parameters to pass to each tool
    - Order of tool execution

    Args:
        question: GAIA question text
        plan: Execution plan from planning phase
        available_tools: Tool registry

    Returns:
        List of tool calls with extracted parameters:
        [{"tool": "search", "params": {"query": "..."}}, ...]
    """
    client = create_client()

    # Convert tool registry to Claude function calling format
    tool_schemas = []
    for name, info in available_tools.items():
        tool_schemas.append({
            "name": name,
            "description": info["description"],
            "input_schema": {
                "type": "object",
                "properties": info.get("parameters", {}),
                "required": info.get("required_params", [])
            }
        })

    system_prompt = f"""You are a tool selection agent. Based on the question and execution plan, select appropriate tools to use.

Execute the plan step by step. Call the necessary tools with correct parameters extracted from the question.

Plan:
{plan}"""

    user_prompt = f"""Question: {question}

Select and call the tools needed to answer this question according to the plan."""

    logger.info(f"[select_tools] Calling LLM with function calling for {len(tool_schemas)} tools")

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
        tools=tool_schemas
    )

    # Extract tool calls from response
    tool_calls = []
    for content_block in response.content:
        if content_block.type == "tool_use":
            tool_calls.append({
                "tool": content_block.name,
                "params": content_block.input,
                "id": content_block.id
            })

    logger.info(f"[select_tools] LLM selected {len(tool_calls)} tool(s)")

    return tool_calls


# ============================================================================
# Answer Synthesis Functions
# ============================================================================

def synthesize_answer(
    question: str,
    evidence: List[str]
) -> str:
    """
    Synthesize factoid answer from collected evidence using LLM.

    LLM must:
    - Extract key information from evidence
    - Resolve any conflicts between sources
    - Format as factoid (number, few words, or comma-separated list)
    - Return concise answer matching GAIA format requirements

    Args:
        question: Original GAIA question
        evidence: List of evidence strings from tool executions

    Returns:
        Factoid answer string
    """
    client = create_client()

    # Format evidence
    evidence_text = "\n\n".join([f"Evidence {i+1}:\n{e}" for i, e in enumerate(evidence)])

    system_prompt = """You are an answer synthesis agent for the GAIA benchmark.

Your task is to extract a factoid answer from the provided evidence.

CRITICAL - Answer format requirements:
1. Answers must be factoids: a number, a few words, or a comma-separated list
2. Be concise - no explanations, just the answer
3. If evidence conflicts, evaluate source credibility and recency
4. If evidence is insufficient, state "Unable to answer"

Examples of good factoid answers:
- "42"
- "Paris"
- "Albert Einstein"
- "red, blue, green"
- "1969-07-20"

Examples of bad answers (too verbose):
- "The answer is 42 because..."
- "Based on the evidence, it appears that..."
"""

    user_prompt = f"""Question: {question}

{evidence_text}

Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""

    logger.info(f"[synthesize_answer] Calling LLM for answer synthesis from {len(evidence)} evidence items")

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=256,  # Factoid answers are short
        temperature=TEMPERATURE,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

    answer = response.content[0].text.strip()
    logger.info(f"[synthesize_answer] Generated answer: {answer}")

    return answer


# ============================================================================
# Conflict Resolution Functions
# ============================================================================

def resolve_conflicts(evidence: List[str]) -> Dict[str, Any]:
    """
    Detect and resolve conflicts in evidence using LLM reasoning.

    Optional function for advanced conflict handling.
    Currently integrated into synthesize_answer().

    Args:
        evidence: List of evidence strings that may conflict

    Returns:
        Dictionary with:
        - has_conflicts: bool
        - conflicts: List of identified conflicts
        - resolution: Recommended resolution strategy
    """
    client = create_client()

    evidence_text = "\n\n".join([f"Evidence {i+1}:\n{e}" for i, e in enumerate(evidence)])

    system_prompt = """You are a conflict detection agent.

Analyze the provided evidence and identify any contradictions or conflicts.

Evaluate:
1. Are there contradictory facts?
2. Which sources are more credible?
3. Which information is more recent?
4. How should conflicts be resolved?"""

    user_prompt = f"""Analyze this evidence for conflicts:

{evidence_text}

Respond in JSON format:
{{
    "has_conflicts": true/false,
    "conflicts": ["description of conflict 1", ...],
    "resolution": "recommended resolution strategy"
}}"""

    logger.info(f"[resolve_conflicts] Analyzing {len(evidence)} evidence items for conflicts")

    response = client.messages.create(
        model=MODEL_NAME,
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}]
    )

    # For MVP, return simple structure
    # In production, would parse JSON from response
    result = {
        "has_conflicts": False,
        "conflicts": [],
        "resolution": response.content[0].text
    }

    logger.info("[resolve_conflicts] Analysis complete")

    return result
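resolve_conflicts currently returns a fixed structure and notes that production code would parse the JSON the prompt requests. One defensive way to do that parsing could look like the sketch below (the `parse_conflict_json` helper and the sample `raw` string are hypothetical, not part of the module):

```python
import json
from typing import Any, Dict

def parse_conflict_json(raw: str) -> Dict[str, Any]:
    """Parse the model's JSON reply, falling back to a safe default."""
    default = {"has_conflicts": False, "conflicts": [], "resolution": raw}
    try:
        # Models sometimes wrap JSON in prose; grab the outermost braces.
        start, end = raw.index("{"), raw.rindex("}") + 1
        data = json.loads(raw[start:end])
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return default
    return {
        "has_conflicts": bool(data.get("has_conflicts", False)),
        "conflicts": list(data.get("conflicts", [])),
        "resolution": str(data.get("resolution", "")),
    }

raw = 'Analysis: {"has_conflicts": true, "conflicts": ["dates differ"], "resolution": "prefer newer source"}'
parsed = parse_conflict_json(raw)
print(parsed["has_conflicts"])  # True
```

Falling back to a default rather than raising keeps the answer pipeline alive when the model ignores the requested format.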
test/test_llm_integration.py (new file)
@@ -0,0 +1,287 @@
"""
LLM Integration Tests - Stage 3 Validation
Author: @mangobee
Date: 2026-01-02

Tests for Stage 3 LLM integration:
- Planning with LLM
- Tool selection via function calling
- Answer synthesis from evidence
- Full workflow with mocked LLM responses
"""

import pytest
from unittest.mock import patch, MagicMock
from src.agent.llm_client import (
    plan_question,
    select_tools_with_function_calling,
    synthesize_answer
)
from src.tools import TOOLS


class TestPlanningFunction:
    """Test LLM-based planning function."""

    @patch('src.agent.llm_client.Anthropic')
    def test_plan_question_basic(self, mock_anthropic):
        """Test planning with simple question."""
        # Mock LLM response
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="1. Search for information\n2. Analyze results")]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test planning
        plan = plan_question(
            question="What is the capital of France?",
            available_tools=TOOLS
        )

        assert isinstance(plan, str)
        assert len(plan) > 0
        print(f"✓ Generated plan: {plan[:50]}...")

    @patch('src.agent.llm_client.Anthropic')
    def test_plan_with_files(self, mock_anthropic):
        """Test planning with file context."""
        # Mock LLM response
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="1. Parse file\n2. Extract data\n3. Calculate answer")]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test planning with files
        plan = plan_question(
            question="What is the total in the spreadsheet?",
            available_tools=TOOLS,
            file_paths=["data.xlsx"]
        )

        assert isinstance(plan, str)
        assert len(plan) > 0
        print(f"✓ Generated plan with files: {plan[:50]}...")


class TestToolSelection:
    """Test LLM function calling for tool selection."""

    @patch('src.agent.llm_client.Anthropic')
    def test_select_single_tool(self, mock_anthropic):
        """Test selecting single tool with parameters."""
        # Mock LLM response with function call
        mock_client = MagicMock()
        mock_response = MagicMock()

        # Mock tool_use content block
        mock_tool_use = MagicMock()
        mock_tool_use.type = "tool_use"
        mock_tool_use.name = "search"
        mock_tool_use.input = {"query": "capital of France"}
        mock_tool_use.id = "call_001"

        mock_response.content = [mock_tool_use]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test tool selection
        tool_calls = select_tools_with_function_calling(
            question="What is the capital of France?",
            plan="1. Search for capital of France",
            available_tools=TOOLS
        )

        assert isinstance(tool_calls, list)
        assert len(tool_calls) == 1
        assert tool_calls[0]["tool"] == "search"
        assert "query" in tool_calls[0]["params"]
        print(f"✓ Selected tool: {tool_calls[0]}")

    @patch('src.agent.llm_client.Anthropic')
    def test_select_multiple_tools(self, mock_anthropic):
        """Test selecting multiple tools in sequence."""
        # Mock LLM response with multiple function calls
        mock_client = MagicMock()
        mock_response = MagicMock()

        # Mock multiple tool_use blocks
        mock_tool1 = MagicMock()
        mock_tool1.type = "tool_use"
        mock_tool1.name = "parse_file"
        mock_tool1.input = {"file_path": "data.xlsx"}
        mock_tool1.id = "call_001"

        mock_tool2 = MagicMock()
        mock_tool2.type = "tool_use"
        mock_tool2.name = "safe_eval"
        mock_tool2.input = {"expression": "sum(values)"}
        mock_tool2.id = "call_002"

        mock_response.content = [mock_tool1, mock_tool2]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test tool selection
        tool_calls = select_tools_with_function_calling(
            question="What is the sum in data.xlsx?",
            plan="1. Parse file\n2. Calculate sum",
            available_tools=TOOLS
        )

        assert isinstance(tool_calls, list)
        assert len(tool_calls) == 2
        assert tool_calls[0]["tool"] == "parse_file"
        assert tool_calls[1]["tool"] == "safe_eval"
        print(f"✓ Selected {len(tool_calls)} tools")


class TestAnswerSynthesis:
    """Test LLM-based answer synthesis."""

    @patch('src.agent.llm_client.Anthropic')
    def test_synthesize_simple_answer(self, mock_anthropic):
        """Test synthesizing answer from single evidence."""
        # Mock LLM response
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="Paris")]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test answer synthesis
        answer = synthesize_answer(
            question="What is the capital of France?",
            evidence=["[search] Paris is the capital and most populous city of France"]
        )

        assert isinstance(answer, str)
        assert len(answer) > 0
        assert answer == "Paris"
        print(f"✓ Synthesized answer: {answer}")

    @patch('src.agent.llm_client.Anthropic')
    def test_synthesize_from_multiple_evidence(self, mock_anthropic):
        """Test synthesizing answer from multiple evidence sources."""
        # Mock LLM response
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="42")]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test answer synthesis with multiple evidence
        answer = synthesize_answer(
            question="What is the answer?",
            evidence=[
                "[search] The answer to life is 42",
                "[safe_eval] 6 * 7 = 42",
                "[parse_file] Result: 42"
            ]
        )

        assert isinstance(answer, str)
        assert answer == "42"
        print(f"✓ Synthesized answer from 3 evidence items: {answer}")

    @patch('src.agent.llm_client.Anthropic')
    def test_synthesize_with_conflicts(self, mock_anthropic):
        """Test synthesizing answer when evidence conflicts."""
        # Mock LLM response - should resolve conflict
        mock_client = MagicMock()
        mock_response = MagicMock()
        mock_response.content = [MagicMock(text="Paris")]
        mock_client.messages.create.return_value = mock_response
        mock_anthropic.return_value = mock_client

        # Test answer synthesis with conflicting evidence
        answer = synthesize_answer(
            question="What is the capital of France?",
            evidence=[
                "[search] Paris is the capital of France (source: Wikipedia, 2024)",
                "[search] Lyon was briefly capital during revolution (source: old text, 1793)"
            ]
        )

        assert isinstance(answer, str)
        assert answer == "Paris"  # Should pick more recent/credible source
        print(f"✓ Resolved conflict, answer: {answer}")


class TestEndToEndWorkflow:
    """Test full agent workflow with mocked LLM."""

    @patch('src.agent.llm_client.Anthropic')
    @patch('src.tools.web_search.tavily_search')
    def test_full_search_workflow(self, mock_tavily, mock_anthropic):
        """Test complete workflow: plan → search → answer."""
        from src.agent import GAIAAgent

        # Mock tool execution
        mock_tavily.return_value = "Paris is the capital and most populous city of France"

        # Mock LLM responses:
        # 1: planning, 2: tool selection (function calling), 3: answer synthesis
        mock_client = MagicMock()

        mock_plan_response = MagicMock()
        mock_plan_response.content = [MagicMock(text="1. Search for capital of France")]

        mock_tool_response = MagicMock()
        mock_tool_use = MagicMock()
        mock_tool_use.type = "tool_use"
        mock_tool_use.name = "search"
        mock_tool_use.input = {"query": "capital of France"}
        mock_tool_use.id = "call_001"
        mock_tool_response.content = [mock_tool_use]

        mock_answer_response = MagicMock()
        mock_answer_response.content = [MagicMock(text="Paris")]

        # Set up mock to return different responses for each call
        mock_client.messages.create.side_effect = [
            mock_plan_response,
            mock_tool_response,
            mock_answer_response
        ]

        mock_anthropic.return_value = mock_client

        # Test full workflow
        agent = GAIAAgent()
        answer = agent("What is the capital of France?")

        assert isinstance(answer, str)
        assert answer == "Paris"
        print(f"✓ Full workflow completed, answer: {answer}")


if __name__ == "__main__":
    print("\n" + "=" * 70)
    print("GAIA Agent - Stage 3 LLM Integration Tests")
    print("=" * 70 + "\n")

    # Run tests manually for quick validation
    test_plan = TestPlanningFunction()
    test_plan.test_plan_question_basic()
    test_plan.test_plan_with_files()

    test_tools = TestToolSelection()
    test_tools.test_select_single_tool()
    test_tools.test_select_multiple_tools()

    test_answer = TestAnswerSynthesis()
    test_answer.test_synthesize_simple_answer()
    test_answer.test_synthesize_from_multiple_evidence()
    test_answer.test_synthesize_with_conflicts()

    test_e2e = TestEndToEndWorkflow()
    test_e2e.test_full_search_workflow()
|
| 285 |
+
print("\n" + "="*70)
|
| 286 |
+
print("✓ All Stage 3 LLM integration tests passed!")
|
| 287 |
+
print("="*70 + "\n")
|
|
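The key mocking pattern in the workflow test above is assigning a list to `side_effect`, which makes consecutive calls to the mocked `messages.create` return the plan, tool-selection, and answer responses in order. A minimal standalone sketch of that `unittest.mock` behavior (generic values, not the repo's actual response objects):

```python
from unittest.mock import MagicMock

client = MagicMock()
# With a list side_effect, each call returns the next item in sequence;
# a call after the list is exhausted raises StopIteration.
client.messages.create.side_effect = ["plan", "tool_call", "answer"]

assert client.messages.create() == "plan"
assert client.messages.create() == "tool_call"
assert client.messages.create() == "answer"
```

This is why the number of `side_effect` entries must exactly match the number of LLM calls the agent makes; an extra, unmocked call fails loudly instead of silently reusing a stale response.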
@@ -0,0 +1,77 @@
"""
Stage 3 End-to-End Test with Real LLM API
Author: @mangobee
Date: 2026-01-02

Manual test for Stage 3 workflow with actual Claude API.
Requires ANTHROPIC_API_KEY environment variable.

Usage:
    ANTHROPIC_API_KEY=your_key uv run python test/test_stage3_e2e.py
"""

import os
import sys
from src.agent import GAIAAgent

print("\n" + "="*70)
print("Stage 3: End-to-End Test with Real LLM API")
print("="*70 + "\n")

# Check API key
api_key = os.getenv("ANTHROPIC_API_KEY")
if not api_key:
    print("✗ ANTHROPIC_API_KEY not set. Skipping real API test.")
    print("\nTo run this test:")
    print("  export ANTHROPIC_API_KEY=your_key")
    print("  uv run python test/test_stage3_e2e.py")
    sys.exit(0)

print("✓ ANTHROPIC_API_KEY found\n")

# Test questions
test_questions = [
    {
        "question": "What is 25 * 17?",
        "expected_answer": "425",
        "description": "Simple math (should use calculator)"
    },
    {
        "question": "What is the capital of Japan?",
        "expected_answer": "Tokyo",
        "description": "Factual knowledge (should use search)"
    }
]

print("Testing GAIA Agent with real questions...\n")

for i, test in enumerate(test_questions, 1):
    print(f"Test {i}: {test['description']}")
    print(f"  Question: {test['question']}")
    print(f"  Expected: {test['expected_answer']}")

    try:
        agent = GAIAAgent()
        answer = agent(test['question'])

        print(f"  Answer: {answer}")

        # Check if answer is reasonable
        if test['expected_answer'].lower() in answer.lower():
            print("  Status: ✓ PASS - Answer contains expected value")
        else:
            print("  Status: ⚠ PARTIAL - Answer may be correct but differs from expected")

    except Exception as e:
        print(f"  Status: ✗ FAIL - Error: {e}")

    print()

print("="*70)
print("✓ Stage 3 E2E test complete!")
print("="*70 + "\n")

print("Next steps:")
print("1. Review answers above")
print("2. If successful, deploy to HuggingFace Spaces")
print("3. Test on full GAIA validation set")