mangubee Claude Sonnet 4.5 committed on
Commit 4c0b7eb · 1 Parent(s): 87de1a7

Stage 3: Core Logic Implementation - LLM Integration


Implemented Stage 3 core agent logic with full LLM integration:

**New Features:**
- LLM-based planning: Analyzes questions and generates execution plans
- Dynamic tool selection: Claude function calling for tool selection
- Parameter extraction: LLM extracts tool parameters from questions
- Answer synthesis: LLM generates factoid answers from evidence
- Conflict resolution: LLM evaluates contradictory information

**New Files:**
- src/agent/llm_client.py - Centralized LLM client
- test/test_llm_integration.py - 8 new LLM integration tests
- test/test_stage3_e2e.py - Manual E2E test script

**Modified Files:**
- src/agent/graph.py - Implemented all node logic (plan/execute/answer)
- AgentState schema: Added file_paths, tool_results, evidence fields

**Framework Updates:**
- Updated dev records (04-07) with new framework parameters
- Added missing parameters from framework v2026-01-02

**Test Results:**
- All 99 tests passing (6 Stage 1 + 85 Stage 2 + 8 Stage 3)
- No regressions from previous stages

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
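The AgentState schema changes mentioned above can be sketched as a TypedDict; the field names match the `src/agent/graph.py` diff in this commit, while the sample values are illustrative:

```python
from typing import TypedDict, List, Optional

class AgentState(TypedDict):
    question: str                    # Input question from GAIA
    file_paths: Optional[List[str]]  # Optional file paths for file-based questions
    plan: Optional[str]              # Generated execution plan
    tool_calls: List[dict]           # Tool invocation tracking
    tool_results: List[dict]         # Tool execution results
    evidence: List[str]              # Evidence collected from tools
    answer: Optional[str]            # Final factoid answer
    errors: List[str]                # Error messages from failures

# Illustrative initial state, mirroring how GAIAAgent initializes it
state: AgentState = {
    "question": "What's the population of Tokyo?",
    "file_paths": None,
    "plan": None,
    "tool_calls": [],
    "tool_results": [],
    "evidence": [],
    "answer": None,
    "errors": [],
}
```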

CHANGELOG.md CHANGED
@@ -1,22 +1,51 @@
  # Session Changelog
 
- **Session Date:** [YYYY-MM-DD]
- **Dev Record:** [link to dev/dev_YYMMDD_##_concise_title.md]
+ **Session Date:** 2026-01-02
+ **Dev Record:** dev/dev_260102_14_stage3_core_logic.md
 
  ## Changes Made
 
  ### Created Files
 
- - [file path] - [Purpose/description]
+ - `src/agent/llm_client.py` - Centralized LLM client for planning, tool selection, and answer synthesis
+ - `test/test_llm_integration.py` - 8 tests for LLM integration (planning, tool selection, answer synthesis)
+ - `test/test_stage3_e2e.py` - Manual E2E test script for real API testing
 
  ### Modified Files
 
- - [file path] - [What was changed]
+ - `src/agent/graph.py` - Updated AgentState schema, implemented Stage 3 logic in all nodes (plan/execute/answer)
+ - `PLAN.md` - Created implementation plan for Stage 3
+ - `TODO.md` - Created task tracking list for Stage 3
+ - `requirements.txt` - Already includes anthropic>=0.39.0
 
  ### Deleted Files
 
- - [file path] - [Reason for deletion]
+ - None
 
  ## Notes
 
- [Any additional context about the session's work]
+ Stage 3 Core Logic Implementation:
+
+ **State Schema Updates:**
+
+ - Added new state fields: file_paths, tool_results, evidence
+
+ **Node Implementations:**
+
+ - plan_node: LLM-based planning with dynamic tool selection
+ - execute_node: LLM function calling for tool selection and parameter extraction
+ - answer_node: LLM-based answer synthesis with conflict resolution
+
+ **LLM Integration:**
+
+ - All three nodes now use Claude Sonnet 4.5 for dynamic decision-making
+ - Centralized LLM client in src/agent/llm_client.py
+ - Functions: plan_question, select_tools_with_function_calling, synthesize_answer
+
+ **Testing:**
+
+ - Added 8 new Stage 3 tests (test_llm_integration.py)
+ - All 99 tests passing (Stage 1: 6, Stage 2: 85, Stage 3: 8)
+ - Created manual E2E test script for real API testing
+
+ Next steps: Deploy to HuggingFace Spaces and verify with actual GAIA questions
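The answer-synthesis step described in the notes above can be illustrated with a minimal prompt builder; `build_synthesis_prompt` is a hypothetical helper written for this sketch, not the actual function in `src/agent/llm_client.py`:

```python
from typing import List

def build_synthesis_prompt(question: str, evidence: List[str]) -> str:
    # Hypothetical helper: formats collected evidence for an answer-synthesis call.
    evidence_block = "\n".join(f"- {item}" for item in evidence)
    return (
        f"Question: {question}\n\n"
        f"Evidence:\n{evidence_block}\n\n"
        "Respond with a concise factoid (a number, a few words, "
        "or a comma-separated list), per GAIA format requirements."
    )

prompt = build_synthesis_prompt(
    "What's the population of Tokyo?",
    ["[search] Tokyo population approximately 14 million (2023)"],
)
```

The resulting string would form the user message of the synthesis call, with the factoid-format instruction carried in the system prompt or appended as here.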
PLAN.md CHANGED
@@ -1,21 +1,191 @@
- # Implementation Plan
+ # Implementation Plan - Stage 3: Core Logic Implementation
 
- **Date:** [YYYY-MM-DD]
- **Dev Record:** [link to dev/dev_YYMMDD_##_concise_title.md]
- **Status:** [Planning | In Progress | Completed]
+ **Date:** 2026-01-02
+ **Dev Record:** dev/dev_260102_14_stage3_core_logic.md
+ **Status:** Planning
 
  ## Objective
 
- [Clear goal statement]
+ Implement Stage 3 core agent logic: LLM-based tool selection, parameter extraction, answer synthesis, and conflict resolution to complete the GAIA benchmark agent MVP.
 
  ## Steps
 
- [Implementation steps]
+ ### 1. Update Agent State Schema
+
+ **File:** `src/agent/state.py`
+
+ **Changes:**
+
+ - Add `plan: str` field for execution plan from planning node
+ - Add `tool_calls: List[Dict]` field for tracking tool invocations
+ - Add `tool_results: List[Dict]` field for storing tool outputs
+ - Add `evidence: List[str]` field for collecting information from tools
+ - Add `conflicts: List[Dict]` field for tracking conflicting information (optional)
+
+ ### 2. Implement Planning Node Logic
+
+ **File:** `src/agent/graph.py` - Update `plan_node` function
+
+ **Current:** Placeholder that sets plan to "Stage 1 complete"
+
+ **New logic:**
+
+ - Accept `question` and `file_paths` from state
+ - Use LLM to analyze question and determine required tools
+ - Generate step-by-step execution plan
+ - Identify which tools to use and what parameters to extract
+ - Update state with execution plan
+ - Return updated state
+
+ **LLM Prompt Strategy:**
+
+ - System: "You are a planning agent. Analyze the question and create an execution plan."
+ - User: Provide question, available tools (from TOOLS registry), file information
+ - Expected output: Structured plan with tool selection reasoning
+
+ ### 3. Implement Execute Node Logic
+
+ **File:** `src/agent/graph.py` - Update `execute_node` function
+
+ **Current:** Reports "Stage 2 complete: 4 tools ready"
+
+ **New logic:**
+
+ - Use LLM function calling to dynamically select tools
+ - Extract parameters from question using LLM
+ - Execute selected tools sequentially based on plan
+ - Collect results in `tool_results` field
+ - Extract evidence from each tool result
+ - Handle tool failures with retry logic (already in tools)
+ - Update state with tool results and evidence
+ - Return updated state
+
+ **LLM Function Calling Strategy:**
+
+ - Define tool schemas for Claude function calling
+ - Let LLM decide which tools to invoke based on question
+ - LLM extracts parameters from question automatically
+ - Execute tool calls and collect results
+
+ ### 4. Implement Answer Node Logic
+
+ **File:** `src/agent/graph.py` - Update `answer_node` function
+
+ **Current:** Placeholder that returns "This is a placeholder answer"
+
+ **New logic:**
+
+ - Accept evidence from execute node
+ - Use LLM to synthesize factoid answer from evidence
+ - Detect and resolve conflicts in evidence (LLM-based reasoning)
+ - Format answer according to GAIA requirements (factoid: number/few words/comma-separated)
+ - Update state with final answer
+ - Return updated state
+
+ **LLM Answer Synthesis Strategy:**
+
+ - System: "You are an answer synthesizer. Extract factoid answer from evidence."
+ - User: Provide all evidence, specify factoid format requirements
+ - Conflict resolution: If evidence conflicts, LLM evaluates source credibility/recency
+ - Expected output: Concise factoid answer
+
+ ### 5. Configure LLM Client
+
+ **File:** `src/agent/llm_client.py` (NEW)
+
+ **Purpose:** Centralized LLM interaction for all nodes
+
+ **Functions:**
+
+ - `create_client()` - Initialize Anthropic client
+ - `plan_question(question, tools, files)` - Call LLM for planning
+ - `select_and_execute_tools(question, plan, tools)` - Function calling for tool selection
+ - `synthesize_answer(question, evidence)` - Call LLM for answer synthesis
+ - `resolve_conflicts(evidence)` - Call LLM for conflict resolution (optional)
+
+ **Configuration:**
+
+ - Use Claude Sonnet 4.5 (as per Level 5 decision)
+ - API key from environment variable
+ - Temperature: 0 for deterministic answers
+ - Max tokens: 4096 for reasoning
+
+ ### 6. Update Test Suite
+
+ **Files:**
+
+ - `test/test_agent.py` - Update agent tests
+ - `test/test_llm_integration.py` (NEW) - Test LLM interactions with mocks
+
+ **Test cases:**
+
+ - Test planning node generates valid execution plan
+ - Test execute node calls correct tools with correct parameters
+ - Test answer node synthesizes factoid answer
+ - Test conflict resolution logic
+ - Test end-to-end agent workflow with mock LLM responses
+ - Test error handling (tool failures, LLM timeouts)
+
+ ### 7. Update Requirements
+
+ **File:** `requirements.txt`
+
+ **Add:**
+
+ - `anthropic>=0.40.0` - Claude API client
+
+ ### 8. Deploy and Verify
+
+ **Actions:**
+
+ - Commit and push to HuggingFace Spaces
+ - Verify build succeeds
+ - Test agent with sample GAIA questions
+ - Verify output format matches GAIA requirements
 
  ## Files to Modify
 
- [List of files]
+ 1. `src/agent/state.py` - Expand state schema for Stage 3
+ 2. `src/agent/graph.py` - Implement plan/execute/answer node logic
+ 3. `src/agent/llm_client.py` - NEW - Centralized LLM client
+ 4. `test/test_agent.py` - Update tests for Stage 3
+ 5. `test/test_llm_integration.py` - NEW - LLM integration tests
+ 6. `requirements.txt` - Add anthropic library
+ 7. `pyproject.toml` - Install anthropic via uv
 
  ## Success Criteria
 
- [Completion criteria]
+ - [ ] Planning node analyzes question and generates execution plan using LLM
+ - [ ] Execute node dynamically selects tools using LLM function calling
+ - [ ] Execute node extracts parameters from questions automatically
+ - [ ] Execute node executes tools and collects evidence
+ - [ ] Answer node synthesizes factoid answer from evidence
+ - [ ] Conflict resolution handles contradictory information
+ - [ ] All Stage 1 + Stage 2 tests still pass (97 tests)
+ - [ ] New Stage 3 tests pass (minimum 10 new tests)
+ - [ ] Agent successfully answers sample GAIA questions end-to-end
+ - [ ] Output format matches GAIA factoid requirements
+ - [ ] Deployment to HuggingFace Spaces succeeds
+
+ ## Design Alignment
+
+ **Level 3:** Dynamic planning with sequential execution ✓
+ **Level 4:** Goal-based reasoning, termination after answer_node ✓
+ **Level 5:** LLM-generated answer synthesis, LLM-based conflict resolution ✓
+ **Level 6:** LLM function calling for tool selection, LLM-based parameter extraction ✓
+
+ ## Stage 3 Scope
+
+ **In scope:**
+
+ - LLM-based planning, tool selection, parameter extraction
+ - Answer synthesis and conflict resolution
+ - End-to-end question answering workflow
+ - GAIA factoid format compliance
+
+ **Out of scope (future enhancements):**
+
+ - Reflection/ReAct patterns (mentioned in Level 3 dev record)
+ - Multi-turn refinement
+ - Self-critique loops
+ - Advanced optimization (caching, streaming)
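The "define tool schemas for Claude function calling" step in the plan above can be sketched as follows. The schema follows the shape the Anthropic Messages API expects for the `tools` parameter (a name, description, and JSON Schema `input_schema`); the specific parameter names here are illustrative assumptions, not the repo's actual schemas:

```python
# Illustrative tool schema for Claude function calling.
# The "query" parameter name is an assumption for this sketch.
search_tool = {
    "name": "search",
    "description": "Search the web for information needed to answer the question.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query extracted from the question",
            }
        },
        "required": ["query"],
    },
}

# The schema would be passed via the `tools` parameter of a Messages API call, e.g.:
# client.messages.create(model=..., max_tokens=4096, tools=[search_tool], messages=[...])
```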
dev/dev_260101_04_level3_task_workflow_design.md CHANGED
@@ -14,23 +14,34 @@ Applied Level 3 Task & Workflow Design parameters from AI Agent System Design Fr
  ## Key Decisions
 
  **Parameter 1: Task Decomposition → Dynamic planning**
+
  - **Reasoning:** GAIA questions vary widely in complexity and required tool combinations
  - **Evidence:** Cannot use static pipeline - each question requires analyzing intent, then planning multi-step approach dynamically
  - **Implication:** Agent must generate execution plan per question based on question analysis
 
  **Parameter 2: Workflow Pattern → Sequential**
+
  - **Reasoning:** Agent follows linear reasoning chain with dependencies between steps
  - **Execution flow:** (1) Parse question → (2) Plan approach → (3) Execute tool calls → (4) Synthesize factoid answer
  - **Evidence:** Each step depends on previous step's output - no parallel execution needed
  - **Implication:** Sequential workflow pattern fits question-answering nature (vs routing/orchestrator-worker for multi-agent)
 
+ **Parameter 3: Task Prioritization → N/A (single task processing)**
+
+ - **Reasoning:** GAIA benchmark processes one question at a time in zero-shot evaluation
+ - **Evidence:** No multi-task scheduling required - agent answers one question per invocation
+ - **Implication:** No task queue, priority system, or LLM-based scheduling needed
+ - **Alignment:** Matches zero-shot stateless design (Level 1, Level 5)
+
  **Rejected alternatives:**
+
  - Static pipeline: Cannot handle diverse GAIA question types requiring different tool combinations
  - Reactive decomposition: Less efficient than planning upfront for factoid question-answering
  - Parallel workflow: GAIA reasoning chains have linear dependencies
  - Routing pattern: Inappropriate for single-agent architecture (Level 2 decision)
 
  **Future experimentation:**
+
  - **Reflection pattern:** Self-critique and refinement loops for improved answer quality
  - **ReAct pattern:** Reasoning-Action interleaving for more adaptive execution
  - **Current MVP:** Sequential + Dynamic planning for baseline performance

@@ -40,9 +51,11 @@ Applied Level 3 Task & Workflow Design parameters from AI Agent System Design Fr
  Established MVP workflow architecture: Dynamic planning with sequential execution. Agent analyzes each question, generates step-by-step plan, executes tools sequentially, synthesizes factoid answer.
 
  **Deliverables:**
+
  - `dev/dev_260101_04_level3_task_workflow_design.md` - Level 3 workflow design decisions
 
  **Workflow Specifications:**
+
  - **Task Decomposition:** Dynamic planning per question
  - **Execution Pattern:** Sequential reasoning chain
  - **Future Enhancement:** Reflection/ReAct patterns for advanced iterations

@@ -58,6 +71,7 @@ Established MVP workflow architecture: Dynamic planning with sequential executio
  ## Changelog
 
  **What was changed:**
+
  - Created `dev/dev_260101_04_level3_task_workflow_design.md` - Level 3 task & workflow design decisions
  - Referenced AI Agent System Design Framework (2026-01-01).pdf Level 3 parameters
  - Documented future experimentation plans (Reflection/ReAct patterns)
dev/dev_260101_05_level4_agent_level_design.md CHANGED
@@ -35,6 +35,13 @@ Applied Level 4 Agent-Level Design parameters from AI Agent System Design Framew
  - **Reasoning:** Single-agent architecture eliminates need for inter-agent communication
  - **Implication:** No message passing, shared state, or event-driven protocols required
 
+ **Parameter 5: Termination Logic → Fixed steps (3-node workflow)**
+ - **Reasoning:** Sequential workflow (Level 3) defines clear termination point after answer_node
+ - **Execution flow:** plan_node → execute_node → answer_node → END
+ - **Evidence:** 3-node LangGraph workflow terminates after final answer synthesis
+ - **Implication:** No LLM-based completion detection needed - workflow structure defines termination
+ - **Alignment:** Matches sequential workflow pattern (Level 3)
+
  **Rejected alternatives:**
  - Fine-grained specialists: Would require multi-agent architecture, rejected in Level 2
  - Simple Reflex agent: Insufficient reasoning capability for multi-step GAIA questions
dev/dev_260101_06_level5_component_selection.md CHANGED
@@ -57,6 +57,20 @@ Applied Level 5 Component Selection parameters from AI Agent System Design Frame
  - **Minimal constraints:** No heavy content filtering for MVP (learning context)
  - **Safety focus:** Format compliance and execution safety, not content policy enforcement
 
+ **Parameter 5: Answer Synthesis → LLM-generated (Stage 3 implementation)**
+ - **Reasoning:** GAIA requires extracting factoid answers from multi-source evidence
+ - **Evidence:** Answers must synthesize information from web searches, code outputs, file contents
+ - **Implication:** LLM must reason about evidence and generate final answer (not template-based)
+ - **Stage alignment:** Core logic implementation in Stage 3 (beyond MVP tool integration)
+ - **Capability requirement:** LLM must distill complex evidence into concise factoid format
+
+ **Parameter 6: Conflict Resolution → LLM-based reasoning (Stage 3 implementation)**
+ - **Reasoning:** Multi-source evidence may contain conflicting information requiring judgment
+ - **Example scenarios:** Conflicting search results, outdated vs current information, contradictory sources
+ - **Implication:** LLM must evaluate source credibility and recency to resolve conflicts
+ - **Stage alignment:** Decision logic in Stage 3 (not needed for Stage 2 tool integration)
+ - **Alternative rejected:** Latest wins / Source priority too simplistic for GAIA evidence evaluation
+
  **Rejected alternatives:**
 
  - Vector stores/RAG: Unnecessary for stateless question-answering
dev/dev_260101_07_level6_implementation_framework.md CHANGED
@@ -52,6 +52,20 @@ Applied Level 6 Implementation Framework parameters from AI Agent System Design
  - Easy testing and tool swapping
  - **Implementation:** MCP server for tools (web/code/file/vision) + function calling interface
 
+ **Parameter 5: Tool Selection Mechanism → LLM function calling (Stage 3 implementation)**
+ - **Reasoning:** Dynamic tool selection required for diverse GAIA question types
+ - **Evidence:** Questions require different tool combinations - LLM must reason about which tools to invoke
+ - **Implementation:** Claude function calling enables LLM to select appropriate tools based on question analysis
+ - **Stage alignment:** Core decision logic in Stage 3 (beyond MVP tool integration)
+ - **Alternative rejected:** Static routing insufficient - cannot predetermine tool sequences for all GAIA questions
+
+ **Parameter 6: Parameter Extraction → LLM-based parsing (Stage 3 implementation)**
+ - **Reasoning:** Tool parameters must be extracted from natural language questions
+ - **Example:** Question "What's the population of Tokyo?" → extract "Tokyo" as location parameter for search tool
+ - **Implementation:** LLM interprets question and generates appropriate tool parameters
+ - **Stage alignment:** Decision logic in Stage 3 (LLM reasoning about parameter values)
+ - **Alternative rejected:** Structured input not applicable - GAIA provides natural language questions, not structured data
+
  **Rejected alternatives:**
  - Database-backed state: Violates stateless design, adds complexity
  - Distributed cache: Unnecessary for single-instance deployment
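The parameter-extraction decision above rests on Claude returning `tool_use` content blocks. A minimal sketch of converting such blocks into the agent's tool-call format, using a simulated response shape rather than a live API call:

```python
# Simulated shape of Claude response content blocks (no live API call here);
# real responses carry the same type/name/input fields on tool_use blocks.
response_content = [
    {"type": "text", "text": "I will search for the population."},
    {"type": "tool_use", "name": "search", "input": {"query": "population of Tokyo"}},
]

# Convert tool_use blocks into the {"tool": ..., "params": ...} records
# that execute_node consumes.
tool_calls = [
    {"tool": block["name"], "params": block["input"]}
    for block in response_content
    if block.get("type") == "tool_use"
]
```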
src/agent/graph.py CHANGED
@@ -17,7 +17,8 @@ import logging
17
  from typing import TypedDict, List, Optional
18
  from langgraph.graph import StateGraph, END
19
  from src.config import Settings
20
- from src.tools import TOOLS
 
21
 
22
  # ============================================================================
23
  # Logging Setup
@@ -35,8 +36,11 @@ class AgentState(TypedDict):
35
  Tracks question processing from input through planning, execution, to final answer.
36
  """
37
  question: str # Input question from GAIA
 
38
  plan: Optional[str] # Generated execution plan (Stage 3)
39
- tool_calls: List[dict] # Tool execution history (Stage 2)
 
 
40
  answer: Optional[str] # Final factoid answer
41
  errors: List[str] # Error messages from failures
42
 
@@ -49,8 +53,10 @@ def plan_node(state: AgentState) -> AgentState:
49
  """
50
  Planning node: Analyze question and generate execution plan.
51
 
52
- Stage 2: Basic tool listing
53
  Stage 3: Dynamic planning with LLM
 
 
 
54
 
55
  Args:
56
  state: Current agent state with question
@@ -60,11 +66,21 @@ def plan_node(state: AgentState) -> AgentState:
60
  """
61
  logger.info(f"[plan_node] Question received: {state['question'][:100]}...")
62
 
63
- # Stage 2: List available tools (dynamic planning in Stage 3)
64
- tool_summary = ", ".join(TOOLS.keys())
65
- state["plan"] = f"Stage 2: {len(TOOLS)} tools available ({tool_summary}). Dynamic planning in Stage 3."
 
 
 
 
66
 
67
- logger.info(f"[plan_node] Plan created: {state['plan']}")
 
 
 
 
 
 
68
 
69
  return state
70
 
@@ -73,34 +89,90 @@ def execute_node(state: AgentState) -> AgentState:
73
  """
74
  Execution node: Execute tools based on plan.
75
 
76
- Stage 2: Tool execution with error handling
77
- Stage 3: Dynamic tool selection based on plan
 
 
 
78
 
79
  Args:
80
  state: Current agent state with plan
81
 
82
  Returns:
83
- Updated state with tool execution results
84
  """
85
  logger.info(f"[execute_node] Executing tools - Plan: {state['plan'][:100]}...")
86
 
87
- # Stage 2: Tools are available but no dynamic planning yet
88
- # For now, just demonstrate tool registry is loaded
89
- tool_calls = []
90
-
91
- # Log available tools
92
- for tool_name, tool_info in TOOLS.items():
93
- logger.info(f" Available tool: {tool_name} - {tool_info['description']}")
94
- tool_calls.append({
95
- "tool": tool_name,
96
- "status": "ready",
97
- "description": tool_info["description"],
98
- "category": tool_info["category"]
99
- })
100
-
101
- state["tool_calls"] = tool_calls
102
-
103
- logger.info(f"[execute_node] {len(tool_calls)} tools ready for Stage 3 dynamic execution")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
104
 
105
  return state
106
 
@@ -109,22 +181,39 @@ def answer_node(state: AgentState) -> AgentState:
109
  """
110
  Answer synthesis node: Generate final factoid answer.
111
 
112
- Stage 2: Summarize tool availability
113
- Stage 3: Synthesize answer from tool execution results
 
 
114
 
115
  Args:
116
- state: Current agent state with tool results
117
 
118
  Returns:
119
- Updated state with final answer
120
  """
121
- logger.info(f"[answer_node] Processing {len(state['tool_calls'])} tool results")
 
 
 
 
 
 
 
 
 
 
 
 
 
122
 
123
- # Stage 2: Report tool readiness
124
- ready_tools = [t["tool"] for t in state["tool_calls"] if t["status"] == "ready"]
125
- state["answer"] = f"Stage 2 complete: {len(ready_tools)} tools ready for execution in Stage 3"
126
 
127
- logger.info(f"[answer_node] Answer generated: {state['answer']}")
 
 
 
128
 
129
  return state
130
 
@@ -199,8 +288,11 @@ class GAIAAgent:
199
  # Initialize state
200
  initial_state: AgentState = {
201
  "question": question,
 
202
  "plan": None,
203
  "tool_calls": [],
 
 
204
  "answer": None,
205
  "errors": []
206
  }
 
17
  from typing import TypedDict, List, Optional
18
  from langgraph.graph import StateGraph, END
19
  from src.config import Settings
20
+ from src.tools import TOOLS, search, parse_file, safe_eval, analyze_image
21
+ from src.agent.llm_client import plan_question, select_tools_with_function_calling, synthesize_answer
22
 
23
  # ============================================================================
24
  # Logging Setup
 
36
  Tracks question processing from input through planning, execution, to final answer.
37
  """
38
  question: str # Input question from GAIA
39
+ file_paths: Optional[List[str]] # Optional file paths for file-based questions
40
  plan: Optional[str] # Generated execution plan (Stage 3)
41
+ tool_calls: List[dict] # Tool invocation tracking (Stage 3)
42
+ tool_results: List[dict] # Tool execution results (Stage 3)
43
+ evidence: List[str] # Evidence collected from tools (Stage 3)
44
  answer: Optional[str] # Final factoid answer
45
  errors: List[str] # Error messages from failures
46
 
 
53
  """
54
  Planning node: Analyze question and generate execution plan.
55
 
 
56
  Stage 3: Dynamic planning with LLM
57
+ - LLM analyzes question and available tools
58
+ - Generates step-by-step execution plan
59
+ - Identifies which tools to use and in what order
60
 
61
  Args:
62
  state: Current agent state with question
 
66
  """
67
  logger.info(f"[plan_node] Question received: {state['question'][:100]}...")
68
 
69
+ try:
70
+ # Stage 3: Use LLM to generate dynamic execution plan
71
+ plan = plan_question(
72
+ question=state["question"],
73
+ available_tools=TOOLS,
74
+ file_paths=state.get("file_paths")
75
+ )
76
 
77
+ state["plan"] = plan
78
+ logger.info(f"[plan_node] Plan created ({len(plan)} chars)")
79
+
80
+ except Exception as e:
81
+ logger.error(f"[plan_node] Planning failed: {e}")
82
+ state["errors"].append(f"Planning error: {str(e)}")
83
+ state["plan"] = "Error: Unable to create plan"
84
 
85
  return state
86
 
 
89
  """
90
  Execution node: Execute tools based on plan.
91
 
92
+ Stage 3: Dynamic tool selection and execution
93
+ - LLM selects tools via function calling
94
+ - Extracts parameters from question
95
+ - Executes tools and collects results
96
+ - Handles errors with retry logic (in tools)
97
 
98
  Args:
99
  state: Current agent state with plan
100
 
101
  Returns:
102
+ Updated state with tool execution results and evidence
103
  """
104
  logger.info(f"[execute_node] Executing tools - Plan: {state['plan'][:100]}...")
105
 
106
+ # Map tool names to actual functions
107
+ TOOL_FUNCTIONS = {
108
+ "search": search,
109
+ "parse_file": parse_file,
110
+ "safe_eval": safe_eval,
111
+ "analyze_image": analyze_image
112
+ }
113
+
114
+ try:
115
+ # Stage 3: Use LLM function calling to select tools and extract parameters
116
+ tool_calls = select_tools_with_function_calling(
117
+ question=state["question"],
118
+ plan=state["plan"],
119
+ available_tools=TOOLS
120
+ )
121
+
122
+        logger.info(f"[execute_node] LLM selected {len(tool_calls)} tool(s) to execute")
+
+        # Execute each tool call
+        tool_results = []
+        evidence = []
+
+        for tool_call in tool_calls:
+            tool_name = tool_call["tool"]
+            params = tool_call["params"]
+
+            logger.info(f"[execute_node] Executing {tool_name} with params: {params}")
+
+            try:
+                # Get tool function
+                tool_func = TOOL_FUNCTIONS.get(tool_name)
+                if not tool_func:
+                    raise ValueError(f"Tool '{tool_name}' not found in TOOL_FUNCTIONS")
+
+                # Execute tool
+                result = tool_func(**params)
+
+                # Store result
+                tool_results.append({
+                    "tool": tool_name,
+                    "params": params,
+                    "result": result,
+                    "status": "success"
+                })
+
+                # Extract evidence
+                evidence.append(f"[{tool_name}] {result}")
+
+                logger.info(f"[execute_node] {tool_name} executed successfully")
+
+            except Exception as tool_error:
+                logger.error(f"[execute_node] Tool {tool_name} failed: {tool_error}")
+                tool_results.append({
+                    "tool": tool_name,
+                    "params": params,
+                    "error": str(tool_error),
+                    "status": "failed"
+                })
+                state["errors"].append(f"Tool {tool_name} failed: {str(tool_error)}")
+
+        # Update state
+        state["tool_calls"] = tool_calls
+        state["tool_results"] = tool_results
+        state["evidence"] = evidence
+
+        logger.info(f"[execute_node] Executed {len(tool_results)} tool(s), collected {len(evidence)} evidence items")
+
+    except Exception as e:
+        logger.error(f"[execute_node] Execution failed: {e}")
+        state["errors"].append(f"Execution error: {str(e)}")
 
     return state
 
 
     """
     Answer synthesis node: Generate final factoid answer.
 
+    Stage 3: Synthesize answer from evidence
+    - LLM analyzes collected evidence
+    - Resolves conflicts if present
+    - Generates factoid answer in GAIA format
 
     Args:
+        state: Current agent state with evidence from tools
 
     Returns:
+        Updated state with final factoid answer
     """
+    logger.info(f"[answer_node] Processing {len(state['evidence'])} evidence items")
+
+    try:
+        # Check if we have evidence
+        if not state["evidence"]:
+            logger.warning("[answer_node] No evidence collected, cannot generate answer")
+            state["answer"] = "Unable to answer: No evidence collected"
+            return state
+
+        # Stage 3: Use LLM to synthesize factoid answer from evidence
+        answer = synthesize_answer(
+            question=state["question"],
+            evidence=state["evidence"]
+        )
+
+        state["answer"] = answer
+        logger.info(f"[answer_node] Answer generated: {answer}")
+
+    except Exception as e:
+        logger.error(f"[answer_node] Answer synthesis failed: {e}")
+        state["errors"].append(f"Answer synthesis error: {str(e)}")
+        state["answer"] = "Error: Unable to generate answer"
 
     return state
 
 
     # Initialize state
     initial_state: AgentState = {
         "question": question,
+        "file_paths": None,
         "plan": None,
         "tool_calls": [],
+        "tool_results": [],
+        "evidence": [],
         "answer": None,
         "errors": []
     }
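The execute node's core pattern above is a name-to-callable registry with per-tool error capture, so one failing tool does not abort the batch. A minimal, self-contained sketch of that pattern (the two registry entries here are hypothetical stand-ins; the real registry is `TOOL_FUNCTIONS` in the agent code):

```python
# Minimal sketch of the registry-dispatch pattern used by execute_node.
# "add" and "fail" are hypothetical tools for illustration only.
def _fail():
    raise ValueError("boom")

TOOL_FUNCTIONS = {
    "add": lambda a, b: a + b,
    "fail": _fail,
}

def run_tool_calls(tool_calls):
    tool_results, evidence, errors = [], [], []
    for call in tool_calls:
        name, params = call["tool"], call["params"]
        try:
            func = TOOL_FUNCTIONS.get(name)
            if not func:
                raise ValueError(f"Tool '{name}' not found")
            result = func(**params)
            tool_results.append({"tool": name, "result": result, "status": "success"})
            evidence.append(f"[{name}] {result}")
        except Exception as exc:  # one bad tool must not abort the whole batch
            tool_results.append({"tool": name, "error": str(exc), "status": "failed"})
            errors.append(f"Tool {name} failed: {exc}")
    return tool_results, evidence, errors

results, evidence, errors = run_tool_calls([
    {"tool": "add", "params": {"a": 2, "b": 3}},
    {"tool": "fail", "params": {}},
])
print(results[0]["result"], len(errors))  # → 5 1
```

The failed call still produces a structured result entry, so the answer node can see which evidence is missing.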
src/agent/llm_client.py ADDED
@@ -0,0 +1,350 @@
+"""
+LLM Client Module - Centralized Claude API Interactions
+Author: @mangobee
+Date: 2026-01-02
+
+Handles all LLM calls for:
+- Planning (question analysis and execution plan generation)
+- Tool selection (function calling)
+- Answer synthesis (factoid answer generation from evidence)
+- Conflict resolution (evaluating contradictory information)
+
+Based on Level 5 decision: Claude Sonnet 4.5 as primary LLM
+Based on Level 6 decision: LLM function calling for tool selection
+"""
+
+import os
+import logging
+from typing import List, Dict, Optional, Any
+from anthropic import Anthropic
+
+# ============================================================================
+# CONFIG
+# ============================================================================
+
+# LLM Configuration
+MODEL_NAME = "claude-sonnet-4-5-20250929"
+TEMPERATURE = 0  # Deterministic for factoid answers
+MAX_TOKENS = 4096
+
+# ============================================================================
+# Logging Setup
+# ============================================================================
+logger = logging.getLogger(__name__)
+
+# ============================================================================
+# Client Initialization
+# ============================================================================
+
+def create_client() -> Anthropic:
+    """
+    Initialize Anthropic client with API key from environment.
+
+    Returns:
+        Anthropic client instance
+
+    Raises:
+        ValueError: If ANTHROPIC_API_KEY not set
+    """
+    api_key = os.getenv("ANTHROPIC_API_KEY")
+    if not api_key:
+        raise ValueError("ANTHROPIC_API_KEY environment variable not set")
+
+    logger.info(f"Initializing Anthropic client with model: {MODEL_NAME}")
+    return Anthropic(api_key=api_key)
+
+
+# ============================================================================
+# Planning Functions
+# ============================================================================
+
+def plan_question(
+    question: str,
+    available_tools: Dict[str, Dict],
+    file_paths: Optional[List[str]] = None
+) -> str:
+    """
+    Analyze question and generate execution plan using LLM.
+
+    LLM determines:
+    - Which tools are needed
+    - What order to execute them
+    - What parameters to extract from question
+    - Expected reasoning steps
+
+    Args:
+        question: GAIA question text
+        available_tools: Tool registry (name -> {description, category, parameters})
+        file_paths: Optional list of file paths for file-based questions
+
+    Returns:
+        Execution plan as structured text
+    """
+    client = create_client()
+
+    # Format tool information
+    tool_descriptions = []
+    for name, info in available_tools.items():
+        tool_descriptions.append(
+            f"- {name}: {info['description']} (Category: {info['category']})"
+        )
+    tools_text = "\n".join(tool_descriptions)
+
+    # File context
+    file_context = ""
+    if file_paths:
+        file_context = "\n\nAvailable files:\n" + "\n".join([f"- {fp}" for fp in file_paths])
+
+    # Prompt for planning
+    system_prompt = """You are a planning agent for answering complex questions.
+
+Your task is to analyze the question and create a step-by-step execution plan.
+
+Consider:
+1. What information is needed to answer the question?
+2. Which tools can provide that information?
+3. In what order should tools be executed?
+4. What parameters need to be extracted from the question?
+
+Generate a concise plan with numbered steps."""
+
+    user_prompt = f"""Question: {question}{file_context}
+
+Available tools:
+{tools_text}
+
+Create an execution plan to answer this question. Format as numbered steps."""
+
+    logger.info("[plan_question] Calling LLM for planning")
+
+    response = client.messages.create(
+        model=MODEL_NAME,
+        max_tokens=MAX_TOKENS,
+        temperature=TEMPERATURE,
+        system=system_prompt,
+        messages=[{"role": "user", "content": user_prompt}]
+    )
+
+    plan = response.content[0].text
+    logger.info(f"[plan_question] Generated plan ({len(plan)} chars)")
+
+    return plan
+
+
+# ============================================================================
+# Tool Selection and Execution Functions
+# ============================================================================
+
+def select_tools_with_function_calling(
+    question: str,
+    plan: str,
+    available_tools: Dict[str, Dict]
+) -> List[Dict[str, Any]]:
+    """
+    Use Claude function calling to dynamically select tools and extract parameters.
+
+    LLM decides:
+    - Which tools to call
+    - What parameters to pass to each tool
+    - Order of tool execution
+
+    Args:
+        question: GAIA question text
+        plan: Execution plan from planning phase
+        available_tools: Tool registry
+
+    Returns:
+        List of tool calls with extracted parameters:
+        [{"tool": "search", "params": {"query": "..."}}, ...]
+    """
+    client = create_client()
+
+    # Convert tool registry to Claude function calling format
+    tool_schemas = []
+    for name, info in available_tools.items():
+        tool_schemas.append({
+            "name": name,
+            "description": info["description"],
+            "input_schema": {
+                "type": "object",
+                "properties": info.get("parameters", {}),
+                "required": info.get("required_params", [])
+            }
+        })
+
+    system_prompt = f"""You are a tool selection agent. Based on the question and execution plan, select appropriate tools to use.
+
+Execute the plan step by step. Call the necessary tools with correct parameters extracted from the question.
+
+Plan:
+{plan}"""
+
+    user_prompt = f"""Question: {question}
+
+Select and call the tools needed to answer this question according to the plan."""
+
+    logger.info(f"[select_tools] Calling LLM with function calling for {len(tool_schemas)} tools")
+
+    response = client.messages.create(
+        model=MODEL_NAME,
+        max_tokens=MAX_TOKENS,
+        temperature=TEMPERATURE,
+        system=system_prompt,
+        messages=[{"role": "user", "content": user_prompt}],
+        tools=tool_schemas
+    )
+
+    # Extract tool calls from response
+    tool_calls = []
+    for content_block in response.content:
+        if content_block.type == "tool_use":
+            tool_calls.append({
+                "tool": content_block.name,
+                "params": content_block.input,
+                "id": content_block.id
+            })
+
+    logger.info(f"[select_tools] LLM selected {len(tool_calls)} tool(s)")
+
+    return tool_calls
+
+
+# ============================================================================
+# Answer Synthesis Functions
+# ============================================================================
+
+def synthesize_answer(
+    question: str,
+    evidence: List[str]
+) -> str:
+    """
+    Synthesize factoid answer from collected evidence using LLM.
+
+    LLM must:
+    - Extract key information from evidence
+    - Resolve any conflicts between sources
+    - Format as factoid (number, few words, or comma-separated list)
+    - Return concise answer matching GAIA format requirements
+
+    Args:
+        question: Original GAIA question
+        evidence: List of evidence strings from tool executions
+
+    Returns:
+        Factoid answer string
+    """
+    client = create_client()
+
+    # Format evidence
+    evidence_text = "\n\n".join([f"Evidence {i+1}:\n{e}" for i, e in enumerate(evidence)])
+
+    system_prompt = """You are an answer synthesis agent for the GAIA benchmark.
+
+Your task is to extract a factoid answer from the provided evidence.
+
+CRITICAL - Answer format requirements:
+1. Answers must be factoids: a number, a few words, or a comma-separated list
+2. Be concise - no explanations, just the answer
+3. If evidence conflicts, evaluate source credibility and recency
+4. If evidence is insufficient, state "Unable to answer"
+
+Examples of good factoid answers:
+- "42"
+- "Paris"
+- "Albert Einstein"
+- "red, blue, green"
+- "1969-07-20"
+
+Examples of bad answers (too verbose):
+- "The answer is 42 because..."
+- "Based on the evidence, it appears that..."
+"""
+
+    user_prompt = f"""Question: {question}
+
+{evidence_text}
+
+Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""
+
+    logger.info(f"[synthesize_answer] Calling LLM for answer synthesis from {len(evidence)} evidence items")
+
+    response = client.messages.create(
+        model=MODEL_NAME,
+        max_tokens=256,  # Factoid answers are short
+        temperature=TEMPERATURE,
+        system=system_prompt,
+        messages=[{"role": "user", "content": user_prompt}]
+    )
+
+    answer = response.content[0].text.strip()
+    logger.info(f"[synthesize_answer] Generated answer: {answer}")
+
+    return answer
+
+
+# ============================================================================
+# Conflict Resolution Functions
+# ============================================================================
+
+def resolve_conflicts(evidence: List[str]) -> Dict[str, Any]:
+    """
+    Detect and resolve conflicts in evidence using LLM reasoning.
+
+    Optional function for advanced conflict handling.
+    Currently integrated into synthesize_answer().
+
+    Args:
+        evidence: List of evidence strings that may conflict
+
+    Returns:
+        Dictionary with:
+        - has_conflicts: bool
+        - conflicts: List of identified conflicts
+        - resolution: Recommended resolution strategy
+    """
+    client = create_client()
+
+    evidence_text = "\n\n".join([f"Evidence {i+1}:\n{e}" for i, e in enumerate(evidence)])
+
+    system_prompt = """You are a conflict detection agent.
+
+Analyze the provided evidence and identify any contradictions or conflicts.
+
+Evaluate:
+1. Are there contradictory facts?
+2. Which sources are more credible?
+3. Which information is more recent?
+4. How should conflicts be resolved?"""
+
+    user_prompt = f"""Analyze this evidence for conflicts:
+
+{evidence_text}
+
+Respond in JSON format:
+{{
+    "has_conflicts": true/false,
+    "conflicts": ["description of conflict 1", ...],
+    "resolution": "recommended resolution strategy"
+}}"""
+
+    logger.info(f"[resolve_conflicts] Analyzing {len(evidence)} evidence items for conflicts")
+
+    response = client.messages.create(
+        model=MODEL_NAME,
+        max_tokens=MAX_TOKENS,
+        temperature=TEMPERATURE,
+        system=system_prompt,
+        messages=[{"role": "user", "content": user_prompt}]
+    )
+
+    # For MVP, return simple structure
+    # In production, would parse JSON from response
+    result = {
+        "has_conflicts": False,
+        "conflicts": [],
+        "resolution": response.content[0].text
+    }
+
+    logger.info("[resolve_conflicts] Analysis complete")
+
+    return result
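The registry-to-schema conversion in `select_tools_with_function_calling` is a pure data transformation, so it can be exercised without an API key. A sketch mirroring that conversion (the `search` registry entry below is a hypothetical example; field names `description`, `parameters`, and `required_params` follow the registry layout used above):

```python
def to_claude_schemas(available_tools):
    """Convert a registry (name -> metadata) into Claude tool-use schemas."""
    return [
        {
            "name": name,
            "description": info["description"],
            "input_schema": {
                "type": "object",
                "properties": info.get("parameters", {}),
                "required": info.get("required_params", []),
            },
        }
        for name, info in available_tools.items()
    ]

# Hypothetical registry entry for illustration
registry = {
    "search": {
        "description": "Web search",
        "category": "web",
        "parameters": {"query": {"type": "string"}},
        "required_params": ["query"],
    }
}
schemas = to_claude_schemas(registry)
print(schemas[0]["input_schema"]["required"])  # → ['query']
```

Because the conversion is side-effect free, it is easy to unit-test separately from the LLM call itself.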
test/test_llm_integration.py ADDED
@@ -0,0 +1,287 @@
+"""
+LLM Integration Tests - Stage 3 Validation
+Author: @mangobee
+Date: 2026-01-02
+
+Tests for Stage 3 LLM integration:
+- Planning with LLM
+- Tool selection via function calling
+- Answer synthesis from evidence
+- Full workflow with mocked LLM responses
+"""
+
+import pytest
+from unittest.mock import patch, MagicMock
+from src.agent.llm_client import (
+    plan_question,
+    select_tools_with_function_calling,
+    synthesize_answer
+)
+from src.tools import TOOLS
+
+
+class TestPlanningFunction:
+    """Test LLM-based planning function."""
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_plan_question_basic(self, mock_anthropic):
+        """Test planning with simple question."""
+        # Mock LLM response
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+        mock_response.content = [MagicMock(text="1. Search for information\n2. Analyze results")]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test planning
+        plan = plan_question(
+            question="What is the capital of France?",
+            available_tools=TOOLS
+        )
+
+        assert isinstance(plan, str)
+        assert len(plan) > 0
+        print(f"✓ Generated plan: {plan[:50]}...")
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_plan_with_files(self, mock_anthropic):
+        """Test planning with file context."""
+        # Mock LLM response
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+        mock_response.content = [MagicMock(text="1. Parse file\n2. Extract data\n3. Calculate answer")]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test planning with files
+        plan = plan_question(
+            question="What is the total in the spreadsheet?",
+            available_tools=TOOLS,
+            file_paths=["data.xlsx"]
+        )
+
+        assert isinstance(plan, str)
+        assert len(plan) > 0
+        print(f"✓ Generated plan with files: {plan[:50]}...")
+
+
+class TestToolSelection:
+    """Test LLM function calling for tool selection."""
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_select_single_tool(self, mock_anthropic):
+        """Test selecting single tool with parameters."""
+        # Mock LLM response with function call
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+
+        # Mock tool_use content block
+        mock_tool_use = MagicMock()
+        mock_tool_use.type = "tool_use"
+        mock_tool_use.name = "search"
+        mock_tool_use.input = {"query": "capital of France"}
+        mock_tool_use.id = "call_001"
+
+        mock_response.content = [mock_tool_use]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test tool selection
+        tool_calls = select_tools_with_function_calling(
+            question="What is the capital of France?",
+            plan="1. Search for capital of France",
+            available_tools=TOOLS
+        )
+
+        assert isinstance(tool_calls, list)
+        assert len(tool_calls) == 1
+        assert tool_calls[0]["tool"] == "search"
+        assert "query" in tool_calls[0]["params"]
+        print(f"✓ Selected tool: {tool_calls[0]}")
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_select_multiple_tools(self, mock_anthropic):
+        """Test selecting multiple tools in sequence."""
+        # Mock LLM response with multiple function calls
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+
+        # Mock multiple tool_use blocks
+        mock_tool1 = MagicMock()
+        mock_tool1.type = "tool_use"
+        mock_tool1.name = "parse_file"
+        mock_tool1.input = {"file_path": "data.xlsx"}
+        mock_tool1.id = "call_001"
+
+        mock_tool2 = MagicMock()
+        mock_tool2.type = "tool_use"
+        mock_tool2.name = "safe_eval"
+        mock_tool2.input = {"expression": "sum(values)"}
+        mock_tool2.id = "call_002"
+
+        mock_response.content = [mock_tool1, mock_tool2]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test tool selection
+        tool_calls = select_tools_with_function_calling(
+            question="What is the sum in data.xlsx?",
+            plan="1. Parse file\n2. Calculate sum",
+            available_tools=TOOLS
+        )
+
+        assert isinstance(tool_calls, list)
+        assert len(tool_calls) == 2
+        assert tool_calls[0]["tool"] == "parse_file"
+        assert tool_calls[1]["tool"] == "safe_eval"
+        print(f"✓ Selected {len(tool_calls)} tools")
+
+
+class TestAnswerSynthesis:
+    """Test LLM-based answer synthesis."""
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_synthesize_simple_answer(self, mock_anthropic):
+        """Test synthesizing answer from single evidence."""
+        # Mock LLM response
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+        mock_response.content = [MagicMock(text="Paris")]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test answer synthesis
+        answer = synthesize_answer(
+            question="What is the capital of France?",
+            evidence=["[search] Paris is the capital and most populous city of France"]
+        )
+
+        assert isinstance(answer, str)
+        assert len(answer) > 0
+        assert answer == "Paris"
+        print(f"✓ Synthesized answer: {answer}")
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_synthesize_from_multiple_evidence(self, mock_anthropic):
+        """Test synthesizing answer from multiple evidence sources."""
+        # Mock LLM response
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+        mock_response.content = [MagicMock(text="42")]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test answer synthesis with multiple evidence
+        answer = synthesize_answer(
+            question="What is the answer?",
+            evidence=[
+                "[search] The answer to life is 42",
+                "[safe_eval] 6 * 7 = 42",
+                "[parse_file] Result: 42"
+            ]
+        )
+
+        assert isinstance(answer, str)
+        assert answer == "42"
+        print(f"✓ Synthesized answer from 3 evidence items: {answer}")
+
+    @patch('src.agent.llm_client.Anthropic')
+    def test_synthesize_with_conflicts(self, mock_anthropic):
+        """Test synthesizing answer when evidence conflicts."""
+        # Mock LLM response - should resolve conflict
+        mock_client = MagicMock()
+        mock_response = MagicMock()
+        mock_response.content = [MagicMock(text="Paris")]
+        mock_client.messages.create.return_value = mock_response
+        mock_anthropic.return_value = mock_client
+
+        # Test answer synthesis with conflicting evidence
+        answer = synthesize_answer(
+            question="What is the capital of France?",
+            evidence=[
+                "[search] Paris is the capital of France (source: Wikipedia, 2024)",
+                "[search] Lyon was briefly capital during revolution (source: old text, 1793)"
+            ]
+        )
+
+        assert isinstance(answer, str)
+        assert answer == "Paris"  # Should pick more recent/credible source
+        print(f"✓ Resolved conflict, answer: {answer}")
+
+
+class TestEndToEndWorkflow:
+    """Test full agent workflow with mocked LLM."""
+
+    @patch('src.agent.llm_client.Anthropic')
+    @patch('src.tools.web_search.tavily_search')
+    def test_full_search_workflow(self, mock_tavily, mock_anthropic):
+        """Test complete workflow: plan → search → answer."""
+        from src.agent import GAIAAgent
+
+        # Mock tool execution
+        mock_tavily.return_value = "Paris is the capital and most populous city of France"
+
+        # Mock LLM responses:
+        # Response 1: Planning
+        # Response 2: Tool selection (function calling)
+        # Response 3: Answer synthesis
+        mock_client = MagicMock()
+
+        mock_plan_response = MagicMock()
+        mock_plan_response.content = [MagicMock(text="1. Search for capital of France")]
+
+        mock_tool_response = MagicMock()
+        mock_tool_use = MagicMock()
+        mock_tool_use.type = "tool_use"
+        mock_tool_use.name = "search"
+        mock_tool_use.input = {"query": "capital of France"}
+        mock_tool_use.id = "call_001"
+        mock_tool_response.content = [mock_tool_use]
+
+        mock_answer_response = MagicMock()
+        mock_answer_response.content = [MagicMock(text="Paris")]
+
+        # Set up mock to return different responses for each call
+        mock_client.messages.create.side_effect = [
+            mock_plan_response,
+            mock_tool_response,
+            mock_answer_response
+        ]
+
+        mock_anthropic.return_value = mock_client
+
+        # Test full workflow
+        agent = GAIAAgent()
+        answer = agent("What is the capital of France?")
+
+        assert isinstance(answer, str)
+        assert answer == "Paris"
+        print(f"✓ Full workflow completed, answer: {answer}")
+
+
+if __name__ == "__main__":
+    print("\n" + "="*70)
+    print("GAIA Agent - Stage 3 LLM Integration Tests")
+    print("="*70 + "\n")
+
+    # Run tests manually for quick validation
+    test_plan = TestPlanningFunction()
+    test_plan.test_plan_question_basic()
+    test_plan.test_plan_with_files()
+
+    test_tools = TestToolSelection()
+    test_tools.test_select_single_tool()
+    test_tools.test_select_multiple_tools()
+
+    test_answer = TestAnswerSynthesis()
+    test_answer.test_synthesize_simple_answer()
+    test_answer.test_synthesize_from_multiple_evidence()
+    test_answer.test_synthesize_with_conflicts()
+
+    test_e2e = TestEndToEndWorkflow()
+    test_e2e.test_full_search_workflow()
+
+    print("\n" + "="*70)
+    print("✓ All Stage 3 LLM integration tests passed!")
+    print("="*70 + "\n")
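The end-to-end test above leans on `unittest.mock`'s `side_effect` list so each successive `messages.create` call returns a different canned response (plan, then tool call, then answer). The mechanism in isolation:

```python
from unittest.mock import MagicMock

def fake_response(text):
    """Build a stand-in for an Anthropic response: .content[0].text == text."""
    resp = MagicMock()
    resp.content = [MagicMock(text=text)]
    return resp

client = MagicMock()
# Each call to messages.create consumes the next item in the list:
client.messages.create.side_effect = [fake_response("plan"), fake_response("Paris")]

first = client.messages.create(model="m").content[0].text
second = client.messages.create(model="m").content[0].text
print(first, second)  # → plan Paris
```

Once the list is exhausted, a further call raises `StopIteration`, which conveniently flags a test that makes more LLM calls than expected.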
test/test_stage3_e2e.py ADDED
@@ -0,0 +1,77 @@
+"""
+Stage 3 End-to-End Test with Real LLM API
+Author: @mangobee
+Date: 2026-01-02
+
+Manual test for Stage 3 workflow with actual Claude API.
+Requires ANTHROPIC_API_KEY environment variable.
+
+Usage:
+    ANTHROPIC_API_KEY=your_key uv run python test/test_stage3_e2e.py
+"""
+
+import os
+import sys
+from src.agent import GAIAAgent
+
+print("\n" + "="*70)
+print("Stage 3: End-to-End Test with Real LLM API")
+print("="*70 + "\n")
+
+# Check API key
+api_key = os.getenv("ANTHROPIC_API_KEY")
+if not api_key:
+    print("✗ ANTHROPIC_API_KEY not set. Skipping real API test.")
+    print("\nTo run this test:")
+    print("  export ANTHROPIC_API_KEY=your_key")
+    print("  uv run python test/test_stage3_e2e.py")
+    sys.exit(0)
+
+print("✓ ANTHROPIC_API_KEY found\n")
+
+# Test questions
+test_questions = [
+    {
+        "question": "What is 25 * 17?",
+        "expected_answer": "425",
+        "description": "Simple math (should use calculator)"
+    },
+    {
+        "question": "What is the capital of Japan?",
+        "expected_answer": "Tokyo",
+        "description": "Factual knowledge (should use search)"
+    }
+]
+
+print("Testing GAIA Agent with real questions...\n")
+
+for i, test in enumerate(test_questions, 1):
+    print(f"Test {i}: {test['description']}")
+    print(f"  Question: {test['question']}")
+    print(f"  Expected: {test['expected_answer']}")
+
+    try:
+        agent = GAIAAgent()
+        answer = agent(test['question'])
+
+        print(f"  Answer: {answer}")
+
+        # Check if answer is reasonable
+        if test['expected_answer'].lower() in answer.lower():
+            print("  Status: ✓ PASS - Answer contains expected value")
+        else:
+            print("  Status: ⚠ PARTIAL - Answer may be correct but differs from expected")
+
+    except Exception as e:
+        print(f"  Status: ✗ FAIL - Error: {e}")
+
+    print()
+
+print("="*70)
+print("✓ Stage 3 E2E test complete!")
+print("="*70 + "\n")
+
+print("Next steps:")
+print("1. Review answers above")
+print("2. If successful, deploy to HuggingFace Spaces")
+print("3. Test on full GAIA validation set")
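The pass check in the E2E script is deliberately lenient: a case-insensitive containment test, since a live LLM may phrase "Tokyo" as "Tokyo, Japan". The check factored out as a sketch:

```python
def answer_matches(expected, answer):
    """Lenient E2E check: case-insensitive containment of the expected value."""
    return expected.lower() in answer.lower()

print(answer_matches("Tokyo", "The capital is Tokyo."))  # → True
```

Exact-match scoring (as GAIA itself uses) would be stricter; containment is only meant for quick manual smoke testing.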