mangubee Claude Sonnet 4.5 commited on
Commit
4eed151
·
1 Parent(s): 24cb1b4

Feat: Add HuggingFace Inference API as free LLM fallback tier

Browse files

Stage 4 completion - Added 3-tier fallback architecture:
- Tier 1: Gemini 2.0 Flash (free, daily quota)
- Tier 2: HuggingFace Qwen 2.5 72B (free, rate limited) - NEW
- Tier 3: Claude Sonnet 4.5 (paid)
- Tier 4: Keyword matching (deterministic)

Changes:
- Added HF integration to llm_client.py (~150 lines)
- Added HF_TOKEN validation in graph.py
- Updated UI to show HF_TOKEN status in app.py
- Fixed TOOLS schema bug (list → dict format)
- Created comprehensive dev log

Tests: 99/99 passing ✅

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

CHANGELOG.md CHANGED
@@ -1,22 +1,122 @@
1
  # Session Changelog
2
 
3
- **Session Date:** [YYYY-MM-DD]
4
- **Dev Record:** [link to dev/dev_YYMMDD_##_concise_title.md]
5
 
6
  ## Changes Made
7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
8
  ### Created Files
9
 
10
- - [file path] - [Purpose/description]
 
 
 
11
 
12
- ### Modified Files
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
13
 
14
- - [file path] - [What was changed]
15
 
16
- ### Deleted Files
 
 
 
 
 
 
 
17
 
18
- - [file path] - [Reason for deletion]
 
 
19
 
20
  ## Notes
21
 
22
- [Any additional context about the session's work]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  # Session Changelog
2
 
3
+ **Session Date:** 2026-01-03
4
+ **Dev Record:** dev/dev_260103_16_huggingface_integration.md
5
 
6
  ## Changes Made
7
 
8
+ ### Modified Files
9
+
10
+ - **src/agent/llm_client.py** (~150 lines added)
11
+ - Added `create_hf_client()` - Initialize HuggingFace InferenceClient with HF_TOKEN
12
+ - Added `plan_question_hf(question, available_tools, file_paths)` - Planning with Qwen 2.5 72B
13
+ - Added `select_tools_hf(question, plan, available_tools)` - Function calling with OpenAI-compatible tools format
14
+ - Added `synthesize_answer_hf(question, evidence)` - Answer synthesis from evidence
15
+ - Updated `plan_question()` - Added HuggingFace as middle fallback tier (Gemini → HF → Claude)
16
+ - Updated `select_tools_with_function_calling()` - Added HuggingFace as middle fallback tier
17
+ - Updated `synthesize_answer()` - Added HuggingFace as middle fallback tier
18
+ - Added CONFIG constant: `HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"`
19
+ - Added import: `from huggingface_hub import InferenceClient`
20
+
21
+ - **src/agent/graph.py**
22
+ - Updated `validate_environment()` - Added HF_TOKEN to API key validation check
23
+ - Updated startup logging - Shows ⚠️ WARNING if HF_TOKEN missing
24
+
25
+ - **app.py**
26
+ - Updated `check_api_keys()` - Added HF_TOKEN status display in Test & Debug tab
27
+ - UI now shows: "HF_TOKEN (HuggingFace): ✓ SET" or "✗ MISSING"
28
+
29
+ - **src/tools/__init__.py** (Fixed earlier in session)
30
+ - Fixed TOOLS schema bug - Changed parameters from list to dict format
31
+ - Updated all tool definitions to include type/description for each parameter
32
+ - Added `"required_params"` field to specify required parameters
33
+ - Fixed Gemini function calling compatibility
34
+
35
  ### Created Files
36
 
37
+ - **dev/dev_260103_16_huggingface_integration.md**
38
+ - Comprehensive dev log documenting Stage 4 completion and HuggingFace integration
39
+ - Documents 3-tier fallback architecture (Gemini → HuggingFace → Claude)
40
+ - Includes key decisions, learnings, and test results
41
 
42
+ ### No Files Deleted
43
+
44
+ ## Implementation Summary
45
+
46
+ **Stage 4: MVP - Real Integration + HuggingFace Free LLM Fallback**
47
+
48
+ **Goal:** Fix LLM availability issues by adding completely free alternative when Gemini quota exhausted and Claude credits depleted.
49
+
50
+ **Problem Identified:**
51
+ - Gemini 2.0 Flash quota exceeded (1,500 requests/day free tier limit exhausted)
52
+ - Claude Sonnet 4.5 credit balance too low (paid tier, user's balance depleted)
53
+ - Agent falling back to keyword-based tool selection (Stage 4 fallback mechanism)
54
+
55
+ **Solution Implemented:**
56
+ - Added HuggingFace Inference API (Qwen 2.5 72B Instruct) as free middle tier
57
+ - 3-tier fallback chain: Gemini (free, daily quota) → HuggingFace (free, rate limited) → Claude (paid) → Keyword matching
58
+ - All LLM functions updated: planning, tool selection with function calling, answer synthesis
59
 
60
+ **Completed (8/10 Stage 4 tasks):**
61
 
62
+ 1. **Comprehensive Debug Logging** - All nodes log inputs, LLM details, tool execution, state transitions
63
+ 2. ✅ **Improved Error Messages** - answer_node shows specific failure reasons and suggestions
64
+ 3. ✅ **API Key Validation** - Agent startup checks GOOGLE_API_KEY, HF_TOKEN, ANTHROPIC_API_KEY, TAVILY_API_KEY
65
+ 4. ✅ **Tool Execution Error Handling** - execute_node validates tool_calls, handles exceptions gracefully
66
+ 5. ✅ **Fallback Tool Execution** - Keyword matching when LLM function calling fails
67
+ 6. ✅ **LLM Exception Handling** - 3-tier fallback with comprehensive error capture
68
+ 7. ✅ **Diagnostics Display** - Test & Debug tab shows API status, plan, tools, evidence, errors, answer
69
+ 8. ✅ **Documentation** - Dev log created (dev_260103_16_huggingface_integration.md)
70
 
71
+ **Remaining (2/10 tasks):**
72
+ 9. ⏳ Update README with API key setup instructions
73
+ 10. ⏳ Deploy to HF Space and run GAIA validation (target: 5/20 from 0/20)
74
 
75
  ## Notes
76
 
77
+ **Test Results:**
78
+
79
+ All tests passing with 3-tier fallback architecture:
80
+ ```bash
81
+ uv run pytest test/ -q
82
+ ======================== 99 passed, 11 warnings in 51.99s ========================
83
+ ```
84
+
85
+ **Key Technical Achievements:**
86
+
87
+ 1. **3-Tier Fallback Architecture:**
88
+ - Tier 1: Gemini 2.0 Flash (free, 1,500 req/day)
89
+ - Tier 2: HuggingFace Qwen 2.5 72B (free, rate limited) - NEW
90
+ - Tier 3: Claude Sonnet 4.5 (paid, credits)
91
+ - Tier 4: Keyword matching (deterministic fallback)
92
+
93
+ 2. **Function Calling Compatibility:**
94
+ - Gemini: `genai.protos.Tool` with `function_declarations`
95
+ - HuggingFace: OpenAI-compatible tools array format
96
+ - Claude: Anthropic native tools format
97
+ - Single source of truth in `src/tools/__init__.py` with provider-specific transformations
98
+
99
+ 3. **TOOLS Schema Bug Fix:**
100
+ - Changed parameters from list `["query"]` to dict `{"query": {"type": "string", ...}}`
101
+ - Fixed Gemini function calling `'list' object has no attribute 'items'` error
102
+ - All LLM providers now compatible with unified schema
103
+
104
+ **Known Issues (Resolved):**
105
+
106
+ - ✅ Gemini quota exceeded → HuggingFace fallback works
107
+ - ✅ Claude credit balance low → HuggingFace fallback works
108
+ - ✅ TOOLS schema mismatch → Fixed with dict format
109
+
110
+ **Next Steps:**
111
+
112
+ 1. **User:** Set up HF_TOKEN in HuggingFace Space environment variables (in progress)
113
+ 2. **Update README:** Add API key setup instructions for all 4 providers
114
+ 3. **Deploy:** Test with real GAIA validation questions
115
+ 4. **Target:** Achieve 5/20 GAIA questions answered correctly (up from 0/20)
116
+
117
+ **Architectural Improvements Made:**
118
+
119
+ - **Free-first strategy:** Maximize free tier usage before burning paid credits
120
+ - **Diverse quota models:** Daily limits (Gemini) + rate limits (HF) provide better resilience
121
+ - **Function calling standardization:** Single source of truth with provider-specific transformations
122
+ - **Early validation:** Check all API keys at agent startup, not at first use
app.py CHANGED
@@ -4,6 +4,7 @@ import requests
4
  import inspect
5
  import pandas as pd
6
  import logging
 
7
 
8
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
9
  from src.agent import GAIAAgent
@@ -20,6 +21,110 @@ logger = logging.getLogger(__name__)
20
  DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
21
 
22
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23
  # --- GAIA Agent (Replaced BasicAgent) ---
24
  # LangGraph-based agent with sequential workflow
25
  # Stage 1: Placeholder nodes, returns fixed answer
@@ -173,33 +278,88 @@ def run_and_submit_all(profile: gr.OAuthProfile | None):
173
 
174
  # --- Build Gradio Interface using Blocks ---
175
  with gr.Blocks() as demo:
176
- gr.Markdown("# GAIA Agent Evaluation Runner (Stage 1: Foundation)")
177
  gr.Markdown(
178
  """
179
- **Instructions:**
180
-
181
- 1. Please clone this space, then modify the code to define your agent's logic, the tools, the necessary packages, etc ...
182
- 2. Log in to your Hugging Face account using the button below. This uses your HF username for submission.
183
- 3. Click 'Run Evaluation & Submit All Answers' to fetch questions, run your agent, submit answers, and see the score.
184
-
185
- ---
186
- **Disclaimers:**
187
- Once clicking on the "submit button, it can take quite some time ( this is the time for the agent to go through all the questions).
188
- This space provides a basic setup and is intentionally sub-optimal to encourage you to develop your own, more robust solution. For instance for the delay process of the submit button, a solution could be to cache the answers and submit in a seperate action or even to answer the questions in async.
189
  """
190
  )
191
 
192
- gr.LoginButton()
 
 
 
 
193
 
194
- run_button = gr.Button("Run Evaluation & Submit All Answers")
 
 
 
 
 
 
 
195
 
196
- status_output = gr.Textbox(
197
- label="Run Status / Submission Result", lines=5, interactive=False
198
- )
199
- # Removed max_rows=10 from DataFrame constructor
200
- results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
201
 
202
- run_button.click(fn=run_and_submit_all, outputs=[status_output, results_table])
203
 
204
  if __name__ == "__main__":
205
  print("\n" + "-" * 30 + " App Starting " + "-" * 30)
 
4
  import inspect
5
  import pandas as pd
6
  import logging
7
+ import json
8
 
9
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
10
  from src.agent import GAIAAgent
 
21
  DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
22
 
23
 
24
+ # --- Helper Functions ---
25
+ def check_api_keys():
26
+ """Check which API keys are configured."""
27
+ keys_status = {
28
+ "GOOGLE_API_KEY (Gemini)": "✓ SET" if os.getenv("GOOGLE_API_KEY") else "✗ MISSING",
29
+ "HF_TOKEN (HuggingFace)": "✓ SET" if os.getenv("HF_TOKEN") else "✗ MISSING",
30
+ "ANTHROPIC_API_KEY (Claude)": "✓ SET" if os.getenv("ANTHROPIC_API_KEY") else "✗ MISSING",
31
+ "TAVILY_API_KEY (Search)": "✓ SET" if os.getenv("TAVILY_API_KEY") else "✗ MISSING",
32
+ "EXA_API_KEY (Search)": "✓ SET" if os.getenv("EXA_API_KEY") else "✗ MISSING",
33
+ }
34
+ return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
35
+
36
+
37
+ def format_diagnostics(final_state: dict) -> str:
38
+ """Format agent state for diagnostic display."""
39
+ diagnostics = []
40
+
41
+ # Question
42
+ diagnostics.append(f"**Question:** {final_state.get('question', 'N/A')}\n")
43
+
44
+ # Plan
45
+ plan = final_state.get('plan', 'No plan generated')
46
+ diagnostics.append(f"**Plan:**\n{plan}\n")
47
+
48
+ # Tool calls
49
+ tool_calls = final_state.get('tool_calls', [])
50
+ if tool_calls:
51
+ diagnostics.append(f"**Tools Selected:** {len(tool_calls)} tool(s)")
52
+ for idx, tc in enumerate(tool_calls, 1):
53
+ tool_name = tc.get('tool', 'unknown')
54
+ params = tc.get('params', {})
55
+ diagnostics.append(f" {idx}. {tool_name}({params})")
56
+ diagnostics.append("")
57
+ else:
58
+ diagnostics.append("**Tools Selected:** None\n")
59
+
60
+ # Tool results
61
+ tool_results = final_state.get('tool_results', [])
62
+ if tool_results:
63
+ diagnostics.append(f"**Tool Execution Results:** {len(tool_results)} result(s)")
64
+ for idx, tr in enumerate(tool_results, 1):
65
+ tool_name = tr.get('tool', 'unknown')
66
+ status = tr.get('status', 'unknown')
67
+ if status == 'success':
68
+ result_preview = str(tr.get('result', ''))[:100] + "..." if len(str(tr.get('result', ''))) > 100 else str(tr.get('result', ''))
69
+ diagnostics.append(f" {idx}. {tool_name}: ✓ SUCCESS")
70
+ diagnostics.append(f" Result: {result_preview}")
71
+ else:
72
+ error = tr.get('error', 'Unknown error')
73
+ diagnostics.append(f" {idx}. {tool_name}: ✗ FAILED - {error}")
74
+ diagnostics.append("")
75
+
76
+ # Evidence
77
+ evidence = final_state.get('evidence', [])
78
+ if evidence:
79
+ diagnostics.append(f"**Evidence Collected:** {len(evidence)} item(s)")
80
+ for idx, ev in enumerate(evidence, 1):
81
+ ev_preview = ev[:150] + "..." if len(ev) > 150 else ev
82
+ diagnostics.append(f" {idx}. {ev_preview}")
83
+ diagnostics.append("")
84
+ else:
85
+ diagnostics.append("**Evidence Collected:** None\n")
86
+
87
+ # Errors
88
+ errors = final_state.get('errors', [])
89
+ if errors:
90
+ diagnostics.append(f"**Errors:** {len(errors)} error(s)")
91
+ for idx, err in enumerate(errors, 1):
92
+ diagnostics.append(f" {idx}. {err}")
93
+ diagnostics.append("")
94
+
95
+ # Answer
96
+ answer = final_state.get('answer', 'No answer generated')
97
+ diagnostics.append(f"**Final Answer:** {answer}")
98
+
99
+ return "\n".join(diagnostics)
100
+
101
+
102
+ def test_single_question(question: str):
103
+ """Test agent with a single question and return diagnostics."""
104
+ if not question or not question.strip():
105
+ return "Please enter a question.", "", check_api_keys()
106
+
107
+ try:
108
+ # Initialize agent
109
+ agent = GAIAAgent()
110
+
111
+ # Run agent (this stores final_state in agent.last_state)
112
+ answer = agent(question)
113
+
114
+ # Get final state from agent
115
+ final_state = agent.last_state or {}
116
+
117
+ # Format diagnostics
118
+ diagnostics = format_diagnostics(final_state)
119
+ api_status = check_api_keys()
120
+
121
+ return answer, diagnostics, api_status
122
+
123
+ except Exception as e:
124
+ logger.error(f"Error in test_single_question: {e}", exc_info=True)
125
+ return f"ERROR: {str(e)}", f"Exception occurred: {str(e)}", check_api_keys()
126
+
127
+
128
  # --- GAIA Agent (Replaced BasicAgent) ---
129
  # LangGraph-based agent with sequential workflow
130
  # Stage 1: Placeholder nodes, returns fixed answer
 
278
 
279
  # --- Build Gradio Interface using Blocks ---
280
  with gr.Blocks() as demo:
281
+ gr.Markdown("# GAIA Agent Evaluation Runner (Stage 4: MVP - Real Integration)")
282
  gr.Markdown(
283
  """
284
+ **Stage 4 Progress:** Adding diagnostics, error handling, and fallback mechanisms.
 
 
 
 
 
 
 
 
 
285
  """
286
  )
287
 
288
+ with gr.Tabs():
289
+ # Tab 1: Test Single Question (NEW - for diagnostics)
290
+ with gr.Tab("🔍 Test & Debug"):
291
+ gr.Markdown("""
292
+ **Test Mode:** Run the agent on a single question and see detailed diagnostics.
293
 
294
+ This mode shows:
295
+ - API key status
296
+ - Execution plan
297
+ - Tools selected and executed
298
+ - Evidence collected
299
+ - Errors encountered
300
+ - Final answer
301
+ """)
302
 
303
+ test_question_input = gr.Textbox(
304
+ label="Enter Test Question",
305
+ placeholder="e.g., What is the capital of France?",
306
+ lines=3
307
+ )
308
+ test_button = gr.Button("Run Test", variant="primary")
309
+
310
+ with gr.Row():
311
+ with gr.Column(scale=1):
312
+ test_answer_output = gr.Textbox(
313
+ label="Answer",
314
+ lines=3,
315
+ interactive=False
316
+ )
317
+ test_api_status = gr.Textbox(
318
+ label="API Keys Status",
319
+ lines=5,
320
+ interactive=False
321
+ )
322
+ with gr.Column(scale=2):
323
+ test_diagnostics_output = gr.Textbox(
324
+ label="Execution Diagnostics",
325
+ lines=20,
326
+ interactive=False
327
+ )
328
+
329
+ test_button.click(
330
+ fn=test_single_question,
331
+ inputs=[test_question_input],
332
+ outputs=[test_answer_output, test_diagnostics_output, test_api_status]
333
+ )
334
+
335
+ # Tab 2: Full Evaluation (existing functionality)
336
+ with gr.Tab("📊 Full Evaluation"):
337
+ gr.Markdown(
338
+ """
339
+ **Instructions:**
340
+
341
+ 1. Please clone this space, then modify the code to define your agent's logic, the tools, the necessary packages, etc ...
342
+ 2. Log in to your Hugging Face account using the button below. This uses your HF username for submission.
343
+ 3. Click 'Run Evaluation & Submit All Answers' to fetch questions, run your agent, submit answers, and see the score.
344
+
345
+ ---
346
+ **Disclaimers:**
347
+ Once clicking on the "submit button, it can take quite some time ( this is the time for the agent to go through all the questions).
348
+ This space provides a basic setup and is intentionally sub-optimal to encourage you to develop your own, more robust solution. For instance for the delay process of the submit button, a solution could be to cache the answers and submit in a seperate action or even to answer the questions in async.
349
+ """
350
+ )
351
+
352
+ gr.LoginButton()
353
+
354
+ run_button = gr.Button("Run Evaluation & Submit All Answers")
355
+
356
+ status_output = gr.Textbox(
357
+ label="Run Status / Submission Result", lines=5, interactive=False
358
+ )
359
+ # Removed max_rows=10 from DataFrame constructor
360
+ results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
361
 
362
+ run_button.click(fn=run_and_submit_all, outputs=[status_output, results_table])
363
 
364
  if __name__ == "__main__":
365
  print("\n" + "-" * 30 + " App Starting " + "-" * 30)
dev/dev_260103_16_huggingface_integration.md ADDED
@@ -0,0 +1,313 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # [dev_260103_16] HuggingFace Inference API Integration
2
+
3
+ **Date:** 2026-01-03
4
+ **Type:** Development
5
+ **Status:** Resolved
6
+ **Related Dev:** dev_260102_15_stage4_mvp_real_integration.md
7
+
8
+ ## Problem Description
9
+
10
+ **Context:** Stage 4 implementation was 7/10 complete with comprehensive diagnostics and error handling. However, testing revealed critical LLM availability issues:
11
+
12
+ 1. **Gemini 2.0 Flash** - Quota exceeded (1,500 requests/day free tier limit exhausted from testing)
13
+ 2. **Claude Sonnet 4.5** - Credit balance too low (paid tier, user's balance depleted)
14
+
15
+ **Root Cause:** Agent relied on only 2 LLM tiers (free Gemini → paid Claude), with no middle fallback when free tier exhausted. This caused complete LLM failure, falling back to keyword-based tool selection (Stage 4 fallback mechanism).
16
+
17
+ **User Request:** Add completely free LLM alternative that works in HuggingFace Spaces environment without requiring local GPU resources.
18
+
19
+ **Requirements:**
20
+ - Must be completely free (no credits, reasonable rate limits)
21
+ - Must support function calling (critical for tool selection)
22
+ - Must work in HuggingFace Spaces (cloud-based, no local GPU)
23
+ - Must integrate into existing 3-tier fallback architecture
24
+
25
+ ---
26
+
27
+ ## Key Decisions
28
+
29
+ ### **Decision 1: HuggingFace Inference API over Ollama (local LLMs)**
30
+
31
+ **Why chosen:**
32
+ - ✅ Works in HuggingFace Spaces (cloud-based API)
33
+ - ✅ Free tier with rate limits (~60 req/min vs Gemini's 1,500 req/day)
34
+ - ✅ Function calling support via OpenAI-compatible API
35
+ - ✅ No GPU requirements (serverless inference)
36
+ - ✅ Already deployed to HF Spaces - logical integration
37
+
38
+ **Rejected alternative: Ollama + Llama 3.1 70B (local)**
39
+ - ❌ Requires local GPU or high-end CPU
40
+ - ❌ Won't work in HuggingFace Free Spaces (CPU-only, 16GB RAM limit)
41
+ - ❌ Would need GPU Spaces upgrade (not free)
42
+ - ❌ Complex setup for user's deployment environment
43
+
44
+ ### **Decision 2: Qwen 2.5 72B Instruct as HuggingFace Model**
45
+
46
+ **Why chosen:**
47
+ - ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
48
+ - ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
49
+ - ✅ Free on HuggingFace Inference API
50
+ - ✅ 72B parameters - sufficient intelligence for GAIA tasks
51
+
52
+ **Considered alternatives:**
53
+ - `meta-llama/Llama-3.1-70B-Instruct` - Good but slightly worse function calling
54
+ - `NousResearch/Hermes-3-Llama-3.1-70B` - Excellent but less tested for tool use
55
+
56
+ ### **Decision 3: 3-Tier Fallback Architecture**
57
+
58
+ **Final chain:**
59
+ 1. **Gemini 2.0 Flash** (free, 1,500 req/day) - Primary
60
+ 2. **HuggingFace Qwen 2.5 72B** (free, rate limited) - NEW Middle Tier
61
+ 3. **Claude Sonnet 4.5** (paid) - Expensive fallback
62
+ 4. **Keyword matching** (deterministic) - Last resort
63
+
64
+ **Trade-offs:**
65
+ - **Pro:** 4 layers of resilience ensure agent always produces output
66
+ - **Pro:** Maximizes free tier usage before burning paid credits
67
+ - **Con:** Slightly higher latency on fallback chain traversal
68
+ - **Con:** More API keys to manage (but HF_TOKEN already required for Space)
69
+
70
+ ### **Decision 4: TOOLS Schema Bug Fix (Critical)**
71
+
72
+ **Problem discovered:** `src/tools/__init__.py` had parameters as list `["query"]` but LLM client expected dict `{"query": {...}}` with type/description.
73
+
74
+ **Impact:** Gemini function calling was completely broken - caused `'list' object has no attribute 'items'` error.
75
+
76
+ **Fix:** Updated all tool definitions to proper schema:
77
+ ```python
78
+ "parameters": {
79
+ "query": {
80
+ "description": "Search query string",
81
+ "type": "string"
82
+ },
83
+ "max_results": {
84
+ "description": "Maximum number of search results to return",
85
+ "type": "integer"
86
+ }
87
+ },
88
+ "required_params": ["query"]
89
+ ```
90
+
91
+ **Result:** Gemini function calling now working correctly (verified in tests).
92
+
93
+ ---
94
+
95
+ ## Outcome
96
+
97
+ Successfully integrated HuggingFace Inference API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.
98
+
99
+ **Deliverables:**
100
+
101
+ 1. **src/agent/llm_client.py** - Added ~150 lines of HuggingFace integration
102
+ - `create_hf_client()` - Initialize InferenceClient with HF_TOKEN
103
+ - `plan_question_hf()` - Planning using Qwen 2.5 72B
104
+ - `select_tools_hf()` - Function calling with OpenAI-compatible tools format
105
+ - `synthesize_answer_hf()` - Answer synthesis from evidence
106
+ - Updated unified functions: `plan_question()`, `select_tools_with_function_calling()`, `synthesize_answer()` to use 3-tier fallback
107
+
108
+ 2. **src/agent/graph.py** - Added HF_TOKEN validation
109
+ - Updated `validate_environment()` to check HF_TOKEN at agent startup
110
+ - Shows ⚠️ WARNING if HF_TOKEN missing
111
+
112
+ 3. **app.py** - Updated UI to show HF_TOKEN status
113
+ - Added HF_TOKEN to `check_api_keys()` display in Test & Debug tab
114
+
115
+ 4. **src/tools/__init__.py** - Fixed TOOLS schema bug (earlier in session)
116
+ - Changed parameters from list to dict format
117
+ - Added type/description for each parameter
118
+ - Fixed Gemini function calling compatibility
119
+
120
+ **Test Results:**
121
+ ```bash
122
+ uv run pytest test/ -q
123
+ 99 passed, 11 warnings in 51.99s ✅
124
+ ```
125
+
126
+ All tests passing with new 3-tier fallback architecture.
127
+
128
+ **Stage 4 Progress: 8/10 tasks completed**
129
+ - ✅ Comprehensive debug logging
130
+ - ✅ Improved error messages
131
+ - ✅ API key validation (including HF_TOKEN)
132
+ - ✅ Tool execution error handling
133
+ - ✅ Fallback tool execution (keyword matching)
134
+ - ✅ LLM exception handling (3-tier fallback)
135
+ - ✅ Diagnostics display in Gradio UI
136
+ - ✅ Documentation in dev log (this file)
137
+ - ⏳ Update README with API key setup instructions
138
+ - ⏳ Deploy to HF Space and run GAIA validation (5/20 target)
139
+
140
+ ---
141
+
142
+ ## Learnings and Insights
143
+
144
+ ### **Pattern: Free-First Fallback Architecture**
145
+
146
+ **What worked well:**
147
+ - Prioritizing free tiers (Gemini → HuggingFace) before paid tier (Claude) maximizes cost efficiency
148
+ - Multiple free alternatives with different quota models (daily vs rate-limited) provide better resilience than single free tier
149
+ - Keyword fallback ensures agent never completely fails even when all LLMs unavailable
150
+
151
+ **Reusable pattern:**
152
+ ```python
153
+ def unified_llm_function(...):
154
+ """3-tier fallback with comprehensive error capture"""
155
+ errors = []
156
+
157
+ try:
158
+ return free_tier_1(...) # Gemini - daily quota
159
+ except Exception as e1:
160
+ errors.append(f"Tier 1: {e1}")
161
+ try:
162
+ return free_tier_2(...) # HuggingFace - rate limited
163
+ except Exception as e2:
164
+ errors.append(f"Tier 2: {e2}")
165
+ try:
166
+ return paid_tier(...) # Claude - credits
167
+ except Exception as e3:
168
+ errors.append(f"Tier 3: {e3}")
169
+ # Deterministic fallback as last resort
170
+ return keyword_fallback(...)
171
+ ```
172
+
173
+ ### **Pattern: Function Calling Schema Compatibility**
174
+
175
+ **Critical insight:** Different LLM providers require different function calling schemas:
176
+
177
+ 1. **Gemini** - `genai.protos.Tool` with `function_declarations`:
178
+ ```python
179
+ Tool(function_declarations=[
180
+ FunctionDeclaration(
181
+ name="search_web",
182
+ description="...",
183
+ parameters={
184
+ "type": "object",
185
+ "properties": {"query": {"type": "string", "description": "..."}},
186
+ "required": ["query"]
187
+ }
188
+ )
189
+ ])
190
+ ```
191
+
192
+ 2. **HuggingFace** - OpenAI-compatible tools array:
193
+ ```python
194
+ tools = [{
195
+ "type": "function",
196
+ "function": {
197
+ "name": "search_web",
198
+ "description": "...",
199
+ "parameters": {
200
+ "type": "object",
201
+ "properties": {"query": {"type": "string", "description": "..."}},
202
+ "required": ["query"]
203
+ }
204
+ }
205
+ }]
206
+ ```
207
+
208
+ 3. **Claude** - Anthropic native format (simplified):
209
+ ```python
210
+ tools = [{
211
+ "name": "search_web",
212
+ "description": "...",
213
+ "input_schema": {
214
+ "type": "object",
215
+ "properties": {"query": {"type": "string", "description": "..."}},
216
+ "required": ["query"]
217
+ }
218
+ }]
219
+ ```
220
+
221
+ **Best practice:** Maintain single source of truth in `src/tools/__init__.py` with rich schema (dict format with type/description), then transform to provider-specific format in LLM client functions.
222
+
223
+ ### **Pattern: Environment Validation at Startup**
224
+
225
+ **What worked well:**
226
+ - Validating all API keys at agent initialization (not at first use) provides immediate feedback
227
+ - Clear warnings listing missing keys help users diagnose setup issues
228
+ - Non-blocking warnings (continue anyway) allow testing with partial configuration
229
+
230
+ **Implementation:**
231
+ ```python
232
+ def validate_environment() -> List[str]:
233
+ """Check API keys at startup, return list of missing keys"""
234
+ missing = []
235
+ for key_name in ["GOOGLE_API_KEY", "HF_TOKEN", "ANTHROPIC_API_KEY", "TAVILY_API_KEY"]:
236
+ if not os.getenv(key_name):
237
+ missing.append(key_name)
238
+
239
+ if missing:
240
+ logger.warning(f"⚠️ Missing API keys: {', '.join(missing)}")
241
+ else:
242
+ logger.info("✓ All API keys configured")
243
+
244
+ return missing
245
+ ```
246
+
247
+ ### **What to avoid:**
248
+
249
+ **Anti-pattern: List-based parameter schemas**
250
+ ```python
251
+ # WRONG - breaks LLM function calling
252
+ "parameters": ["query", "max_results"]
253
+
254
+ # CORRECT - works with all providers
255
+ "parameters": {
256
+ "query": {"type": "string", "description": "..."},
257
+ "max_results": {"type": "integer", "description": "..."}
258
+ }
259
+ ```
260
+
261
+ **Why it breaks:** LLM clients iterate over `parameters.items()` to extract type/description metadata. List has no `.items()` method.
262
+
263
+ ---
264
+
265
+ ## Changelog
266
+
267
+ **Session Date:** 2026-01-03
268
+
269
+ ### Modified Files
270
+
271
+ 1. **src/agent/llm_client.py** (~150 lines added)
272
+ - Added `create_hf_client()` - Initialize HuggingFace InferenceClient with HF_TOKEN
273
+ - Added `plan_question_hf(question, available_tools, file_paths)` - Planning with Qwen 2.5 72B
274
+ - Added `select_tools_hf(question, plan, available_tools)` - Function calling with OpenAI-compatible tools format
275
+ - Added `synthesize_answer_hf(question, evidence)` - Answer synthesis from evidence
276
+ - Updated `plan_question()` - Added HuggingFace as middle fallback tier (Gemini → HF → Claude)
277
+ - Updated `select_tools_with_function_calling()` - Added HuggingFace as middle fallback tier
278
+ - Updated `synthesize_answer()` - Added HuggingFace as middle fallback tier
279
+ - Added CONFIG constant: `HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"`
280
+ - Added import: `from huggingface_hub import InferenceClient`
281
+
282
+ 2. **src/agent/graph.py**
283
+ - Updated `validate_environment()` - Added HF_TOKEN to API key validation check
284
+ - Updated startup logging - Shows ⚠️ WARNING if HF_TOKEN missing
285
+
286
+ 3. **app.py**
287
+ - Updated `check_api_keys()` - Added HF_TOKEN status display in Test & Debug tab
288
+ - UI now shows: "HF_TOKEN (HuggingFace): ✓ SET" or "✗ MISSING"
289
+
290
+ 4. **src/tools/__init__.py** (Fixed earlier in session)
291
+ - Fixed TOOLS schema bug - Changed parameters from list to dict format
292
+ - Updated all tool definitions to include type/description for each parameter
293
+ - Added `"required_params"` field to specify required parameters
294
+ - Fixed Gemini function calling compatibility
295
+
296
+ ### Dependencies
297
+
298
+ **No changes to requirements.txt** - `huggingface-hub>=0.26.0` already present from initial setup.
299
+
300
+ ### Test Results
301
+
302
+ All tests passing with new 3-tier fallback architecture:
303
+ ```bash
304
+ uv run pytest test/ -q
305
+ ======================== 99 passed, 11 warnings in 51.99s ========================
306
+ ```
307
+
308
+ ### Next Steps
309
+
310
+ 1. **User action:** Set up HF_TOKEN in HuggingFace Space environment variables (in progress)
311
+ 2. **Update README:** Add API key setup instructions for all 4 providers (Gemini, HuggingFace, Claude, Tavily)
312
+ 3. **Deploy to HF Space:** Test with real GAIA validation questions
313
+ 4. **Target:** Achieve 5/20 GAIA questions answered correctly (up from 0/20)
src/agent/graph.py CHANGED
@@ -14,11 +14,16 @@ Based on:
14
  """
15
 
16
  import logging
 
17
  from typing import TypedDict, List, Optional
18
  from langgraph.graph import StateGraph, END
19
  from src.config import Settings
20
  from src.tools import TOOLS, search, parse_file, safe_eval, analyze_image
21
- from src.agent.llm_client import plan_question, select_tools_with_function_calling, synthesize_answer
 
 
 
 
22
 
23
  # ============================================================================
24
  # Logging Setup
@@ -29,26 +34,129 @@ logger = logging.getLogger(__name__)
29
  # Agent State Definition
30
  # ============================================================================
31
 
 
32
  class AgentState(TypedDict):
33
  """
34
  State structure for GAIA agent workflow.
35
 
36
  Tracks question processing from input through planning, execution, to final answer.
37
  """
38
- question: str # Input question from GAIA
 
39
  file_paths: Optional[List[str]] # Optional file paths for file-based questions
40
- plan: Optional[str] # Generated execution plan (Stage 3)
41
- tool_calls: List[dict] # Tool invocation tracking (Stage 3)
42
- tool_results: List[dict] # Tool execution results (Stage 3)
43
- evidence: List[str] # Evidence collected from tools (Stage 3)
44
- answer: Optional[str] # Final factoid answer
45
- errors: List[str] # Error messages from failures
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46
 
47
 
48
  # ============================================================================
49
  # Graph Node Functions (Placeholders for Stage 1)
50
  # ============================================================================
51
 
 
52
  def plan_node(state: AgentState) -> AgentState:
53
  """
54
  Planning node: Analyze question and generate execution plan.
@@ -64,24 +172,30 @@ def plan_node(state: AgentState) -> AgentState:
64
  Returns:
65
  Updated state with execution plan
66
  """
67
- logger.info(f"[plan_node] Question received: {state['question'][:100]}...")
 
 
 
68
 
69
  try:
70
  # Stage 3: Use LLM to generate dynamic execution plan
 
71
  plan = plan_question(
72
  question=state["question"],
73
  available_tools=TOOLS,
74
- file_paths=state.get("file_paths")
75
  )
76
 
77
  state["plan"] = plan
78
- logger.info(f"[plan_node] Plan created ({len(plan)} chars)")
 
79
 
80
  except Exception as e:
81
- logger.error(f"[plan_node] Planning failed: {e}")
82
- state["errors"].append(f"Planning error: {str(e)}")
83
  state["plan"] = "Error: Unable to create plan"
84
 
 
85
  return state
86
 
87
 
@@ -101,35 +215,53 @@ def execute_node(state: AgentState) -> AgentState:
101
  Returns:
102
  Updated state with tool execution results and evidence
103
  """
104
- logger.info(f"[execute_node] Executing tools - Plan: {state['plan'][:100]}...")
 
 
105
 
106
  # Map tool names to actual functions
107
  TOOL_FUNCTIONS = {
108
  "search": search,
109
  "parse_file": parse_file,
110
  "safe_eval": safe_eval,
111
- "analyze_image": analyze_image
112
  }
113
 
 
 
 
 
 
114
  try:
115
  # Stage 3: Use LLM function calling to select tools and extract parameters
 
116
  tool_calls = select_tools_with_function_calling(
117
- question=state["question"],
118
- plan=state["plan"],
119
- available_tools=TOOLS
120
  )
121
 
122
- logger.info(f"[execute_node] LLM selected {len(tool_calls)} tool(s) to execute")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
123
 
124
  # Execute each tool call
125
- tool_results = []
126
- evidence = []
127
-
128
- for tool_call in tool_calls:
129
  tool_name = tool_call["tool"]
130
  params = tool_call["params"]
131
 
132
- logger.info(f"[execute_node] Executing {tool_name} with params: {params}")
 
133
 
134
  try:
135
  # Get tool function
@@ -138,42 +270,84 @@ def execute_node(state: AgentState) -> AgentState:
138
  raise ValueError(f"Tool '{tool_name}' not found in TOOL_FUNCTIONS")
139
 
140
  # Execute tool
 
141
  result = tool_func(**params)
 
 
142
 
143
  # Store result
144
- tool_results.append({
145
- "tool": tool_name,
146
- "params": params,
147
- "result": result,
148
- "status": "success"
149
- })
 
 
150
 
151
  # Extract evidence
152
  evidence.append(f"[{tool_name}] {result}")
153
 
154
- logger.info(f"[execute_node] {tool_name} executed successfully")
155
-
156
  except Exception as tool_error:
157
- logger.error(f"[execute_node] Tool {tool_name} failed: {tool_error}")
158
- tool_results.append({
159
- "tool": tool_name,
160
- "params": params,
161
- "error": str(tool_error),
162
- "status": "failed"
163
- })
164
- state["errors"].append(f"Tool {tool_name} failed: {str(tool_error)}")
165
-
166
- # Update state
167
- state["tool_calls"] = tool_calls
168
- state["tool_results"] = tool_results
169
- state["evidence"] = evidence
170
-
171
- logger.info(f"[execute_node] Executed {len(tool_results)} tool(s), collected {len(evidence)} evidence items")
172
 
173
  except Exception as e:
174
- logger.error(f"[execute_node] Execution failed: {e}")
175
- state["errors"].append(f"Execution error: {str(e)}")
176
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
177
  return state
178
 
179
 
@@ -192,29 +366,40 @@ def answer_node(state: AgentState) -> AgentState:
192
  Returns:
193
  Updated state with final factoid answer
194
  """
195
- logger.info(f"[answer_node] Processing {len(state['evidence'])} evidence items")
 
 
 
 
 
196
 
197
  try:
198
  # Check if we have evidence
199
  if not state["evidence"]:
200
- logger.warning("[answer_node] No evidence collected, cannot generate answer")
201
- state["answer"] = "Unable to answer: No evidence collected"
 
 
 
 
 
202
  return state
203
 
204
  # Stage 3: Use LLM to synthesize factoid answer from evidence
 
205
  answer = synthesize_answer(
206
- question=state["question"],
207
- evidence=state["evidence"]
208
  )
209
 
210
  state["answer"] = answer
211
- logger.info(f"[answer_node] Answer generated: {answer}")
212
 
213
  except Exception as e:
214
- logger.error(f"[answer_node] Answer synthesis failed: {e}")
215
- state["errors"].append(f"Answer synthesis error: {str(e)}")
216
- state["answer"] = "Error: Unable to generate answer"
217
 
 
218
  return state
219
 
220
 
@@ -222,6 +407,7 @@ def answer_node(state: AgentState) -> AgentState:
222
  # StateGraph Construction
223
  # ============================================================================
224
 
 
225
  def create_gaia_graph() -> StateGraph:
226
  """
227
  Create LangGraph StateGraph for GAIA agent.
@@ -259,6 +445,7 @@ def create_gaia_graph() -> StateGraph:
259
  # Agent Wrapper Class
260
  # ============================================================================
261
 
 
262
  class GAIAAgent:
263
  """
264
  GAIA Benchmark Agent - Main interface.
@@ -270,7 +457,19 @@ class GAIAAgent:
270
  def __init__(self):
271
  """Initialize agent and compile StateGraph."""
272
  print("GAIAAgent initializing...")
 
 
 
 
 
 
 
 
 
 
 
273
  self.graph = create_gaia_graph()
 
274
  print("GAIAAgent initialized successfully")
275
 
276
  def __call__(self, question: str) -> str:
@@ -294,12 +493,15 @@ class GAIAAgent:
294
  "tool_results": [],
295
  "evidence": [],
296
  "answer": None,
297
- "errors": []
298
  }
299
 
300
  # Invoke graph
301
  final_state = self.graph.invoke(initial_state)
302
 
 
 
 
303
  # Extract answer
304
  answer = final_state.get("answer", "Error: No answer generated")
305
  print(f"GAIAAgent returning answer: {answer}")
 
14
  """
15
 
16
  import logging
17
+ import os
18
  from typing import TypedDict, List, Optional
19
  from langgraph.graph import StateGraph, END
20
  from src.config import Settings
21
  from src.tools import TOOLS, search, parse_file, safe_eval, analyze_image
22
+ from src.agent.llm_client import (
23
+ plan_question,
24
+ select_tools_with_function_calling,
25
+ synthesize_answer,
26
+ )
27
 
28
  # ============================================================================
29
  # Logging Setup
 
34
  # Agent State Definition
35
  # ============================================================================
36
 
37
+
38
  class AgentState(TypedDict):
39
  """
40
  State structure for GAIA agent workflow.
41
 
42
  Tracks question processing from input through planning, execution, to final answer.
43
  """
44
+
45
+ question: str # Input question from GAIA
46
  file_paths: Optional[List[str]] # Optional file paths for file-based questions
47
+ plan: Optional[str] # Generated execution plan (Stage 3)
48
+ tool_calls: List[dict] # Tool invocation tracking (Stage 3)
49
+ tool_results: List[dict] # Tool execution results (Stage 3)
50
+ evidence: List[str] # Evidence collected from tools (Stage 3)
51
+ answer: Optional[str] # Final factoid answer
52
+ errors: List[str] # Error messages from failures
53
+
54
+
55
+ # ============================================================================
56
+ # Environment Validation
57
+ # ============================================================================
58
+
59
+
60
+ def validate_environment() -> List[str]:
61
+ """
62
+ Check which API keys are available at startup.
63
+
64
+ Returns:
65
+ List of missing API key names (empty if all present)
66
+ """
67
+ missing = []
68
+ if not os.getenv("GOOGLE_API_KEY"):
69
+ missing.append("GOOGLE_API_KEY (Gemini)")
70
+ if not os.getenv("HF_TOKEN"):
71
+ missing.append("HF_TOKEN (HuggingFace)")
72
+ if not os.getenv("ANTHROPIC_API_KEY"):
73
+ missing.append("ANTHROPIC_API_KEY (Claude)")
74
+ if not os.getenv("TAVILY_API_KEY"):
75
+ missing.append("TAVILY_API_KEY (Search)")
76
+ return missing
77
+
78
+
79
+ # ============================================================================
80
+ # Helper Functions
81
+ # ============================================================================
82
+
83
+
84
+ def fallback_tool_selection(question: str, plan: str) -> List[dict]:
85
+ """
86
+ MVP Fallback: Simple keyword-based tool selection when LLM fails.
87
+
88
+ This is a temporary hack to get basic functionality working.
89
+ Uses simple keyword matching to select tools.
90
+
91
+ Args:
92
+ question: The user question
93
+ plan: The execution plan
94
+
95
+ Returns:
96
+ List of tool calls with basic parameters
97
+ """
98
+ logger.info("[fallback_tool_selection] Using keyword-based fallback for tool selection")
99
+
100
+ tool_calls = []
101
+ question_lower = question.lower()
102
+ plan_lower = plan.lower()
103
+ combined = f"{question_lower} {plan_lower}"
104
+
105
+ # Search tool: keywords like "search", "find", "look up", "who", "what", "when", "where"
106
+ search_keywords = ["search", "find", "look up", "who is", "what is", "when", "where", "google"]
107
+ if any(keyword in combined for keyword in search_keywords):
108
+ # Extract search query - use first sentence or full question
109
+ query = question.split('.')[0] if '.' in question else question
110
+ tool_calls.append({
111
+ "tool": "search",
112
+ "params": {"query": query}
113
+ })
114
+ logger.info(f"[fallback_tool_selection] Added search tool with query: {query}")
115
+
116
+ # Math tool: keywords like "calculate", "compute", "+", "-", "*", "/", "="
117
+ math_keywords = ["calculate", "compute", "math", "sum", "multiply", "divide", "+", "-", "*", "/", "="]
118
+ if any(keyword in combined for keyword in math_keywords):
119
+ # Try to extract expression - look for patterns with numbers and operators
120
+ import re
121
+ # Look for mathematical expressions
122
+ expr_match = re.search(r'[\d\s\+\-\*/\(\)\.]+', question)
123
+ if expr_match:
124
+ expression = expr_match.group().strip()
125
+ tool_calls.append({
126
+ "tool": "safe_eval",
127
+ "params": {"expression": expression}
128
+ })
129
+ logger.info(f"[fallback_tool_selection] Added safe_eval tool with expression: {expression}")
130
+
131
+ # File tool: keywords like "file", "parse", "read", "csv", "json", "txt"
132
+ file_keywords = ["file", "parse", "read", "csv", "json", "txt", "document"]
133
+ if any(keyword in combined for keyword in file_keywords):
134
+ # Cannot extract filename without more info, skip for now
135
+ logger.warning("[fallback_tool_selection] File operation detected but cannot extract filename")
136
+
137
+ # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
138
+ image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
139
+ if any(keyword in combined for keyword in image_keywords):
140
+ # Cannot extract image path without more info, skip for now
141
+ logger.warning("[fallback_tool_selection] Image operation detected but cannot extract image path")
142
+
143
+ if not tool_calls:
144
+ logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")
145
+ # Default: just search the question
146
+ tool_calls.append({
147
+ "tool": "search",
148
+ "params": {"query": question}
149
+ })
150
+
151
+ logger.info(f"[fallback_tool_selection] Fallback selected {len(tool_calls)} tool(s)")
152
+ return tool_calls
153
 
154
 
155
  # ============================================================================
156
  # Graph Node Functions (Placeholders for Stage 1)
157
  # ============================================================================
158
 
159
+
160
  def plan_node(state: AgentState) -> AgentState:
161
  """
162
  Planning node: Analyze question and generate execution plan.
 
172
  Returns:
173
  Updated state with execution plan
174
  """
175
+ logger.info(f"[plan_node] ========== PLAN NODE START ==========")
176
+ logger.info(f"[plan_node] Question: {state['question']}")
177
+ logger.info(f"[plan_node] File paths: {state.get('file_paths')}")
178
+ logger.info(f"[plan_node] Available tools: {list(TOOLS.keys())}")
179
 
180
  try:
181
  # Stage 3: Use LLM to generate dynamic execution plan
182
+ logger.info(f"[plan_node] Calling plan_question() with LLM...")
183
  plan = plan_question(
184
  question=state["question"],
185
  available_tools=TOOLS,
186
+ file_paths=state.get("file_paths"),
187
  )
188
 
189
  state["plan"] = plan
190
+ logger.info(f"[plan_node] Plan created successfully ({len(plan)} chars)")
191
+ logger.debug(f"[plan_node] Plan content: {plan}")
192
 
193
  except Exception as e:
194
+ logger.error(f"[plan_node] Planning failed: {type(e).__name__}: {str(e)}", exc_info=True)
195
+ state["errors"].append(f"Planning error: {type(e).__name__}: {str(e)}")
196
  state["plan"] = "Error: Unable to create plan"
197
 
198
+ logger.info(f"[plan_node] ========== PLAN NODE END ==========")
199
  return state
200
 
201
 
 
215
  Returns:
216
  Updated state with tool execution results and evidence
217
  """
218
+ logger.info(f"[execute_node] ========== EXECUTE NODE START ==========")
219
+ logger.info(f"[execute_node] Plan: {state['plan']}")
220
+ logger.info(f"[execute_node] Question: {state['question']}")
221
 
222
  # Map tool names to actual functions
223
  TOOL_FUNCTIONS = {
224
  "search": search,
225
  "parse_file": parse_file,
226
  "safe_eval": safe_eval,
227
+ "analyze_image": analyze_image,
228
  }
229
 
230
+ # Initialize results lists
231
+ tool_results = []
232
+ evidence = []
233
+ tool_calls = []
234
+
235
  try:
236
  # Stage 3: Use LLM function calling to select tools and extract parameters
237
+ logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
238
  tool_calls = select_tools_with_function_calling(
239
+ question=state["question"], plan=state["plan"], available_tools=TOOLS
 
 
240
  )
241
 
242
+ # Validate tool_calls result
243
+ if not tool_calls:
244
+ logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
245
+ state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
246
+ # MVP HACK: Use fallback keyword-based tool selection
247
+ tool_calls = fallback_tool_selection(state["question"], state["plan"])
248
+ logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
249
+ elif not isinstance(tool_calls, list):
250
+ logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
251
+ state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
252
+ # MVP HACK: Use fallback
253
+ tool_calls = fallback_tool_selection(state["question"], state["plan"])
254
+ else:
255
+ logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
256
+ logger.debug(f"[execute_node] Tool calls: {tool_calls}")
257
 
258
  # Execute each tool call
259
+ for idx, tool_call in enumerate(tool_calls, 1):
 
 
 
260
  tool_name = tool_call["tool"]
261
  params = tool_call["params"]
262
 
263
+ logger.info(f"[execute_node] --- Tool {idx}/{len(tool_calls)}: {tool_name} ---")
264
+ logger.info(f"[execute_node] Parameters: {params}")
265
 
266
  try:
267
  # Get tool function
 
270
  raise ValueError(f"Tool '{tool_name}' not found in TOOL_FUNCTIONS")
271
 
272
  # Execute tool
273
+ logger.info(f"[execute_node] Executing {tool_name}...")
274
  result = tool_func(**params)
275
+ logger.info(f"[execute_node] ✓ {tool_name} completed successfully")
276
+ logger.debug(f"[execute_node] Result: {result[:200] if isinstance(result, str) else result}...")
277
 
278
  # Store result
279
+ tool_results.append(
280
+ {
281
+ "tool": tool_name,
282
+ "params": params,
283
+ "result": result,
284
+ "status": "success",
285
+ }
286
+ )
287
 
288
  # Extract evidence
289
  evidence.append(f"[{tool_name}] {result}")
290
 
 
 
291
  except Exception as tool_error:
292
+ logger.error(f"[execute_node] Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)
293
+ tool_results.append(
294
+ {
295
+ "tool": tool_name,
296
+ "params": params,
297
+ "error": str(tool_error),
298
+ "status": "failed",
299
+ }
300
+ )
301
+ state["errors"].append(f"Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}")
302
+
303
+ logger.info(f"[execute_node] Summary: {len(tool_results)} tool(s) executed, {len(evidence)} evidence items collected")
304
+ logger.debug(f"[execute_node] Evidence: {evidence}")
 
 
305
 
306
  except Exception as e:
307
+ logger.error(f"[execute_node] Execution failed: {type(e).__name__}: {str(e)}", exc_info=True)
308
+ state["errors"].append(f"Execution error: {type(e).__name__}: {str(e)}")
309
 
310
+ # Try fallback if we don't have any tool_calls yet
311
+ if not tool_calls:
312
+ logger.info(f"[execute_node] Attempting fallback after exception...")
313
+ try:
314
+ tool_calls = fallback_tool_selection(state["question"], state.get("plan", ""))
315
+ logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")
316
+
317
+ # Try to execute fallback tools
318
+ TOOL_FUNCTIONS = {
319
+ "search": search,
320
+ "parse_file": parse_file,
321
+ "safe_eval": safe_eval,
322
+ "analyze_image": analyze_image,
323
+ }
324
+
325
+ for tool_call in tool_calls:
326
+ try:
327
+ tool_name = tool_call["tool"]
328
+ params = tool_call["params"]
329
+ tool_func = TOOL_FUNCTIONS.get(tool_name)
330
+ if tool_func:
331
+ result = tool_func(**params)
332
+ tool_results.append({
333
+ "tool": tool_name,
334
+ "params": params,
335
+ "result": result,
336
+ "status": "success"
337
+ })
338
+ evidence.append(f"[{tool_name}] {result}")
339
+ logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
340
+ except Exception as tool_error:
341
+ logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")
342
+ except Exception as fallback_error:
343
+ logger.error(f"[execute_node] Fallback also failed: {fallback_error}")
344
+
345
+ # Always update state, even if there were errors
346
+ state["tool_calls"] = tool_calls
347
+ state["tool_results"] = tool_results
348
+ state["evidence"] = evidence
349
+
350
+ logger.info(f"[execute_node] ========== EXECUTE NODE END ==========")
351
  return state
352
 
353
 
 
366
  Returns:
367
  Updated state with final factoid answer
368
  """
369
+ logger.info(f"[answer_node] ========== ANSWER NODE START ==========")
370
+ logger.info(f"[answer_node] Evidence items collected: {len(state['evidence'])}")
371
+ logger.debug(f"[answer_node] Evidence: {state['evidence']}")
372
+ logger.info(f"[answer_node] Errors accumulated: {len(state['errors'])}")
373
+ if state["errors"]:
374
+ logger.warning(f"[answer_node] Error list: {state['errors']}")
375
 
376
  try:
377
  # Check if we have evidence
378
  if not state["evidence"]:
379
+ logger.warning(
380
+ "[answer_node] No evidence collected, cannot generate answer"
381
+ )
382
+ # Show WHY it failed - include error details
383
+ error_summary = "; ".join(state["errors"]) if state["errors"] else "No errors logged - check API keys and logs"
384
+ state["answer"] = f"ERROR: No evidence collected. Details: {error_summary}"
385
+ logger.error(f"[answer_node] Returning error answer: {state['answer']}")
386
  return state
387
 
388
  # Stage 3: Use LLM to synthesize factoid answer from evidence
389
+ logger.info(f"[answer_node] Calling synthesize_answer() with {len(state['evidence'])} evidence items...")
390
  answer = synthesize_answer(
391
+ question=state["question"], evidence=state["evidence"]
 
392
  )
393
 
394
  state["answer"] = answer
395
+ logger.info(f"[answer_node] Answer generated successfully: {answer}")
396
 
397
  except Exception as e:
398
+ logger.error(f"[answer_node] Answer synthesis failed: {type(e).__name__}: {str(e)}", exc_info=True)
399
+ state["errors"].append(f"Answer synthesis error: {type(e).__name__}: {str(e)}")
400
+ state["answer"] = f"ERROR: Answer synthesis failed - {type(e).__name__}: {str(e)}"
401
 
402
+ logger.info(f"[answer_node] ========== ANSWER NODE END ==========")
403
  return state
404
 
405
 
 
407
  # StateGraph Construction
408
  # ============================================================================
409
 
410
+
411
  def create_gaia_graph() -> StateGraph:
412
  """
413
  Create LangGraph StateGraph for GAIA agent.
 
445
  # Agent Wrapper Class
446
  # ============================================================================
447
 
448
+
449
  class GAIAAgent:
450
  """
451
  GAIA Benchmark Agent - Main interface.
 
457
  def __init__(self):
458
  """Initialize agent and compile StateGraph."""
459
  print("GAIAAgent initializing...")
460
+
461
+ # Validate environment - check API keys
462
+ missing_keys = validate_environment()
463
+ if missing_keys:
464
+ warning_msg = f"⚠️ WARNING: Missing API keys: {', '.join(missing_keys)}"
465
+ print(warning_msg)
466
+ logger.warning(warning_msg)
467
+ print(" Agent may fail to answer questions. Set keys in environment variables.")
468
+ else:
469
+ print("✓ All API keys present")
470
+
471
  self.graph = create_gaia_graph()
472
+ self.last_state = None # Store last execution state for diagnostics
473
  print("GAIAAgent initialized successfully")
474
 
475
  def __call__(self, question: str) -> str:
 
493
  "tool_results": [],
494
  "evidence": [],
495
  "answer": None,
496
+ "errors": [],
497
  }
498
 
499
  # Invoke graph
500
  final_state = self.graph.invoke(initial_state)
501
 
502
+ # Store state for diagnostics
503
+ self.last_state = final_state
504
+
505
  # Extract answer
506
  answer = final_state.get("answer", "Error: No answer generated")
507
  print(f"GAIAAgent returning answer: {answer}")
src/agent/llm_client.py CHANGED
@@ -19,6 +19,7 @@ import logging
19
  from typing import List, Dict, Optional, Any
20
  from anthropic import Anthropic
21
  import google.generativeai as genai
 
22
 
23
  # ============================================================================
24
  # CONFIG
@@ -30,6 +31,10 @@ CLAUDE_MODEL = "claude-sonnet-4-5-20250929"
30
  # Gemini Configuration
31
  GEMINI_MODEL = "gemini-2.0-flash-exp"
32
 
 
 
 
 
33
  # Shared Configuration
34
  TEMPERATURE = 0 # Deterministic for factoid answers
35
  MAX_TOKENS = 4096
@@ -64,6 +69,16 @@ def create_gemini_client():
64
  return genai.GenerativeModel(GEMINI_MODEL)
65
 
66
 
 
 
 
 
 
 
 
 
 
 
67
  # ============================================================================
68
  # Planning Functions - Claude Implementation
69
  # ============================================================================
@@ -186,6 +201,75 @@ Create an execution plan to answer this question. Format as numbered steps."""
186
  return plan
187
 
188
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
189
  def plan_question(
190
  question: str,
191
  available_tools: Dict[str, Dict],
@@ -194,8 +278,8 @@ def plan_question(
194
  """
195
  Analyze question and generate execution plan using LLM.
196
 
197
- Pattern: Try Gemini first (free tier), fallback to Claude if fails.
198
- Matches Stage 2 tool pattern (free primary, paid fallback).
199
 
200
  Args:
201
  question: GAIA question text
@@ -208,12 +292,16 @@ def plan_question(
208
  try:
209
  return plan_question_gemini(question, available_tools, file_paths)
210
  except Exception as gemini_error:
211
- logger.warning(f"[plan_question] Gemini failed: {gemini_error}, trying Claude fallback")
212
  try:
213
- return plan_question_claude(question, available_tools, file_paths)
214
- except Exception as claude_error:
215
- logger.error(f"[plan_question] Both LLMs failed. Gemini: {gemini_error}, Claude: {claude_error}")
216
- raise Exception(f"Planning failed with both LLMs. Gemini: {gemini_error}, Claude: {claude_error}")
 
 
 
 
217
 
218
 
219
  # ============================================================================
@@ -351,6 +439,89 @@ Select and call the tools needed to answer this question according to the plan."
351
  return tool_calls
352
 
353
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
354
  def select_tools_with_function_calling(
355
  question: str,
356
  plan: str,
@@ -359,7 +530,8 @@ def select_tools_with_function_calling(
359
  """
360
  Use LLM function calling to dynamically select tools and extract parameters.
361
 
362
- Pattern: Try Gemini first (free tier), fallback to Claude if fails.
 
363
 
364
  Args:
365
  question: GAIA question text
@@ -372,12 +544,16 @@ def select_tools_with_function_calling(
372
  try:
373
  return select_tools_gemini(question, plan, available_tools)
374
  except Exception as gemini_error:
375
- logger.warning(f"[select_tools] Gemini failed: {gemini_error}, trying Claude fallback")
376
  try:
377
- return select_tools_claude(question, plan, available_tools)
378
- except Exception as claude_error:
379
- logger.error(f"[select_tools] Both LLMs failed. Gemini: {gemini_error}, Claude: {claude_error}")
380
- raise Exception(f"Tool selection failed with both LLMs. Gemini: {gemini_error}, Claude: {claude_error}")
 
 
 
 
381
 
382
 
383
  # ============================================================================
@@ -495,6 +671,71 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
495
  return answer
496
 
497
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
498
  def synthesize_answer(
499
  question: str,
500
  evidence: List[str]
@@ -502,7 +743,8 @@ def synthesize_answer(
502
  """
503
  Synthesize factoid answer from collected evidence using LLM.
504
 
505
- Pattern: Try Gemini first (free tier), fallback to Claude if fails.
 
506
 
507
  Args:
508
  question: Original GAIA question
@@ -514,12 +756,16 @@ def synthesize_answer(
514
  try:
515
  return synthesize_answer_gemini(question, evidence)
516
  except Exception as gemini_error:
517
- logger.warning(f"[synthesize_answer] Gemini failed: {gemini_error}, trying Claude fallback")
518
  try:
519
- return synthesize_answer_claude(question, evidence)
520
- except Exception as claude_error:
521
- logger.error(f"[synthesize_answer] Both LLMs failed. Gemini: {gemini_error}, Claude: {claude_error}")
522
- raise Exception(f"Answer synthesis failed with both LLMs. Gemini: {gemini_error}, Claude: {claude_error}")
 
 
 
 
523
 
524
 
525
  # ============================================================================
 
19
  from typing import List, Dict, Optional, Any
20
  from anthropic import Anthropic
21
  import google.generativeai as genai
22
+ from huggingface_hub import InferenceClient
23
 
24
  # ============================================================================
25
  # CONFIG
 
31
  # Gemini Configuration
32
  GEMINI_MODEL = "gemini-2.0-flash-exp"
33
 
34
+ # HuggingFace Configuration
35
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct" # Excellent for function calling and reasoning
36
+ # Alternatives: "meta-llama/Llama-3.1-70B-Instruct", "NousResearch/Hermes-3-Llama-3.1-70B"
37
+
38
  # Shared Configuration
39
  TEMPERATURE = 0 # Deterministic for factoid answers
40
  MAX_TOKENS = 4096
 
69
  return genai.GenerativeModel(GEMINI_MODEL)
70
 
71
 
72
+ def create_hf_client() -> InferenceClient:
73
+ """Initialize HuggingFace Inference API client with token from environment."""
74
+ hf_token = os.getenv("HF_TOKEN")
75
+ if not hf_token:
76
+ raise ValueError("HF_TOKEN environment variable not set")
77
+
78
+ logger.info(f"Initializing HuggingFace Inference client with model: {HF_MODEL}")
79
+ return InferenceClient(model=HF_MODEL, token=hf_token)
80
+
81
+
82
  # ============================================================================
83
  # Planning Functions - Claude Implementation
84
  # ============================================================================
 
201
  return plan
202
 
203
 
204
+ # ============================================================================
205
+ # Planning Functions - HuggingFace Implementation
206
+ # ============================================================================
207
+
208
+ def plan_question_hf(
209
+ question: str,
210
+ available_tools: Dict[str, Dict],
211
+ file_paths: Optional[List[str]] = None
212
+ ) -> str:
213
+ """Analyze question and generate execution plan using HuggingFace Inference API."""
214
+ client = create_hf_client()
215
+
216
+ # Format tool information
217
+ tool_descriptions = []
218
+ for name, info in available_tools.items():
219
+ tool_descriptions.append(
220
+ f"- {name}: {info['description']} (Category: {info['category']})"
221
+ )
222
+ tools_text = "\n".join(tool_descriptions)
223
+
224
+ # File context
225
+ file_context = ""
226
+ if file_paths:
227
+ file_context = f"\n\nAvailable files:\n" + "\n".join([f"- {fp}" for fp in file_paths])
228
+
229
+ # System message for Qwen 2.5 (supports system/user format)
230
+ system_prompt = """You are a planning agent for answering complex questions.
231
+
232
+ Your task is to analyze the question and create a step-by-step execution plan.
233
+
234
+ Consider:
235
+ 1. What information is needed to answer the question?
236
+ 2. Which tools can provide that information?
237
+ 3. In what order should tools be executed?
238
+ 4. What parameters need to be extracted from the question?
239
+
240
+ Generate a concise plan with numbered steps."""
241
+
242
+ user_prompt = f"""Question: {question}{file_context}
243
+
244
+ Available tools:
245
+ {tools_text}
246
+
247
+ Create an execution plan to answer this question. Format as numbered steps."""
248
+
249
+ logger.info(f"[plan_question_hf] Calling HuggingFace ({HF_MODEL}) for planning")
250
+
251
+ # HuggingFace Inference API chat completion
252
+ messages = [
253
+ {"role": "system", "content": system_prompt},
254
+ {"role": "user", "content": user_prompt}
255
+ ]
256
+
257
+ response = client.chat_completion(
258
+ messages=messages,
259
+ max_tokens=MAX_TOKENS,
260
+ temperature=TEMPERATURE
261
+ )
262
+
263
+ plan = response.choices[0].message.content
264
+ logger.info(f"[plan_question_hf] Generated plan ({len(plan)} chars)")
265
+
266
+ return plan
267
+
268
+
269
+ # ============================================================================
270
+ # Unified Planning Function with Fallback Chain
271
+ # ============================================================================
272
+
273
  def plan_question(
274
  question: str,
275
  available_tools: Dict[str, Dict],
 
278
  """
279
  Analyze question and generate execution plan using LLM.
280
 
281
+ Pattern: Try Gemini first (free tier), HuggingFace (free tier), then Claude (paid) if both fail.
282
+ 3-tier fallback ensures availability even with quota limits.
283
 
284
  Args:
285
  question: GAIA question text
 
292
  try:
293
  return plan_question_gemini(question, available_tools, file_paths)
294
  except Exception as gemini_error:
295
+ logger.warning(f"[plan_question] Gemini failed: {gemini_error}, trying HuggingFace fallback")
296
  try:
297
+ return plan_question_hf(question, available_tools, file_paths)
298
+ except Exception as hf_error:
299
+ logger.warning(f"[plan_question] HuggingFace failed: {hf_error}, trying Claude fallback")
300
+ try:
301
+ return plan_question_claude(question, available_tools, file_paths)
302
+ except Exception as claude_error:
303
+ logger.error(f"[plan_question] All LLMs failed. Gemini: {gemini_error}, HF: {hf_error}, Claude: {claude_error}")
304
+ raise Exception(f"Planning failed with all LLMs. Gemini: {gemini_error}, HF: {hf_error}, Claude: {claude_error}")
305
 
306
 
307
  # ============================================================================
 
439
  return tool_calls
440
 
441
 
442
+ # ============================================================================
443
+ # Tool Selection - HuggingFace Implementation
444
+ # ============================================================================
445
+
446
+ def select_tools_hf(
447
+ question: str,
448
+ plan: str,
449
+ available_tools: Dict[str, Dict]
450
+ ) -> List[Dict[str, Any]]:
451
+ """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
452
+ client = create_hf_client()
453
+
454
+ # Convert tool registry to OpenAI-compatible tool schema (HF uses same format)
455
+ tools = []
456
+ for name, info in available_tools.items():
457
+ tool_schema = {
458
+ "type": "function",
459
+ "function": {
460
+ "name": name,
461
+ "description": info["description"],
462
+ "parameters": {
463
+ "type": "object",
464
+ "properties": {},
465
+ "required": info.get("required_params", [])
466
+ }
467
+ }
468
+ }
469
+
470
+ # Add parameter schemas
471
+ for param_name, param_info in info.get("parameters", {}).items():
472
+ tool_schema["function"]["parameters"]["properties"][param_name] = {
473
+ "type": param_info.get("type", "string"),
474
+ "description": param_info.get("description", "")
475
+ }
476
+
477
+ tools.append(tool_schema)
478
+
479
+ system_prompt = f"""You are a tool selection agent. Based on the question and execution plan, select appropriate tools to use.
480
+
481
+ Execute the plan step by step. Call the necessary tools with correct parameters extracted from the question.
482
+
483
+ Plan:
484
+ {plan}"""
485
+
486
+ user_prompt = f"""Question: {question}
487
+
488
+ Select and call the tools needed to answer this question according to the plan."""
489
+
490
+ logger.info(f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools")
491
+
492
+ messages = [
493
+ {"role": "system", "content": system_prompt},
494
+ {"role": "user", "content": user_prompt}
495
+ ]
496
+
497
+ # HuggingFace Inference API with tools parameter
498
+ response = client.chat_completion(
499
+ messages=messages,
500
+ tools=tools,
501
+ max_tokens=MAX_TOKENS,
502
+ temperature=TEMPERATURE
503
+ )
504
+
505
+ # Extract tool calls from response
506
+ tool_calls = []
507
+ if hasattr(response.choices[0].message, 'tool_calls') and response.choices[0].message.tool_calls:
508
+ for tool_call in response.choices[0].message.tool_calls:
509
+ import json
510
+ tool_calls.append({
511
+ "tool": tool_call.function.name,
512
+ "params": json.loads(tool_call.function.arguments),
513
+ "id": tool_call.id
514
+ })
515
+
516
+ logger.info(f"[select_tools_hf] HuggingFace selected {len(tool_calls)} tool(s)")
517
+
518
+ return tool_calls
519
+
520
+
521
+ # ============================================================================
522
+ # Unified Tool Selection with Fallback Chain
523
+ # ============================================================================
524
+
525
  def select_tools_with_function_calling(
526
  question: str,
527
  plan: str,
 
530
  """
531
  Use LLM function calling to dynamically select tools and extract parameters.
532
 
533
+ Pattern: Try Gemini first (free tier), HuggingFace (free tier), then Claude (paid) if both fail.
534
+ 3-tier fallback ensures availability even with quota limits.
535
 
536
  Args:
537
  question: GAIA question text
 
544
  try:
545
  return select_tools_gemini(question, plan, available_tools)
546
  except Exception as gemini_error:
547
+ logger.warning(f"[select_tools] Gemini failed: {gemini_error}, trying HuggingFace fallback")
548
  try:
549
+ return select_tools_hf(question, plan, available_tools)
550
+ except Exception as hf_error:
551
+ logger.warning(f"[select_tools] HuggingFace failed: {hf_error}, trying Claude fallback")
552
+ try:
553
+ return select_tools_claude(question, plan, available_tools)
554
+ except Exception as claude_error:
555
+ logger.error(f"[select_tools] All LLMs failed. Gemini: {gemini_error}, HF: {hf_error}, Claude: {claude_error}")
556
+ raise Exception(f"Tool selection failed with all LLMs. Gemini: {gemini_error}, HF: {hf_error}, Claude: {claude_error}")
557
 
558
 
559
  # ============================================================================
 
671
  return answer
672
 
673
 
674
+ # ============================================================================
675
+ # Answer Synthesis - HuggingFace Implementation
676
+ # ============================================================================
677
+
678
+ def synthesize_answer_hf(
679
+ question: str,
680
+ evidence: List[str]
681
+ ) -> str:
682
+ """Synthesize factoid answer from evidence using HuggingFace Inference API."""
683
+ client = create_hf_client()
684
+
685
+ # Format evidence
686
+ evidence_text = "\n\n".join([f"Evidence {i+1}:\n{e}" for i, e in enumerate(evidence)])
687
+
688
+ system_prompt = """You are an answer synthesis agent for the GAIA benchmark.
689
+
690
+ Your task is to extract a factoid answer from the provided evidence.
691
+
692
+ CRITICAL - Answer format requirements:
693
+ 1. Answers must be factoids: a number, a few words, or a comma-separated list
694
+ 2. Be concise - no explanations, just the answer
695
+ 3. If evidence conflicts, evaluate source credibility and recency
696
+ 4. If evidence is insufficient, state "Unable to answer"
697
+
698
+ Examples of good factoid answers:
699
+ - "42"
700
+ - "Paris"
701
+ - "Albert Einstein"
702
+ - "red, blue, green"
703
+ - "1969-07-20"
704
+
705
+ Examples of bad answers (too verbose):
706
+ - "The answer is 42 because..."
707
+ - "Based on the evidence, it appears that..."
708
+ """
709
+
710
+ user_prompt = f"""Question: {question}
711
+
712
+ {evidence_text}
713
+
714
+ Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""
715
+
716
+ logger.info(f"[synthesize_answer_hf] Calling HuggingFace for answer synthesis")
717
+
718
+ messages = [
719
+ {"role": "system", "content": system_prompt},
720
+ {"role": "user", "content": user_prompt}
721
+ ]
722
+
723
+ response = client.chat_completion(
724
+ messages=messages,
725
+ max_tokens=256, # Factoid answers are short
726
+ temperature=TEMPERATURE
727
+ )
728
+
729
+ answer = response.choices[0].message.content.strip()
730
+ logger.info(f"[synthesize_answer_hf] Generated answer: {answer}")
731
+
732
+ return answer
733
+
734
+
735
+ # ============================================================================
736
+ # Unified Answer Synthesis with Fallback Chain
737
+ # ============================================================================
738
+
739
  def synthesize_answer(
740
  question: str,
741
  evidence: List[str]
 
743
  """
744
  Synthesize factoid answer from collected evidence using LLM.
745
 
746
+ Pattern: Try Gemini first (free tier), HuggingFace (free tier), then Claude (paid) if both fail.
747
+ 3-tier fallback ensures availability even with quota limits.
748
 
749
  Args:
750
  question: Original GAIA question
 
756
  try:
757
  return synthesize_answer_gemini(question, evidence)
758
  except Exception as gemini_error:
759
+ logger.warning(f"[synthesize_answer] Gemini failed: {gemini_error}, trying HuggingFace fallback")
760
  try:
761
+ return synthesize_answer_hf(question, evidence)
762
+ except Exception as hf_error:
763
+ logger.warning(f"[synthesize_answer] HuggingFace failed: {hf_error}, trying Claude fallback")
764
+ try:
765
+ return synthesize_answer_claude(question, evidence)
766
+ except Exception as claude_error:
767
+ logger.error(f"[synthesize_answer] All LLMs failed. Gemini: {gemini_error}, HF: {hf_error}, Claude: {claude_error}")
768
+ raise Exception(f"Answer synthesis failed with all LLMs. Gemini: {gemini_error}, HF: {hf_error}, Claude: {claude_error}")
769
 
770
 
771
  # ============================================================================
src/tools/__init__.py CHANGED
@@ -17,29 +17,62 @@ from src.tools.calculator import safe_eval
17
  from src.tools.vision import analyze_image, analyze_image_gemini, analyze_image_claude
18
 
19
  # Tool registry with metadata
 
20
  TOOLS = {
21
  "web_search": {
22
  "function": search,
23
  "description": "Search the web using Tavily or Exa APIs with fallback",
24
- "parameters": ["query", "max_results"],
 
 
 
 
 
 
 
 
 
 
25
  "category": "information_retrieval",
26
  },
27
  "parse_file": {
28
  "function": parse_file,
29
  "description": "Parse files (PDF, Excel, Word, Text, CSV) and extract content",
30
- "parameters": ["file_path"],
 
 
 
 
 
 
31
  "category": "file_processing",
32
  },
33
  "calculator": {
34
  "function": safe_eval,
35
  "description": "Safely evaluate mathematical expressions",
36
- "parameters": ["expression"],
 
 
 
 
 
 
37
  "category": "computation",
38
  },
39
  "vision": {
40
  "function": analyze_image,
41
  "description": "Analyze images using multimodal LLMs (Gemini/Claude)",
42
- "parameters": ["image_path", "question"],
 
 
 
 
 
 
 
 
 
 
43
  "category": "multimodal",
44
  },
45
  }
 
17
  from src.tools.vision import analyze_image, analyze_image_gemini, analyze_image_claude
18
 
19
  # Tool registry with metadata
20
+ # Schema matches LLM function calling requirements (parameters as dict, not list)
21
  TOOLS = {
22
  "web_search": {
23
  "function": search,
24
  "description": "Search the web using Tavily or Exa APIs with fallback",
25
+ "parameters": {
26
+ "query": {
27
+ "description": "Search query string",
28
+ "type": "string"
29
+ },
30
+ "max_results": {
31
+ "description": "Maximum number of search results to return (default: 5)",
32
+ "type": "integer"
33
+ }
34
+ },
35
+ "required_params": ["query"],
36
  "category": "information_retrieval",
37
  },
38
  "parse_file": {
39
  "function": parse_file,
40
  "description": "Parse files (PDF, Excel, Word, Text, CSV) and extract content",
41
+ "parameters": {
42
+ "file_path": {
43
+ "description": "Absolute or relative path to the file to parse",
44
+ "type": "string"
45
+ }
46
+ },
47
+ "required_params": ["file_path"],
48
  "category": "file_processing",
49
  },
50
  "calculator": {
51
  "function": safe_eval,
52
  "description": "Safely evaluate mathematical expressions",
53
+ "parameters": {
54
+ "expression": {
55
+ "description": "Mathematical expression to evaluate (e.g., '2 + 2', 'sqrt(16)')",
56
+ "type": "string"
57
+ }
58
+ },
59
+ "required_params": ["expression"],
60
  "category": "computation",
61
  },
62
  "vision": {
63
  "function": analyze_image,
64
  "description": "Analyze images using multimodal LLMs (Gemini/Claude)",
65
+ "parameters": {
66
+ "image_path": {
67
+ "description": "Path to the image file to analyze",
68
+ "type": "string"
69
+ },
70
+ "question": {
71
+ "description": "Question to ask about the image (optional, defaults to 'Describe this image')",
72
+ "type": "string"
73
+ }
74
+ },
75
+ "required_params": ["image_path"],
76
  "category": "multimodal",
77
  },
78
  }