Paperbag committed on
Commit 09f0257 · 1 Parent(s): 8d79810
.opencode/plans/gaia_improvements.md ADDED
@@ -0,0 +1,131 @@
# GAIA Agent Improvements - Implementation Guide

## Phase 1: Update LLM Tiers (Lines 63-69 in agent.py)

Replace the `tiers_config` list with:

```python
tiers_config = [
    {"name": "OpenRouter-FreeRouter", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "openrouter/free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "DeepSeek-R1", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "deepseek/deepseek-r1:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Qwen3-Next-80B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-next-80b-a3b-instruct:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "NVIDIA-Nemotron-Super", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-3-super-120b-a12b:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Gemma-3-27B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Gemini-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash", "alternatives": gemini_alternatives},
    {"name": "Groq", "key": "GROQ_API_KEY", "provider": "groq", "model_name": "llama-3.3-70b-versatile"},
]
```
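For context on how these tiers are consumed: `smart_invoke` walks the list in order and falls back to the next tier on failure. A minimal sketch of the key-availability part of that selection — `pick_tier` is a hypothetical name, and the trimmed config list is illustrative, not the full entries above:

```python
import os

# Trimmed, illustrative subset of the tiers above (names match, other fields omitted).
tiers_config = [
    {"name": "OpenRouter-FreeRouter", "key": "OPENROUTER_API_KEY"},
    {"name": "Gemini-Flash", "key": "GOOGLE_API_KEY"},
    {"name": "Groq", "key": "GROQ_API_KEY"},
]

def pick_tier(configs, start_tier=0):
    """Return (index, name) of the first tier from start_tier whose API key is set."""
    for i in range(start_tier, len(configs)):
        if os.environ.get(configs[i]["key"]):
            return i, configs[i]["name"]
    raise RuntimeError("No provider with a configured API key")
```

Note that every OpenRouter entry shares `OPENROUTER_API_KEY`, so one missing key skips five tiers at once; that is why keeping the Gemini and Groq fallbacks matters.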

## Phase 2: Update Vision Models (Lines 180-186 in agent.py)

Replace the vision model configs with:

```python
configs = [
    {"name": "OpenRouter-Qwen3-VL", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-vl-235b-thinking:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "NVIDIA-Nemotron-VL", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-nano-2-vl:free", "base_url": "https://integrate.api.nvidia.com/v1"},
    {"name": "OpenRouter-Gemma-3-27b-it", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Google-Gemini-2.0-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash"},
    {"name": "Google-Gemini-Flash-Latest", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-flash-latest"},
]
```
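`get_vision_models` then turns this config list into the actual fallback chain, skipping entries whose API key is absent (the `models = []` / `for cfg in configs:` loop in agent.py). A hypothetical distilled version of that filter, with a shortened config list for illustration:

```python
import os

# Illustrative subset of the vision configs above (other fields omitted).
configs = [
    {"name": "OpenRouter-Qwen3-VL", "key": "OPENROUTER_API_KEY"},
    {"name": "Google-Gemini-2.0-Flash", "key": "GOOGLE_API_KEY"},
    {"name": "Google-Gemini-Flash-Latest", "key": "GOOGLE_API_KEY"},
]

def available_models(configs):
    """Keep only configs whose API key is present, preserving preference order."""
    return [cfg["name"] for cfg in configs if os.environ.get(cfg["key"])]
```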

## Phase 3: Enhance System Prompt (Lines 495-506 in agent.py)

Replace the system prompt in the answer_message function with:

```python
prompt = [SystemMessage(f"""
You are a master of the GAIA benchmark, a general AI assistant designed to solve complex multi-step tasks.
Think carefully and logically. Use your tools effectively. Use your internal monologue to plan your steps.

TODAY'S EXACT DATE is {current_date}. Keep this in mind for all time-sensitive queries.

CRITICAL RULES:
1. If you see a path like `[Attached File Local Path: ...]` followed by an image, video, or audio file, YOU MUST USE THE CORRESPONDING TOOL (analyze_image, analyze_video, analyze_audio) IMMEDIATELY in your next step.
2. Plan your steps ahead. 12 steps is your LIMIT for the reasoning loop, so make every step count.
3. If a tool fails (e.g., 429 or 402), the system will automatically try another model for you, so just keep going!
4. Be concise and accurate. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list.
5. CHAIN-OF-THOUGHT: For complex questions, show your reasoning step by step before giving the final answer.
6. USE TOOLS AGGRESSIVELY: If a question requires computation, file reading, or web search, use the appropriate tools - don't try to answer from memory.
7. VERIFY YOUR ANSWER: Double-check calculations and facts using tools when uncertain.
""")]
```
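The `{current_date}` placeholder is filled from the line already present just above the prompt in agent.py:

```python
import datetime

# Same expression agent.py uses to stamp the prompt with today's date.
current_date = datetime.datetime.now().strftime("%Y-%m-%d")
```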

## Phase 4: Increase Max Reasoning Steps (Line 515)

Change:

```python
max_steps = 8
```

To:

```python
max_steps = 12
```
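Why the cap matters can be seen with a toy step-capped loop — `react_loop` and `step_fn` are illustrative names, not the agent's actual control flow:

```python
def react_loop(step_fn, max_steps=12):
    """Call step_fn until it signals completion or the step budget runs out."""
    for step in range(max_steps):
        done, result = step_fn(step)
        if done:
            return result
    return None  # budget exhausted without a final answer
```

A task that needs, say, 10 reasoning steps fails under the old cap of 8 but succeeds under 12.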

## Phase 5: Fix Answer Extraction (Lines 564-575)

Replace the formatting system message with:

```python
formatting_sys = SystemMessage(
    content=(
        "You are a strict output formatter for the GAIA benchmark. "
        "Given a verbose draft answer, extract ONLY the final exact answer required. "
        "Return nothing else. DO NOT include prefixes like 'The answer is'. "
        "Strip trailing whitespace only. "
        "If the answer is a number, just return the number. "
        "If the answer is a list or set of elements, return them as a COMMA-SEPARATED list (e.g., 'a, b, c'). "
        "Preserve necessary punctuation within answers (e.g., 'Dr. Smith' should keep the period)."
    )
)
```
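The extraction itself is done by an LLM, but a rough local approximation of these rules (`normalize_answer` is purely illustrative, not part of agent.py) makes the intended behavior concrete: wrappers and trailing whitespace go, punctuation inside the answer stays.

```python
def normalize_answer(raw: str) -> str:
    """Illustrative approximation of the formatting rules: strip common
    prefixes and surrounding whitespace, keep punctuation inside the answer."""
    ans = raw.strip()
    for prefix in ("the answer is", "final answer:", "answer:"):
        if ans.lower().startswith(prefix):
            ans = ans[len(prefix):].lstrip(" :")
    return ans
```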

## Phase 6: Tool Improvements

### 6a. Web Search Retry (Lines 146-152 in web_search tool)

Replace the `web_search` tool body with the following (make sure `time` is imported at the top of agent.py, since the backoff uses `time.sleep`):

```python
@tool
def web_search(keywords: str) -> str:
    """
    Uses DuckDuckGo to search the web and return the top 5 results.
    """
    max_retries = 3
    for attempt in range(max_retries):
        try:
            with DDGS() as ddgs:
                output = ""
                results = ddgs.text(keywords, max_results=5)
                for result in results:
                    output += f"Results: {result['title']}\n{result['body']}\n{result['href']}\n\n"
            return output
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s
                continue
            return f"Search failed after {max_retries} attempts: {str(e)}"
```
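The same retry pattern generalizes to any flaky call. A hypothetical `with_retries` helper — not part of agent.py, with a `delay_base` parameter added so the backoff can be shortened in tests — captures the logic:

```python
import time

def with_retries(fn, max_retries=3, delay_base=1.0):
    """Retry fn, sleeping delay_base * 2**attempt between attempts
    (1s then 2s with the defaults), mirroring the web_search pattern."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(delay_base * (2 ** attempt))
                continue
            return f"failed after {max_retries} attempts: {e}"
```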

### 6b. Python Script Timeout (Line 401)

Change:

```python
timeout=30
```

To:

```python
timeout=60
```

Also update the matching message on the `subprocess.TimeoutExpired` branch to say 60 seconds instead of 30.
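The behavior being tuned here follows `subprocess.run` semantics: once the limit elapses, `subprocess.TimeoutExpired` is raised. A self-contained sketch mirroring run_python_script's shape — `run_snippet` is a hypothetical name, and `sys.executable` is used instead of a bare `"python"` so it works regardless of PATH:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: int = 60) -> str:
    """Run a code string in a fresh interpreter with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout or "Script executed successfully with no output."
    except subprocess.TimeoutExpired:
        return f"Script execution timed out after {timeout} seconds."
```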

---

## Expected Impact

- Phase 1-2 (Models): +8-12 questions (40-60% improvement)
- Phase 3-4 (Prompt): +2-4 questions (10-20% improvement)
- Phase 5-6 (Tools): +1-2 questions (5-10% improvement)

Total: From 2/20 (10%) to 13-18/20 (65-90%)
__pycache__/agent.cpython-310.pyc ADDED
Binary file (20.7 kB).
 
__pycache__/agent.cpython-312.pyc CHANGED
Binary files a/__pycache__/agent.cpython-312.pyc and b/__pycache__/agent.cpython-312.pyc differ
 
agent.py CHANGED
@@ -61,11 +61,13 @@ def smart_invoke(msgs, use_tools=False, start_tier=0):
     gemini_alternatives = ["gemini-2.5-flash", "gemini-2.0-flash", "gemini-flash-latest", "gemini-pro-latest"]
 
     tiers_config = [
-        {"name": "OpenRouter", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "minimax/minimax-m2.5:free", "base_url": "https://openrouter.ai/api/v1"},
-        {"name": "Gemini", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash", "alternatives": gemini_alternatives},
+        {"name": "OpenRouter-FreeRouter", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "openrouter/free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "DeepSeek-R1", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "deepseek/deepseek-r1:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "Qwen3-Next-80B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-next-80b-a3b-instruct:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "NVIDIA-Nemotron-Super", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-3-super-120b-a12b:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "Gemma-3-27B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "Gemini-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash", "alternatives": gemini_alternatives},
         {"name": "Groq", "key": "GROQ_API_KEY", "provider": "groq", "model_name": "llama-3.3-70b-versatile"},
-        {"name": "NVIDIA", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "meta/llama-3.3-70b-instruct", "base_url": "https://integrate.api.nvidia.com/v1"},
-        {"name": "Vercel", "key": "VERCEL_API_KEY", "provider": "openai", "model_name": "meta-llama/llama-3.3-70b-instruct", "base_url": "https://gateway.ai.vercel.com/v1"},
     ]
 
     last_exception = None
@@ -137,19 +139,26 @@ def web_search(keywords: str) -> str:
     - Finding organisation information
     - Obtain the latest news
 
-    Args:
-        keywords: keywords used to search the web
+    Args:
+        keywords: keywords used to search the web
 
-    Returns:
-        Search result (Header + body + url)
-    """
-    with DDGS() as ddgs:
-        # Perform a text search
-        output = ""
-        results = ddgs.text(keywords, max_results = 5)
-        for result in results:
-            output += f"Results: {result['title']}\n{result['body']}\n{result['href']}\n\n"
-    return(output)
+    Returns:
+        Search result (Header + body + url)
+    """
+    max_retries = 3
+    for attempt in range(max_retries):
+        try:
+            with DDGS() as ddgs:
+                output = ""
+                results = ddgs.text(keywords, max_results = 5)
+                for result in results:
+                    output += f"Results: {result['title']}\n{result['body']}\n{result['href']}\n\n"
+            return output
+        except Exception as e:
+            if attempt < max_retries - 1:
+                time.sleep(2 ** attempt)
+                continue
+            return f"Search failed after {max_retries} attempts: {str(e)}"
 
 @tool
 def wiki_search(query: str) -> str:
@@ -178,11 +187,11 @@ def wiki_search(query: str) -> str:
 def get_vision_models():
     """Returns a list of vision models to try, in order of preference."""
     configs = [
+        {"name": "OpenRouter-Qwen3-VL", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-vl-235b-thinking:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "NVIDIA-Nemotron-VL", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-nano-2-vl:free", "base_url": "https://integrate.api.nvidia.com/v1"},
         {"name": "OpenRouter-Gemma-3-27b-it", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
         {"name": "Google-Gemini-2.0-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash"},
         {"name": "Google-Gemini-Flash-Latest", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-flash-latest"},
-        {"name": "NVIDIA-Vision-Llama-11b", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "meta/llama-3.2-11b-vision-instruct", "base_url": "https://integrate.api.nvidia.com/v1"},
-        {"name": "NVIDIA-Vision-Llama-90b", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "meta/llama-3.2-90b-vision-instruct", "base_url": "https://integrate.api.nvidia.com/v1"},
     ]
     models = []
     for cfg in configs:
@@ -398,7 +407,7 @@ def run_python_script(code: str) -> str:
         ["python", temp_file_name],
         capture_output=True,
         text=True,
-        timeout=30
+        timeout=60
     )
     os.remove(temp_file_name)
 
@@ -409,7 +418,7 @@ def run_python_script(code: str) -> str:
         return (output or "Script executed successfully with no output.")[:15000]
     except subprocess.TimeoutExpired:
         os.remove(temp_file_name)
-        return "Script execution timed out after 30 seconds."
+        return "Script execution timed out after 60 seconds."
     except Exception as e:
         if os.path.exists(temp_file_name):
             os.remove(temp_file_name)
@@ -493,17 +502,20 @@ def answer_message(state: AgentState) -> AgentState:
     current_date = datetime.datetime.now().strftime("%Y-%m-%d")
 
     prompt = [SystemMessage(f"""
-    You are a master of the GAIA benchmark, a general AI assistant designed to solve complex multi-step tasks.
-    Think carefully and logically. Use your tools effectively. Use your internal monologue to plan your steps.
-
-    TODAY'S EXACT DATE is {current_date}. Keep this in mind for all time-sensitive queries.
-
-    CRITICAL RULES:
-    1. If you see a path like `[Attached File Local Path: ...]` followed by an image, video, or audio file, YOU MUST USE THE CORRESPONDING TOOL (analyze_image, analyze_video, analyze_audio) IMMEDIATELY in your next step.
-    2. Plan your steps ahead. 8 steps is your LIMIT for the reasoning loop, so make every step count.
-    3. If a tool fails (e.g., 429 or 402), the system will automatically try another model for you, so just keep going!
-    4. Be concise and accurate. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list.
-    """)]
+    You are a master of the GAIA benchmark, a general AI assistant designed to solve complex multi-step tasks.
+    Think carefully and logically. Use your tools effectively. Use your internal monologue to plan your steps.
+
+    TODAY'S EXACT DATE is {current_date}. Keep this in mind for all time-sensitive queries.
+
+    CRITICAL RULES:
+    1. If you see a path like `[Attached File Local Path: ...]` followed by an image, video, or audio file, YOU MUST USE THE CORRESPONDING TOOL (analyze_image, analyze_video, analyze_audio) IMMEDIATELY in your next step.
+    2. Plan your steps ahead. 12 steps is your LIMIT for the reasoning loop, so make every step count.
+    3. If a tool fails (e.g., 429 or 402), the system will automatically try another model for you, so just keep going!
+    4. Be concise and accurate. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list.
+    5. CHAIN-OF-THOUGHT: For complex questions, show your reasoning step by step before giving the final answer.
+    6. USE TOOLS AGGRESSIVELY: If a question requires computation, file reading, or web search, use the appropriate tools - don't try to answer from memory.
+    7. VERIFY YOUR ANSWER: Double-check calculations and facts using tools when uncertain.
+    """)]
     messages = prompt + messages
 
     # Force tool usage if image path is detected
@@ -511,8 +523,8 @@
         if isinstance(msg, HumanMessage) and "[Attached File Local Path:" in msg.content:
             messages.append(HumanMessage(content="IMPORTANT: I see an image path in the message. I MUST call the analyze_image tool IMMEDIATELY in my next step to see it."))
 
-    # Multi-step ReAct Loop (Up to 8 reasoning steps)
-    max_steps = 8
+    # Multi-step ReAct Loop (Up to 12 reasoning steps)
+    max_steps = 12
     draft_response = None
     current_tier = 0
 
@@ -566,10 +578,10 @@
             "You are a strict output formatter for the GAIA benchmark. "
             "Given a verbose draft answer, extract ONLY the final exact answer required. "
             "Return nothing else. DO NOT include prefixes like 'The answer is'. "
-            "Strip trailing punctuation like periods and quotes. "
+            "Strip trailing whitespace only. "
             "If the answer is a number, just return the number. "
            "If the answer is a list or set of elements, return them as a COMMA-SEPARATED list (e.g., 'a, b, c'). "
-            "DO NOT strip commas that separate list items."
+            "Preserve necessary punctuation within answers (e.g., 'Dr. Smith' should keep the period)."
         )
     )
     final_response, _ = smart_invoke([formatting_sys, HumanMessage(content=extract_text_from_content(draft_response.content))], use_tools=False, start_tier=current_tier)
test_improvements.py ADDED
@@ -0,0 +1,45 @@
+import os
+import sys
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+from langchain_core.messages import HumanMessage
+from agent import build_graph
+
+def test_agent():
+    print("=" * 50)
+    print("Testing GAIA Agent with improvements...")
+    print("=" * 50)
+
+    # Build the agent graph
+    graph = build_graph()
+
+    # Test question - simple math/reasoning
+    test_question = "What is 15 + 27?"
+
+    print(f"\nQuestion: {test_question}")
+    print("-" * 30)
+
+    messages = [HumanMessage(content=test_question)]
+    result = graph.invoke({"messages": messages})
+
+    answer = result['messages'][-1].content
+    print(f"\nFinal Answer: {answer}")
+    print("-" * 30)
+
+    # Test another question requiring web search
+    test_question2 = "What is the capital of France?"
+
+    print(f"\nQuestion: {test_question2}")
+    print("-" * 30)
+
+    messages2 = [HumanMessage(content=test_question2)]
+    result2 = graph.invoke({"messages": messages2})
+
+    answer2 = result2['messages'][-1].content
+    print(f"\nFinal Answer: {answer2}")
+    print("-" * 30)
+
+    print("\nTest completed successfully!")
+
+if __name__ == "__main__":
+    test_agent()