Paperbag committed on
Commit 09f0257 · 1 Parent(s): 8d79810
.opencode/plans/gaia_improvements.md ADDED
@@ -0,0 +1,131 @@
# GAIA Agent Improvements - Implementation Guide

## Phase 1: Update LLM Tiers (Lines 63-69 in agent.py)

Replace the `tiers_config` list with:

```python
tiers_config = [
    {"name": "OpenRouter-FreeRouter", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "openrouter/free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "DeepSeek-R1", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "deepseek/deepseek-r1:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Qwen3-Next-80B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-next-80b-a3b-instruct:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "NVIDIA-Nemotron-Super", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-3-super-120b-a12b:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Gemma-3-27B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Gemini-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash", "alternatives": gemini_alternatives},
    {"name": "Groq", "key": "GROQ_API_KEY", "provider": "groq", "model_name": "llama-3.3-70b-versatile"},
]
```
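For context on how these tiers are consumed: `smart_invoke` walks the list in order and falls back to the next tier on failure. A minimal sketch of the key-availability part of that selection — `pick_tier` is a hypothetical name, and the trimmed config list is illustrative, not the full entries above:

```python
import os

# Trimmed, illustrative subset of the tiers above (names match, other fields omitted).
tiers_config = [
    {"name": "OpenRouter-FreeRouter", "key": "OPENROUTER_API_KEY"},
    {"name": "Gemini-Flash", "key": "GOOGLE_API_KEY"},
    {"name": "Groq", "key": "GROQ_API_KEY"},
]

def pick_tier(configs, start_tier=0):
    """Return (index, name) of the first tier from start_tier whose API key is set."""
    for i in range(start_tier, len(configs)):
        if os.environ.get(configs[i]["key"]):
            return i, configs[i]["name"]
    raise RuntimeError("No provider with a configured API key")
```

Note that every OpenRouter entry shares `OPENROUTER_API_KEY`, so one missing key skips five tiers at once; that is why keeping the Gemini and Groq fallbacks matters.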

## Phase 2: Update Vision Models (Lines 180-186 in agent.py)

Replace the vision model configs with:

```python
configs = [
    {"name": "OpenRouter-Qwen3-VL", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-vl-235b-thinking:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "NVIDIA-Nemotron-VL", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-nano-2-vl:free", "base_url": "https://integrate.api.nvidia.com/v1"},
    {"name": "OpenRouter-Gemma-3-27b-it", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
    {"name": "Google-Gemini-2.0-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash"},
    {"name": "Google-Gemini-Flash-Latest", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-flash-latest"},
]
```
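`get_vision_models` then turns this config list into the actual fallback chain, skipping entries whose API key is absent (the `models = []` / `for cfg in configs:` loop in agent.py). A hypothetical distilled version of that filter, with a shortened config list for illustration:

```python
import os

# Illustrative subset of the vision configs above (other fields omitted).
configs = [
    {"name": "OpenRouter-Qwen3-VL", "key": "OPENROUTER_API_KEY"},
    {"name": "Google-Gemini-2.0-Flash", "key": "GOOGLE_API_KEY"},
    {"name": "Google-Gemini-Flash-Latest", "key": "GOOGLE_API_KEY"},
]

def available_models(configs):
    """Keep only configs whose API key is present, preserving preference order."""
    return [cfg["name"] for cfg in configs if os.environ.get(cfg["key"])]
```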

## Phase 3: Enhance System Prompt (Lines 495-506 in agent.py)

Replace the system prompt in the answer_message function with:

```python
prompt = [SystemMessage(f"""
You are a master of the GAIA benchmark, a general AI assistant designed to solve complex multi-step tasks.
Think carefully and logically. Use your tools effectively. Use your internal monologue to plan your steps.

TODAY'S EXACT DATE is {current_date}. Keep this in mind for all time-sensitive queries.

CRITICAL RULES:
1. If you see a path like `[Attached File Local Path: ...]` followed by an image, video, or audio file, YOU MUST USE THE CORRESPONDING TOOL (analyze_image, analyze_video, analyze_audio) IMMEDIATELY in your next step.
2. Plan your steps ahead. 12 steps is your LIMIT for the reasoning loop, so make every step count.
3. If a tool fails (e.g., 429 or 402), the system will automatically try another model for you, so just keep going!
4. Be concise and accurate. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list.
5. CHAIN-OF-THOUGHT: For complex questions, show your reasoning step by step before giving the final answer.
6. USE TOOLS AGGRESSIVELY: If a question requires computation, file reading, or web search, use the appropriate tools - don't try to answer from memory.
7. VERIFY YOUR ANSWER: Double-check calculations and facts using tools when uncertain.
""")]
```
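The `{current_date}` placeholder is filled from the line already present just above the prompt in agent.py:

```python
import datetime

# Same expression agent.py uses to stamp the prompt with today's date.
current_date = datetime.datetime.now().strftime("%Y-%m-%d")
```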

## Phase 4: Increase Max Reasoning Steps (Line 515)

Change:

```python
max_steps = 8
```

To:

```python
max_steps = 12
```
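Why the cap matters can be seen with a toy step-capped loop — `react_loop` and `step_fn` are illustrative names, not the agent's actual control flow:

```python
def react_loop(step_fn, max_steps=12):
    """Call step_fn until it signals completion or the step budget runs out."""
    for step in range(max_steps):
        done, result = step_fn(step)
        if done:
            return result
    return None  # budget exhausted without a final answer
```

A task that needs, say, 10 reasoning steps fails under the old cap of 8 but succeeds under 12.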

## Phase 5: Fix Answer Extraction (Lines 564-575)

Replace the formatting system message with:

```python
formatting_sys = SystemMessage(
    content=(
        "You are a strict output formatter for the GAIA benchmark. "
        "Given a verbose draft answer, extract ONLY the final exact answer required. "
        "Return nothing else. DO NOT include prefixes like 'The answer is'. "
        "Strip trailing whitespace only. "
        "If the answer is a number, just return the number. "
        "If the answer is a list or set of elements, return them as a COMMA-SEPARATED list (e.g., 'a, b, c'). "
        "Preserve necessary punctuation within answers (e.g., 'Dr. Smith' should keep the period)."
    )
)
```
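The extraction itself is done by an LLM, but a rough local approximation of these rules (`normalize_answer` is purely illustrative, not part of agent.py) makes the intended behavior concrete: wrappers and trailing whitespace go, punctuation inside the answer stays.

```python
def normalize_answer(raw: str) -> str:
    """Illustrative approximation of the formatting rules: strip common
    prefixes and surrounding whitespace, keep punctuation inside the answer."""
    ans = raw.strip()
    for prefix in ("the answer is", "final answer:", "answer:"):
        if ans.lower().startswith(prefix):
            ans = ans[len(prefix):].lstrip(" :")
    return ans
```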

## Phase 6: Tool Improvements

### 6a. Web Search Retry (Lines 146-152 in web_search tool)

Replace the `web_search` tool body with the following (make sure `time` is imported at the top of agent.py, since the backoff uses `time.sleep`):

```python
@tool
def web_search(keywords: str) -> str:
    """
    Uses DuckDuckGo to search the web and return the top 5 results.
    """
    max_retries = 3
    for attempt in range(max_retries):
        try:
            with DDGS() as ddgs:
                output = ""
                results = ddgs.text(keywords, max_results=5)
                for result in results:
                    output += f"Results: {result['title']}\n{result['body']}\n{result['href']}\n\n"
            return output
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, then 2s
                continue
            return f"Search failed after {max_retries} attempts: {str(e)}"
```
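The same retry pattern generalizes to any flaky call. A hypothetical `with_retries` helper — not part of agent.py, with a `delay_base` parameter added so the backoff can be shortened in tests — captures the logic:

```python
import time

def with_retries(fn, max_retries=3, delay_base=1.0):
    """Retry fn, sleeping delay_base * 2**attempt between attempts
    (1s then 2s with the defaults), mirroring the web_search pattern."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(delay_base * (2 ** attempt))
                continue
            return f"failed after {max_retries} attempts: {e}"
```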

### 6b. Python Script Timeout (Line 401)

Change:

```python
timeout=30
```

To:

```python
timeout=60
```

Also update the matching message on the `subprocess.TimeoutExpired` branch to say 60 seconds instead of 30.
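The behavior being tuned here follows `subprocess.run` semantics: once the limit elapses, `subprocess.TimeoutExpired` is raised. A self-contained sketch mirroring run_python_script's shape — `run_snippet` is a hypothetical name, and `sys.executable` is used instead of a bare `"python"` so it works regardless of PATH:

```python
import subprocess
import sys

def run_snippet(code: str, timeout: int = 60) -> str:
    """Run a code string in a fresh interpreter with a wall-clock timeout."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout or "Script executed successfully with no output."
    except subprocess.TimeoutExpired:
        return f"Script execution timed out after {timeout} seconds."
```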

---

## Expected Impact

- Phase 1-2 (Models): +8-12 questions (40-60% improvement)
- Phase 3-4 (Prompt): +2-4 questions (10-20% improvement)
- Phase 5-6 (Tools): +1-2 questions (5-10% improvement)

Total: From 2/20 (10%) to 13-18/20 (65-90%)
__pycache__/agent.cpython-310.pyc ADDED
Binary file (20.7 kB).
 
__pycache__/agent.cpython-312.pyc CHANGED
Binary files a/__pycache__/agent.cpython-312.pyc and b/__pycache__/agent.cpython-312.pyc differ
 
agent.py CHANGED
@@ -61,11 +61,13 @@ def smart_invoke(msgs, use_tools=False, start_tier=0):
     gemini_alternatives = ["gemini-2.5-flash", "gemini-2.0-flash", "gemini-flash-latest", "gemini-pro-latest"]
 
     tiers_config = [
-        {"name": "OpenRouter", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "minimax/minimax-m2.5:free", "base_url": "https://openrouter.ai/api/v1"},
-        {"name": "Gemini", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash", "alternatives": gemini_alternatives},
+        {"name": "OpenRouter-FreeRouter", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "openrouter/free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "DeepSeek-R1", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "deepseek/deepseek-r1:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "Qwen3-Next-80B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-next-80b-a3b-instruct:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "NVIDIA-Nemotron-Super", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-3-super-120b-a12b:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "Gemma-3-27B", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "Gemini-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash", "alternatives": gemini_alternatives},
         {"name": "Groq", "key": "GROQ_API_KEY", "provider": "groq", "model_name": "llama-3.3-70b-versatile"},
-        {"name": "NVIDIA", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "meta/llama-3.3-70b-instruct", "base_url": "https://integrate.api.nvidia.com/v1"},
-        {"name": "Vercel", "key": "VERCEL_API_KEY", "provider": "openai", "model_name": "meta-llama/llama-3.3-70b-instruct", "base_url": "https://gateway.ai.vercel.com/v1"},
     ]
 
     last_exception = None
@@ -137,19 +139,26 @@ def web_search(keywords: str) -> str:
     - Finding organisation information
     - Obtain the latest news
 
-    Args:
-        keywords: keywords used to search the web
+    Args:
+        keywords: keywords used to search the web
 
-    Returns:
-        Search result (Header + body + url)
-    """
-    with DDGS() as ddgs:
-        # Perform a text search
-        output = ""
-        results = ddgs.text(keywords, max_results = 5)
-        for result in results:
-            output += f"Results: {result['title']}\n{result['body']}\n{result['href']}\n\n"
-    return(output)
+    Returns:
+        Search result (Header + body + url)
+    """
+    max_retries = 3
+    for attempt in range(max_retries):
+        try:
+            with DDGS() as ddgs:
+                output = ""
+                results = ddgs.text(keywords, max_results = 5)
+                for result in results:
+                    output += f"Results: {result['title']}\n{result['body']}\n{result['href']}\n\n"
+            return output
+        except Exception as e:
+            if attempt < max_retries - 1:
+                time.sleep(2 ** attempt)
+                continue
+            return f"Search failed after {max_retries} attempts: {str(e)}"
 
 @tool
 def wiki_search(query: str) -> str:
@@ -178,11 +187,11 @@ def wiki_search(query: str) -> str:
 def get_vision_models():
     """Returns a list of vision models to try, in order of preference."""
     configs = [
+        {"name": "OpenRouter-Qwen3-VL", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "qwen/qwen3-vl-235b-thinking:free", "base_url": "https://openrouter.ai/api/v1"},
+        {"name": "NVIDIA-Nemotron-VL", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "nvidia/nemotron-nano-2-vl:free", "base_url": "https://integrate.api.nvidia.com/v1"},
         {"name": "OpenRouter-Gemma-3-27b-it", "key": "OPENROUTER_API_KEY", "provider": "openai", "model_name": "google/gemma-3-27b-it:free", "base_url": "https://openrouter.ai/api/v1"},
         {"name": "Google-Gemini-2.0-Flash", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-2.0-flash"},
         {"name": "Google-Gemini-Flash-Latest", "key": "GOOGLE_API_KEY", "provider": "google", "model_name": "gemini-flash-latest"},
-        {"name": "NVIDIA-Vision-Llama-11b", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "meta/llama-3.2-11b-vision-instruct", "base_url": "https://integrate.api.nvidia.com/v1"},
-        {"name": "NVIDIA-Vision-Llama-90b", "key": "NVIDIA_API_KEY", "provider": "openai", "model_name": "meta/llama-3.2-90b-vision-instruct", "base_url": "https://integrate.api.nvidia.com/v1"},
     ]
     models = []
     for cfg in configs:
@@ -398,7 +407,7 @@ def run_python_script(code: str) -> str:
         ["python", temp_file_name],
         capture_output=True,
         text=True,
-        timeout=30
+        timeout=60
     )
     os.remove(temp_file_name)
 
@@ -409,7 +418,7 @@ def run_python_script(code: str) -> str:
         return (output or "Script executed successfully with no output.")[:15000]
     except subprocess.TimeoutExpired:
         os.remove(temp_file_name)
-        return "Script execution timed out after 30 seconds."
+        return "Script execution timed out after 60 seconds."
     except Exception as e:
         if os.path.exists(temp_file_name):
             os.remove(temp_file_name)
@@ -493,17 +502,20 @@ def answer_message(state: AgentState) -> AgentState:
     current_date = datetime.datetime.now().strftime("%Y-%m-%d")
 
     prompt = [SystemMessage(f"""
-    You are a master of the GAIA benchmark, a general AI assistant designed to solve complex multi-step tasks.
-    Think carefully and logically. Use your tools effectively. Use your internal monologue to plan your steps.
-
-    TODAY'S EXACT DATE is {current_date}. Keep this in mind for all time-sensitive queries.
-
-    CRITICAL RULES:
-    1. If you see a path like `[Attached File Local Path: ...]` followed by an image, video, or audio file, YOU MUST USE THE CORRESPONDING TOOL (analyze_image, analyze_video, analyze_audio) IMMEDIATELY in your next step.
-    2. Plan your steps ahead. 8 steps is your LIMIT for the reasoning loop, so make every step count.
-    3. If a tool fails (e.g., 429 or 402), the system will automatically try another model for you, so just keep going!
-    4. Be concise and accurate. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list.
-    """)]
+    You are a master of the GAIA benchmark, a general AI assistant designed to solve complex multi-step tasks.
+    Think carefully and logically. Use your tools effectively. Use your internal monologue to plan your steps.
+
+    TODAY'S EXACT DATE is {current_date}. Keep this in mind for all time-sensitive queries.
+
+    CRITICAL RULES:
+    1. If you see a path like `[Attached File Local Path: ...]` followed by an image, video, or audio file, YOU MUST USE THE CORRESPONDING TOOL (analyze_image, analyze_video, analyze_audio) IMMEDIATELY in your next step.
+    2. Plan your steps ahead. 12 steps is your LIMIT for the reasoning loop, so make every step count.
+    3. If a tool fails (e.g., 429 or 402), the system will automatically try another model for you, so just keep going!
+    4. Be concise and accurate. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list.
+    5. CHAIN-OF-THOUGHT: For complex questions, show your reasoning step by step before giving the final answer.
+    6. USE TOOLS AGGRESSIVELY: If a question requires computation, file reading, or web search, use the appropriate tools - don't try to answer from memory.
+    7. VERIFY YOUR ANSWER: Double-check calculations and facts using tools when uncertain.
+    """)]
     messages = prompt + messages
 
     # Force tool usage if image path is detected
@@ -511,8 +523,8 @@
         if isinstance(msg, HumanMessage) and "[Attached File Local Path:" in msg.content:
             messages.append(HumanMessage(content="IMPORTANT: I see an image path in the message. I MUST call the analyze_image tool IMMEDIATELY in my next step to see it."))
 
-    # Multi-step ReAct Loop (Up to 8 reasoning steps)
-    max_steps = 8
+    # Multi-step ReAct Loop (Up to 12 reasoning steps)
+    max_steps = 12
     draft_response = None
     current_tier = 0
 
@@ -566,10 +578,10 @@
             "You are a strict output formatter for the GAIA benchmark. "
             "Given a verbose draft answer, extract ONLY the final exact answer required. "
             "Return nothing else. DO NOT include prefixes like 'The answer is'. "
-            "Strip trailing punctuation like periods and quotes. "
+            "Strip trailing whitespace only. "
             "If the answer is a number, just return the number. "
            "If the answer is a list or set of elements, return them as a COMMA-SEPARATED list (e.g., 'a, b, c'). "
-            "DO NOT strip commas that separate list items."
+            "Preserve necessary punctuation within answers (e.g., 'Dr. Smith' should keep the period)."
         )
     )
     final_response, _ = smart_invoke([formatting_sys, HumanMessage(content=extract_text_from_content(draft_response.content))], use_tools=False, start_tier=current_tier)
test_improvements.py ADDED
@@ -0,0 +1,45 @@
+import os
+import sys
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+from langchain_core.messages import HumanMessage
+from agent import build_graph
+
+def test_agent():
+    print("=" * 50)
+    print("Testing GAIA Agent with improvements...")
+    print("=" * 50)
+
+    # Build the agent graph
+    graph = build_graph()
+
+    # Test question - simple math/reasoning
+    test_question = "What is 15 + 27?"
+
+    print(f"\nQuestion: {test_question}")
+    print("-" * 30)
+
+    messages = [HumanMessage(content=test_question)]
+    result = graph.invoke({"messages": messages})
+
+    answer = result['messages'][-1].content
+    print(f"\nFinal Answer: {answer}")
+    print("-" * 30)
+
+    # Test another question requiring web search
+    test_question2 = "What is the capital of France?"
+
+    print(f"\nQuestion: {test_question2}")
+    print("-" * 30)
+
+    messages2 = [HumanMessage(content=test_question2)]
+    result2 = graph.invoke({"messages": messages2})
+
+    answer2 = result2['messages'][-1].content
+    print(f"\nFinal Answer: {answer2}")
+    print("-" * 30)
+
+    print("\nTest completed successfully!")
+
+if __name__ == "__main__":
+    test_agent()