Final_Assignment_AGENT_GAIA

Isateles committed on May 30, 2025

Commit

394d24e

1 Parent(s): 3120256

Update GAIA agent-fixes and test code

Files changed (8)

README.md +117 -0
__pycache__/app.cpython-312.pyc +0 -0
__pycache__/tools.cpython-312.pyc +0 -0
app.py +199 -106
test_gaia_agent.py +420 -0
test_google_search.py +143 -0
test_local.py +0 -216
tools.py +236 -29

README.md CHANGED

@@ -34,6 +34,123 @@ My agent uses:
 ## 🔧 Tools Implemented
 1. **Web Search** (`web_search`): Uses DuckDuckGo to find current information
 2. **Calculator** (`calculator`): Handles math, percentages, and word problems
 3. **File Analyzer** (`file_analyzer`): Analyzes CSV and text files

 ## 🔧 Tools Implemented
+1. **Web Search** (`web_search`): Uses Google Search (with DuckDuckGo fallback)
+2. **Calculator** (`calculator`): Handles math, percentages, and word problems
+3. **File Analyzer** (`file_analyzer`): Analyzes CSV and text files
+4. **Weather** (`weather`): Real weather data using OpenWeather API
+5. **Persona Database** (`persona_database`): RAG system for finding personas
+## 💡 Key Insights
+The biggest challenge was understanding that the course evaluation uses **exact match** on clean answers. The GAIA prompt helps the agent reason well, but I needed to extract just the answer part (without "FINAL ANSWER:") for submission.
+### Smart Agent Strategy:
+- **Knowledge First**: The agent tries to answer from its extensive knowledge (up to January 2025)
+- **Search When Needed**: Only searches for current info, verification, or when explicitly asked
+- **Google Priority**: Uses Google Custom Search first (most reliable in HF Spaces)
+- **DuckDuckGo Fallback**: Multiple methods to ensure search works even if one fails
+- **Clean Answers**: Extracts exactly what GAIA expects (no units, articles, or formatting)
+## 🚀 Features
+- Clean answer extraction aligned with GAIA scoring rules
+- Handles numbers without commas/units as required
+- Properly formats lists and yes/no answers
+- RAG integration for persona queries
+- Real weather data when API key is available
+- Fallback mechanisms for robustness
+## 📋 Requirements
+All dependencies are in `requirements.txt`. The key ones are:
+- LlamaIndex (core framework)
+- Gradio (web interface)
+- ChromaDB (vector storage)
+- DuckDuckGo Search (web tool)
+## 🔑 API Keys Needed
+Add these to your HuggingFace Space secrets:
+- `GROQ_API_KEY` (recommended - fast and free)
+- `ANTHROPIC_API_KEY` or `CLAUDE_API_KEY` (best performance)
+- `TOGETHER_API_KEY` (good alternative)
+- `HF_TOKEN` (free fallback)
+- `OPENAI_API_KEY` (if you have credits)
+### For Web Search:
+- `GOOGLE_API_KEY` (required for Google Search)
+  - Your Google Custom Search Engine ID is already configured: `746382dd3c2bd4135`
+  - Google Search is prioritized first, then DuckDuckGo as fallback
+  - If you see "quota exceeded", check your Google Cloud Console usage
+### Optional:
+- `OPENWEATHER_API_KEY` (for real weather data)
+## 🔍 Troubleshooting Web Search
+If Google Search isn't working:
+1. Check your API key is correct in HF Secrets
+2. Verify the Custom Search API is enabled in Google Cloud Console
+3. Check your quota hasn't been exceeded (100 queries/day free tier)
+4. The CSE ID `746382dd3c2bd4135` should work, but you can override with `GOOGLE_CSE_ID` env var
+If all web search fails, the agent will use its knowledge base (up to Jan 2025).
+## 📊 Expected Performance
+Based on my testing and understanding of GAIA:
+- Math questions: Should score well with the calculator tool
+- Factual questions: Web search helps find current information
+- Data questions: File analyzer handles CSV analysis
+- Simple logic: GAIA prompt guides proper reasoning
+Target: 30%+ to pass the course!
+## 🛠️ How It Works
+1. **Question Processing**: Agent receives a GAIA question
+2. **Tool Selection**: Uses the right tools based on the question
+3. **Reasoning**: Follows GAIA prompt to think through the problem
+4. **Answer Extraction**: Extracts clean answer for exact match
+5. **Submission**: Sends properly formatted answer to evaluation
+## 📝 Course Learnings Applied
+- **Agent Architecture**: Using AgentWorkflow as taught in the course
+- **Tool Integration**: Each tool has a clear purpose and description
+- **RAG System**: Persona database shows RAG implementation
+- **Prompt Engineering**: GAIA prompt for structured reasoning
+- **Error Handling**: Graceful fallbacks instead of crashes
+## 🎯 Goal
+Pass the GAIA evaluation with 30%+ score by applying everything learned in the AI Agents course!
+---
+*This project demonstrates practical application of agent concepts, tool integration, RAG systems, and prompt engineering as taught in the course.*
+This is my submission for the AI Agents course final project. I've built a RAG agent to tackle the GAIA benchmark using everything we learned in the course!
+## 🎓 What I Learned & Applied
+Throughout this course, I learned about:
+- Building agents with LlamaIndex AgentWorkflow
+- Creating and integrating tools (web search, calculator, file analysis)
+- Implementing RAG systems with vector databases
+- Proper prompting techniques for agent systems
+- Working with multiple LLM providers
+## 🏗️ Architecture
+My agent uses:
+- **LlamaIndex AgentWorkflow**: For orchestrating the agent's reasoning
+- **Multiple LLMs**: Supports Claude, Groq, Together AI, HuggingFace, and OpenAI
+- **ChromaDB**: For the persona RAG database
+- **GAIA System Prompt**: To ensure proper reasoning and answer formatting
+## 🔧 Tools Implemented
 1. **Web Search** (`web_search`): Uses DuckDuckGo to find current information
 2. **Calculator** (`calculator`): Handles math, percentages, and word problems
 3. **File Analyzer** (`file_analyzer`): Analyzes CSV and text files
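
The search priority described in the README above (Google first, DuckDuckGo as fallback) can be illustrated with a generic fallback chain. This is only a sketch under stated assumptions: `search_with_fallback`, `fake_google`, and `fake_duckduckgo` are hypothetical names invented for illustration and are not taken from `tools.py`, whose diff is not shown here.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is any callable that takes a query and returns result text.
SearchFn = Callable[[str], str]


def search_with_fallback(
    query: str, providers: Sequence[Tuple[str, SearchFn]]
) -> Optional[str]:
    """Try each provider in order; return the first non-empty result."""
    for name, provider in providers:
        try:
            result = provider(query)
        except Exception as exc:  # e.g. quota exceeded, network error
            print(f"{name} failed: {exc}")
            continue
        if result:
            return result
    # Every provider failed or returned nothing.
    return None


# Hypothetical stand-ins for the real search backends.
def fake_google(query: str) -> str:
    raise RuntimeError("quota exceeded")


def fake_duckduckgo(query: str) -> str:
    return f"results for {query}"


answer = search_with_fallback(
    "Python", [("google", fake_google), ("duckduckgo", fake_duckduckgo)]
)
print(answer)
```

Returning `None` when every provider fails lets the caller decide what to do next, for example answering from the model's own knowledge as the README describes.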

__pycache__/app.cpython-312.pyc CHANGED

Binary files a/__pycache__/app.cpython-312.pyc and b/__pycache__/app.cpython-312.pyc differ

__pycache__/tools.cpython-312.pyc ADDED

Binary file (26.2 kB).

app.py CHANGED

@@ -12,151 +12,162 @@ import logging
 import re
 import string
 from typing import List, Dict, Any, Optional
 # Logging setup
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 # Constants
 GAIA_API_URL = "https://agents-course-unit4-scoring.hf.space"
 PASSING_SCORE = 30
-# GAIA System Prompt - for internal reasoning
-GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string."""
 def setup_llm():
     """Initialize the best available LLM"""
-    # Priority: Claude > Groq > Together > HF > OpenAI
-    # if api_key := (os.getenv("ANTHROPIC_API_KEY") or os.getenv("CLAUDE_API_KEY")):
-    #     try:
-    #         from llama_index.llms.anthropic import Anthropic
-    #         llm = Anthropic(
-    #             api_key=api_key,
-    #             model="claude-3-5-sonnet-20241022",
-    #             temperature=0.0,
-    #             max_tokens=2048
-    #         )
-    #         logger.info("✅ Using Claude 3.5 Sonnet")
-    #         return llm
-    #     except Exception as e:
-    #         logger.warning(f"Claude setup failed: {e}")
     if api_key := os.getenv("GROQ_API_KEY"):
         try:
             from llama_index.llms.groq import Groq
             llm = Groq(
                 api_key=api_key,
-                model="meta-llama/llama-4-scout-17b-16e-instruct",
                 temperature=0.0,
                 max_tokens=2048
             )
-            logger.info("✅ Using Groq Llama 3 70B")
             return llm
         except Exception as e:
             logger.warning(f"Groq setup failed: {e}")
     if api_key := os.getenv("TOGETHER_API_KEY"):
         try:
-            from llama_index.llms.together import Together
-            llm = Together(
                 api_key=api_key,
-                model="deepseek-ai/DeepSeek-V3",
                 temperature=0.0,
                 max_tokens=2048
             )
-            logger.info("✅ Using Together AI")
             return llm
         except Exception as e:
             logger.warning(f"Together setup failed: {e}")
-    if api_key := os.getenv("HF_TOKEN"):
-        try:
-            from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
-            llm = HuggingFaceInferenceAPI(
-                model_name="meta-llama/Llama-3.1-70B-Instruct",
-                token=api_key,
-                temperature=0.0
-            )
-            logger.info("✅ Using HuggingFace")
-            return llm
-        except Exception as e:
-            logger.warning(f"HuggingFace setup failed: {e}")
-    if api_key := os.getenv("OPENAI_API_KEY"):
-        try:
-            from llama_index.llms.openai import OpenAI
-            llm = OpenAI(
-                api_key=api_key,
-                model="gpt-4o-mini",
-                temperature=0.0,
-                max_tokens=2048
-            )
-            logger.info("✅ Using OpenAI")
-            return llm
-        except Exception as e:
-            logger.warning(f"OpenAI setup failed: {e}")
-    raise RuntimeError("No LLM API key found! Set one of: ANTHROPIC_API_KEY, GROQ_API_KEY, TOGETHER_API_KEY, HF_TOKEN, OPENAI_API_KEY")
 def extract_final_answer(response_text: str) -> str:
     """Extract answer aligned with GAIA scoring rules"""
     match = re.search(r"FINAL ANSWER:\s*(.+?)(?:\n|$)", response_text, re.IGNORECASE | re.DOTALL)
     if not match:
-        logger.warning("No FINAL ANSWER found")
-        return ""
-    answer = match.group(1).strip()
     # Clean for GAIA scoring
-    # 1. Numbers: remove units and formatting
-    if re.match(r'^[\d$%,.\s]+$', answer):
-        cleaned = answer.replace('$', '').replace('%', '').replace(',', '')
         try:
             num = float(cleaned)
             return str(int(num)) if num.is_integer() else str(num)
         except:
             pass
-    # 2. Lists: consistent comma separation
-    if ',' in answer or ';' in answer:
-        items = re.split(r'[,;]', answer)
-        cleaned_items = []
-        for item in items:
-            item = item.strip()
             # Try to parse as number
             try:
-                cleaned = item.replace('$', '').replace('%', '').replace(',', '')
-                num = float(cleaned)
-                cleaned_items.append(str(int(num)) if num.is_integer() else str(num))
             except:
-                # Keep as string
-                cleaned_items.append(item)
-        return ', '.join(cleaned_items)
-    # 3. Yes/no: lowercase
     if answer.lower() in ['yes', 'no']:
         return answer.lower()
-    # 4. Single words/strings: remove articles if at start
     words = answer.split()
     if words and words[0].lower() in ['the', 'a', 'an']:
         return ' '.join(words[1:])
     return answer
 class GAIAAgent:
     """GAIA RAG Agent using LlamaIndex AgentWorkflow"""
     def __init__(self):
         logger.info("Initializing GAIA RAG Agent...")
         # Initialize LLM
         self.llm = setup_llm()
@@ -184,44 +195,120 @@ class GAIAAgent:
         """Process a question and return clean answer for course submission"""
         logger.info(f"Processing question: {question[:100]}...")
         try:
-            # Run agent asynchronously
             loop = asyncio.new_event_loop()
             asyncio.set_event_loop(loop)
             try:
                 async def run_agent():
-                    handler = self.agent.run(user_msg=question)
-                    # Log tool usage
-                    from llama_index.core.agent.workflow import ToolCallResult
-                    async for event in handler.stream_events():
-                        if isinstance(event, ToolCallResult):
-                            logger.info(f"Tool used: {event.tool_name}")
-                    result = await handler
-                    return result
-                result = loop.run_until_complete(run_agent())
-                # Extract response text
-                if hasattr(result, 'response'):
-                    response_text = str(result.response)
-                else:
-                    response_text = str(result)
-                # Extract clean answer (no "FINAL ANSWER:" prefix)
                 clean_answer = extract_final_answer(response_text)
-                logger.info(f"Final answer: '{clean_answer}'")
                 return clean_answer
             finally:
                 loop.close()
         except Exception as e:
             logger.error(f"Error processing question: {e}")
-            return ""
 def run_and_submit_all(profile: gr.OAuthProfile | None):
     """Run GAIA evaluation following course template structure"""
@@ -351,24 +438,29 @@ Message: {result_data.get('message', 'Evaluation complete')}"""
         return error_msg, pd.DataFrame(results_log)
 # Gradio Interface
-with gr.Blocks(title="GAIA RAG Agent") as demo:
-    gr.Markdown("# Isadora Teles - GAIA Agent - Final HF Agents Project")
     gr.Markdown("""
-    This is a RAG agent implementation, with multiple tools, for the GAIA benchmark.
-    TEST 2.
     **Features:**
-    - 🧠 LlamaIndex AgentWorkflow with GAIA prompt
-    - 🔍 Web search for current information
-    - 🧮 Calculator for mathematical problems
     - 📊 File analyzer for data questions
-    - 👥 RAG persona database
     - ✅ Clean answer extraction for exact match
     **Instructions:**
     1. Log in with HuggingFace account
     2. Click 'Run Evaluation & Submit All Answers'
-    3. Wait for the agent to process all questions (5-10 minutes)
     4. Check your score!
     """)
@@ -407,20 +499,21 @@ if __name__ == "__main__":
     # Check API keys
     api_keys = [
-        #("Claude", os.getenv("ANTHROPIC_API_KEY") or os.getenv("CLAUDE_API_KEY")),
         ("Groq", os.getenv("GROQ_API_KEY")),
         ("Together", os.getenv("TOGETHER_API_KEY")),
         ("HuggingFace", os.getenv("HF_TOKEN")),
         ("OpenAI", os.getenv("OPENAI_API_KEY")),
         ("OpenWeather", os.getenv("OPENWEATHER_API_KEY"))
     ]
     available = [name for name, key in api_keys if key]
     if available:
-        print(f"✅ Available LLMs: {', '.join(available)}")
     else:
-        print("❌ No LLM API keys found!")
     print("="*60 + "\n")

 import re
 import string
 from typing import List, Dict, Any, Optional
+import warnings
+warnings.filterwarnings("ignore", category=RuntimeWarning, module="asyncio")
 # Logging setup
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    datefmt='%H:%M:%S'
+)
 logger = logging.getLogger(__name__)
 # Constants
 GAIA_API_URL = "https://agents-course-unit4-scoring.hf.space"
 PASSING_SCORE = 30
+# GAIA System Prompt - for intelligent reasoning and tool use
+GAIA_SYSTEM_PROMPT = """You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
+IMPORTANT: You have extensive knowledge up to January 2025. For most questions, try to answer from your knowledge FIRST. Only use web_search when:
+1. The question asks for current/recent information (after January 2025)
+2. You're unsure and need to verify facts
+3. The question explicitly asks to search or look up information
+4. The question is about real-time data (weather, stock prices, current events)
+Always use the calculator tool for ANY mathematical computation, even simple ones."""
 def setup_llm():
     """Initialize the best available LLM"""
     if api_key := os.getenv("GROQ_API_KEY"):
         try:
             from llama_index.llms.groq import Groq
             llm = Groq(
                 api_key=api_key,
+                model="llama-3.3-70b-versatile",  # Correct model name
                 temperature=0.0,
                 max_tokens=2048
             )
+            logger.info("✅ Using Groq Llama 3.3 70B")
             return llm
         except Exception as e:
             logger.warning(f"Groq setup failed: {e}")
     if api_key := os.getenv("TOGETHER_API_KEY"):
         try:
+            from llama_index.llms.together import TogetherLLM
+            llm = TogetherLLM(
                 api_key=api_key,
+                model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",  # Correct Together model
                 temperature=0.0,
                 max_tokens=2048
             )
+            logger.info("✅ Using Together AI Llama 3.1 70B")
             return llm
         except Exception as e:
             logger.warning(f"Together setup failed: {e}")
 def extract_final_answer(response_text: str) -> str:
     """Extract answer aligned with GAIA scoring rules"""
+    # Look for FINAL ANSWER pattern
     match = re.search(r"FINAL ANSWER:\s*(.+?)(?:\n|$)", response_text, re.IGNORECASE | re.DOTALL)
     if not match:
+        # Fallback: look for answer at the end of response
+        lines = response_text.strip().split('\n')
+        if lines:
+            # Check if last line looks like an answer
+            last_line = lines[-1].strip()
+            if len(last_line) < 100 and not last_line.startswith(('I', 'The', 'To', 'Based')):
+                answer = last_line
+            else:
+                logger.warning("No FINAL ANSWER found")
+                return ""
+        else:
+            return ""
+    else:
+        answer = match.group(1).strip()
+    # Remove any trailing punctuation that's not part of the answer
+    answer = answer.rstrip('.')
     # Clean for GAIA scoring
+    # 1. Handle numbers with more precision
+    if re.match(r'^[\d\s.,\-+e]+$', answer):
+        # Remove all formatting
+        cleaned = answer.replace(',', '').replace(' ', '')
         try:
+            # Try to parse as float
             num = float(cleaned)
+            # Return integer if whole number, otherwise keep precision
+            if num.is_integer():
+                return str(int(num))
+            else:
+                # Keep original precision, don't round
+                return str(num)
+        except ValueError:
+            pass
+    # 2. Handle percentages (remove % sign)
+    if answer.endswith('%'):
+        answer = answer[:-1].strip()
+        try:
+            num = float(answer)
             return str(int(num)) if num.is_integer() else str(num)
         except:
             pass
+    # 3. Lists: clean and standardize
+    if ',' in answer or ' and ' in answer.lower():
+        # Split on commas and 'and'
+        parts = re.split(r',|\s+and\s+', answer)
+        cleaned_parts = []
+        for part in parts:
+            part = part.strip()
+            if not part:
+                continue
             # Try to parse as number
             try:
+                num = float(part.replace('$', '').replace('%', '').replace(',', ''))
+                cleaned_parts.append(str(int(num)) if num.is_integer() else str(num))
             except:
+                # Remove articles from strings
+                words = part.split()
+                if words and words[0].lower() in ['the', 'a', 'an']:
+                    cleaned_parts.append(' '.join(words[1:]))
+                else:
+                    cleaned_parts.append(part)
+        return ', '.join(cleaned_parts)
+    # 4. Yes/No answers
     if answer.lower() in ['yes', 'no']:
         return answer.lower()
+    # 5. Single words/phrases: remove articles
     words = answer.split()
     if words and words[0].lower() in ['the', 'a', 'an']:
         return ' '.join(words[1:])
     return answer
 class GAIAAgent:
     """GAIA RAG Agent using LlamaIndex AgentWorkflow"""
     def __init__(self):
         logger.info("Initializing GAIA RAG Agent...")
+        # Skip persona RAG for faster GAIA evaluation
+        os.environ["SKIP_PERSONA_RAG"] = "true"
         # Initialize LLM
         self.llm = setup_llm()
         """Process a question and return clean answer for course submission"""
         logger.info(f"Processing question: {question[:100]}...")
+        import warnings
+        warnings.filterwarnings("ignore", category=RuntimeWarning, message=".*Event loop is closed.*")
         try:
+            # Create new event loop for async operations
             loop = asyncio.new_event_loop()
             asyncio.set_event_loop(loop)
             try:
                 async def run_agent():
+                    # Track what happened during execution
+                    tool_calls = []
+                    response_chunks = []
+                    try:
+                        # Start the agent workflow
+                        handler = self.agent.run(user_msg=question)
+                        # IMPORTANT: Process events WITHOUT consuming them
+                        # We need to collect BOTH tool usage AND response content
+                        from llama_index.core.agent.workflow import ToolCallResult
+                        # Stream events and collect information
+                        async for event in handler.stream_events():
+                            # Log tool usage
+                            if isinstance(event, ToolCallResult):
+                                tool_info = f"{event.tool_name}: {str(event.result)[:100]}..."
+                                tool_calls.append(tool_info)
+                                logger.info(f"Tool used: {tool_info}")
+                            # Also collect any text responses
+                            # Different event types might have content in different attributes
+                            if hasattr(event, 'delta'):
+                                response_chunks.append(str(event.delta))
+                            elif hasattr(event, 'content'):
+                                response_chunks.append(str(event.content))
+                            elif hasattr(event, 'response'):
+                                response_chunks.append(str(event.response))
+                        # Get the final result after streaming
+                        result = await handler
+                        # Extract the final response text
+                        # Priority: accumulated chunks > result.response > str(result)
+                        if response_chunks:
+                            response_text = ''.join(response_chunks)
+                        elif hasattr(result, 'response'):
+                            response_text = str(result.response)
+                        else:
+                            response_text = str(result)
+                        # Log what tools were used for debugging
+                        if tool_calls:
+                            logger.info(f"Tools used in this query: {', '.join(set(tool_calls))}")
+                        # CRITICAL: Check if we got a meaningful response
+                        # This prevents infinite loops
+                        if not response_text or len(response_text.strip()) < 10:
+                            logger.warning("Got empty or too short response from agent")
+                            # Return a fallback response
+                            return "FINAL ANSWER: Unable to determine answer"
+                        return response_text
+                    except asyncio.TimeoutError:
+                        # Prevent infinite waiting
+                        logger.error("Agent timeout - preventing infinite loop")
+                        return "FINAL ANSWER: Request timeout"
+                    except Exception as e:
+                        logger.error(f"Agent execution error: {e}")
+                        # Return structured error response
+                        return "FINAL ANSWER: Error occurred"
+                # Run with timeout to prevent infinite loops
+                response_text = loop.run_until_complete(
+                    asyncio.wait_for(run_agent(), timeout=120)  # 2 minute timeout
+                )
+                # Extract clean answer
                 clean_answer = extract_final_answer(response_text)
+                # VALIDATION: Ensure we have a valid answer
+                if not clean_answer:
+                    logger.warning("No answer extracted, using fallback")
+                    # Try to extract any number or short phrase from response
+                    # This prevents returning empty string to GAIA
+                    numbers = re.findall(r'\b\d+\.?\d*\b', response_text)
+                    if numbers:
+                        clean_answer = numbers[-1]  # Use last number found
+                    else:
+                        # Look for any short phrase that could be an answer
+                        sentences = response_text.split('.')
+                        for sent in reversed(sentences):
+                            sent = sent.strip()
+                            if 0 < len(sent) < 50 and not sent.startswith(('I', 'The', 'To')):
+                                clean_answer = sent
+                                break
+                logger.info(f"Full response preview: {response_text[:200]}...")
+                logger.info(f"Extracted answer: '{clean_answer}'")
                 return clean_answer
             finally:
+                # Always close the loop
                 loop.close()
         except Exception as e:
             logger.error(f"Error processing question: {e}")
+            # Never return empty string to GAIA - always return something
+            return "0"  # Safe fallback for math questions
 def run_and_submit_all(profile: gr.OAuthProfile | None):
     """Run GAIA evaluation following course template structure"""
         return error_msg, pd.DataFrame(results_log)
 # Gradio Interface
+with gr.Blocks(title="GAIA RAG Agent - Final Project") as demo:
+    gr.Markdown("# GAIA Smart RAG Agent - Final HF Agents Course Project")
+    gr.Markdown("### by Isadora Teles")
     gr.Markdown("""
+    This is a smart RAG agent for the GAIA benchmark that knows when to use its knowledge vs when to search.
     **Features:**
+    - 🧠 LlamaIndex AgentWorkflow with intelligent reasoning
+    - 💭 Answers from knowledge first (up to Jan 2025)
+    - 🔍 Google Search when needed (with DuckDuckGo fallback)
+    - 🧮 Calculator for all math problems
     - 📊 File analyzer for data questions
     - ✅ Clean answer extraction for exact match
+    **Smart Strategy:**
+    - Uses internal knowledge for facts it knows
+    - Only searches for current info or verification
+    - Prioritizes accuracy and efficiency
     **Instructions:**
     1. Log in with HuggingFace account
     2. Click 'Run Evaluation & Submit All Answers'
+    3. Wait for the agent to process all questions (3-5 minutes)
     4. Check your score!
     """)
     # Check API keys
     api_keys = [
         ("Groq", os.getenv("GROQ_API_KEY")),
+        ("Claude", os.getenv("ANTHROPIC_API_KEY") or os.getenv("CLAUDE_API_KEY")),
         ("Together", os.getenv("TOGETHER_API_KEY")),
         ("HuggingFace", os.getenv("HF_TOKEN")),
         ("OpenAI", os.getenv("OPENAI_API_KEY")),
+        ("Google Search", os.getenv("GOOGLE_API_KEY")),
         ("OpenWeather", os.getenv("OPENWEATHER_API_KEY"))
     ]
     available = [name for name, key in api_keys if key]
     if available:
+        print(f"✅ Available APIs: {', '.join(available)}")
     else:
+        print("❌ No API keys found!")
     print("="*60 + "\n")
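
A note on the timeout handling in the `app.py` diff above: `asyncio.wait_for(run_agent(), timeout=120)` imposes the timeout from outside the coroutine. When it fires, the wrapped coroutine is cancelled and `asyncio.TimeoutError` is raised at the `wait_for` call site, so an `except asyncio.TimeoutError` placed inside the wrapped coroutine would not normally see it. The sketch below demonstrates this behaviour; `slow_agent` and `run_with_timeout` are hypothetical stand-ins, not the real agent code.

```python
import asyncio


async def slow_agent() -> str:
    """Hypothetical stand-in for a long-running agent call."""
    try:
        await asyncio.sleep(1.0)
        return "FINAL ANSWER: done"
    except asyncio.TimeoutError:
        # Not reached: an outer wait_for cancels this coroutine
        # (CancelledError) instead of raising TimeoutError inside it.
        return "FINAL ANSWER: inner handler"


async def run_with_timeout(timeout: float) -> str:
    try:
        return await asyncio.wait_for(slow_agent(), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for raises here, at the call site.
        return "FINAL ANSWER: Request timeout"


result = asyncio.run(run_with_timeout(0.05))
print(result)
```

Under this reading, a timeout in the diff above would most likely be handled by the outer `except Exception` branch rather than by the inner timeout branch.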

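The fallback paths in `extract_final_answer` and in the answer validation step skip lines that start with `'I'`, `'The'`, or `'To'`. Because `str.startswith` compares raw prefixes, such a check also rejects short answers that merely begin with those letters. The sketch below contrasts that check with a first-word comparison; both helper names are hypothetical and the first-word variant is a suggestion, not the author's code.

```python
PREFIXES = ("I", "The", "To", "Based")


def rejected_by_prefix(line: str) -> bool:
    """Mirror of a raw startswith check against a tuple of prefixes."""
    return line.startswith(PREFIXES)


def rejected_by_first_word(line: str) -> bool:
    """Alternative: reject only when the first whole word matches."""
    words = line.split()
    return bool(words) and words[0] in PREFIXES


for candidate in ["Italy", "Tokyo", "Theodore Roosevelt", "I think it is 4"]:
    print(candidate, rejected_by_prefix(candidate), rejected_by_first_word(candidate))
```

With the raw prefix check, all four candidates are rejected; with the first-word check, only the sentence beginning with the word "I" is.
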
test_gaia_agent.py ADDED

@@ -0,0 +1,420 @@

+# test_gaia_agent.py
+"""
+Comprehensive test script for GAIA Agent
+Tests LLM, search, tools, and answer extraction
+Run with: python test_gaia_agent.py
+"""
+import os
+import sys
+import logging
+import asyncio
+import json
+from datetime import datetime
+from typing import Dict, List, Tuple
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    datefmt='%H:%M:%S'
+)
+logger = logging.getLogger(__name__)
+# Color codes for terminal output
+class Colors:
+    HEADER = '\033[95m'
+    OKBLUE = '\033[94m'
+    OKCYAN = '\033[96m'
+    OKGREEN = '\033[92m'
+    WARNING = '\033[93m'
+    FAIL = '\033[91m'
+    ENDC = '\033[0m'
+    BOLD = '\033[1m'
+    UNDERLINE = '\033[4m'
+def print_header(text: str):
+    print(f"\n{Colors.HEADER}{Colors.BOLD}{'='*60}{Colors.ENDC}")
+    print(f"{Colors.HEADER}{Colors.BOLD}{text.center(60)}{Colors.ENDC}")
+    print(f"{Colors.HEADER}{Colors.BOLD}{'='*60}{Colors.ENDC}\n")
+def print_test(name: str, status: bool, details: str = ""):
+    status_text = f"{Colors.OKGREEN}✓ PASS{Colors.ENDC}" if status else f"{Colors.FAIL}✗ FAIL{Colors.ENDC}"
+    print(f"{name:<40} {status_text}")
+    if details:
+        print(f"  {Colors.OKCYAN}→ {details}{Colors.ENDC}")
+def print_section(text: str):
+    print(f"\n{Colors.OKBLUE}{Colors.BOLD}{text}{Colors.ENDC}")
+    print(f"{Colors.OKBLUE}{'-'*40}{Colors.ENDC}")
+# Test 1: Environment and API Keys
+def test_environment():
+    print_section("Testing Environment Setup")
+    api_keys = {
+        "GROQ_API_KEY": "Groq (Primary LLM)",
+        "ANTHROPIC_API_KEY": "Anthropic Claude",
+        "TOGETHER_API_KEY": "Together AI",
+        "HF_TOKEN": "HuggingFace",
+        "OPENAI_API_KEY": "OpenAI",
+        "GOOGLE_API_KEY": "Google Search",
+        "GOOGLE_CSE_ID": "Google Custom Search Engine ID"
+    }
+    available = []
+    missing = []
+    for key, service in api_keys.items():
+        if os.getenv(key):
+            available.append(service)
+            print_test(f"{service} API Key", True, f"{key} is set")
+        else:
+            missing.append(service)
+            print_test(f"{service} API Key", False, f"{key} not found")
+    # Set SKIP_PERSONA_RAG for testing
+    os.environ["SKIP_PERSONA_RAG"] = "true"
+    print_test("SKIP_PERSONA_RAG set", True, "Persona RAG disabled for faster testing")
+    return len(available) > 0, available, missing
+# Test 2: LLM Initialization
+def test_llm_setup():
+    print_section("Testing LLM Setup")
+    try:
+        from app import setup_llm
+        llm = setup_llm()
+        print_test("LLM Initialization", True, f"Using {type(llm).__name__}")
+        # Test basic LLM call
+        try:
+            response = llm.complete("Say 'Hello World' and nothing else.")
+            response_text = str(response).strip()
+            success = "hello world" in response_text.lower()
+            print_test("LLM Basic Response", success, f"Response: {response_text[:50]}")
+            return True, llm
+        except Exception as e:
+            print_test("LLM Basic Response", False, f"Error: {str(e)[:100]}")
+            return False, None
+    except Exception as e:
+        print_test("LLM Initialization", False, f"Error: {str(e)[:100]}")
+        return False, None
+# Test 3: Web Search Functions
+def test_web_search():
+    print_section("Testing Web Search")
+    try:
+        from tools import search_web, _search_google, _search_duckduckgo
+        test_query = "Python programming language"
+        # Test Google Search
+        print("\nTesting Google Search...")
+        try:
+            google_result = _search_google(test_query)
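+            # _search_google reports failures as strings starting with "Google search" or "No Google search"
+            if google_result and not google_result.startswith(("Google search", "No Google search")):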
+                print_test("Google Search", True, f"Got {len(google_result)} chars")
+                print(f"  Preview: {google_result[:150]}...")
+            else:
+                print_test("Google Search", False, google_result[:100])
+        except Exception as e:
+            print_test("Google Search", False, str(e)[:100])
+        # Test DuckDuckGo Search
+        print("\nTesting DuckDuckGo Search...")
+        try:
+            ddg_result = _search_duckduckgo(test_query)
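+            # Also reject the "DuckDuckGo ..." / "No DuckDuckGo ..." failure strings that search_web treats as misses
+            if (ddg_result and "error" not in ddg_result.lower()
+                    and not ddg_result.startswith(("DuckDuckGo", "No DuckDuckGo"))):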
+                print_test("DuckDuckGo Search", True, f"Got {len(ddg_result)} chars")
+                print(f"  Preview: {ddg_result[:150]}...")
+            else:
+                print_test("DuckDuckGo Search", False, ddg_result[:100])
+        except Exception as e:
+            print_test("DuckDuckGo Search", False, str(e)[:100])
+        # Test Combined Search
+        print("\nTesting Combined Web Search...")
+        try:
+            result = search_web(test_query)
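+            # search_web returns a "Web search unavailable..." string when every backend fails,
+            # so treat that message as a failure instead of a long, error-free result
+            success = (bool(result) and len(result) > 50
+                       and "error" not in result.lower()
+                       and not result.startswith("Web search unavailable"))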
+            print_test("Combined Web Search", success, f"Got {len(result)} chars")
+            return success
+        except Exception as e:
+            print_test("Combined Web Search", False, str(e)[:100])
+            return False
+    except ImportError as e:
+        print_test("Import Tools Module", False, str(e))
+        return False
+# Test 4: Other Tools
+def test_tools():
+    print_section("Testing Other Tools")
+    try:
+        from tools import calculate, analyze_file, get_weather
+        # Test Calculator
+        calc_tests = [
+            ("2 + 2", "4"),
+            ("15% of 1000", "150"),
+            ("square root of 144", "12"),
+            ("4847 * 3291", "15951477"),
+        ]
+        calc_success = 0
+        for expr, expected in calc_tests:
+            try:
+                result = calculate(expr)
+                success = str(result) == expected
+                calc_success += success
+                print_test(f"Calculate: {expr}", success, f"Got {result}, expected {expected}")
+            except Exception as e:
+                print_test(f"Calculate: {expr}", False, str(e)[:50])
+        # Test File Analyzer
+        try:
+            csv_content = "name,age,score\nAlice,25,85\nBob,30,92"
+            result = analyze_file(csv_content, "csv")
+            success = "3" in result and "name" in result
+            print_test("File Analyzer (CSV)", success, "Basic CSV analysis works")
+        except Exception as e:
+            print_test("File Analyzer (CSV)", False, str(e)[:50])
+        # Test Weather
+        try:
+            result = get_weather("Paris")
+            success = "Temperature" in result and "°C" in result
+            print_test("Weather Tool", success, result.split('\n')[0])
+        except Exception as e:
+            print_test("Weather Tool", False, str(e)[:50])
+        return calc_success >= 3
+    except ImportError as e:
+        print_test("Import Tools", False, str(e))
+        return False
+# Test 5: Answer Extraction
+def test_answer_extraction():
+    print_section("Testing Answer Extraction")
+    try:
+        # Try importing just the function we need
+        import sys
+        import os
+        sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+        # Import the extract_final_answer function directly
+        from app import extract_final_answer
+        test_cases = [
+            # (input, expected)
+            ("The answer is 42. FINAL ANSWER: 42", "42"),
+            ("FINAL ANSWER: 15%", "15"),
+            ("Calculating... FINAL ANSWER: 3,456", "3456"),
+            ("FINAL ANSWER: Paris", "Paris"),
+            ("FINAL ANSWER: The Eiffel Tower", "Eiffel Tower"),
+            ("FINAL ANSWER: yes", "yes"),
+            ("FINAL ANSWER: 1, 2, 3, 4, 5", "1, 2, 3, 4, 5"),
+            ("Some text FINAL ANSWER: $1,234.56", "1234.56"),
+            ("No final answer marker here", ""),
+        ]
+        success_count = 0
+        for input_text, expected in test_cases:
+            result = extract_final_answer(input_text)
+            success = result == expected
+            success_count += success
+            print_test(
+                f"Extract: {expected or '(empty)'}",
+                success,
+                f"Got '{result}'" if not success else ""
+            )
+        return success_count >= len(test_cases) - 2
+    except ImportError as e:
+        # If import fails, try a minimal test
+        print_test("Answer Extraction Import", False, f"Import error: {str(e)[:100]}")
+        # Create a minimal version for testing
+        def extract_final_answer_minimal(text):
+            import re
+            match = re.search(r"FINAL ANSWER:\s*(.+?)(?:\n|$)", text, re.IGNORECASE)
+            return match.group(1).strip() if match else ""
+        # Test with minimal version
+        test_text = "The answer is FINAL ANSWER: 42"
+        result = extract_final_answer_minimal(test_text)
+        success = result == "42"
+        print_test("Minimal Extraction Test", success, f"Got '{result}'")
+        return success
+    except Exception as e:
+        print_test("Answer Extraction", False, str(e))
+        return False
+# Test 6: Full Agent Test
+def test_gaia_agent(llm):
+    print_section("Testing GAIA Agent")
+    try:
+        # Import here to ensure environment is set up
+        from app import GAIAAgent
+        # Initialize agent
+        print("Initializing GAIA Agent...")
+        agent = GAIAAgent()
+        print_test("Agent Initialization", True, "Agent created successfully")
+        # Test questions matching GAIA style
+        test_questions = [
+            # (question, expected_answer_pattern, description)
+            ("What is 2 + 2?", r"^4$", "Simple math"),
+            ("Calculate 15% of 1200", r"^180$", "Percentage calculation"),
+            ("What is the capital of France?", r"(?i)paris", "Factual question"),
+            ("Is 17 a prime number? Answer yes or no.", r"(?i)yes", "Yes/no question"),
+            ("List the first 3 prime numbers", r"2.*3.*5", "List question"),
+        ]
+        print("\nRunning test questions...")
+        success_count = 0
+        for question, pattern, description in test_questions:
+            print(f"\n{Colors.BOLD}Q: {question}{Colors.ENDC}")
+            try:
+                answer = agent(question)
+                print(f"A: '{answer}'")
+                import re
+                matches = bool(re.search(pattern, answer))
+                success_count += matches
+                print_test(f"{description}", matches,
+                          f"Expected pattern: {pattern}" if not matches else "")
+            except Exception as e:
+                print_test(f"{description}", False, f"Error: {str(e)[:50]}")
+                print(f"{Colors.WARNING}Full error: {e}{Colors.ENDC}")
+        return success_count >= 3
+    except Exception as e:
+        print_test("GAIA Agent", False, f"Error: {str(e)}")
+        import traceback
+        print(f"{Colors.WARNING}Full traceback:{Colors.ENDC}")
+        traceback.print_exc()
+        return False
+# Test 7: GAIA API Integration
+def test_gaia_api():
+    print_section("Testing GAIA API Connection")
+    try:
+        import requests
+        from app import GAIA_API_URL
+        # Test questions endpoint
+        try:
+            response = requests.get(f"{GAIA_API_URL}/questions", timeout=10)
+            if response.status_code == 200:
+                questions = response.json()
+                print_test("GAIA API Questions", True, f"Got {len(questions)} questions")
+                # Show sample question
+                if questions:
+                    sample = questions[0]
+                    print(f"  Sample task_id: {sample.get('task_id', 'N/A')}")
+                    q_text = sample.get('question', '')[:100]
+                    print(f"  Sample question: {q_text}...")
+                return True
+            else:
+                print_test("GAIA API Questions", False, f"HTTP {response.status_code}")
+                return False
+        except Exception as e:
+            print_test("GAIA API Questions", False, str(e)[:100])
+            return False
+    except Exception as e:
+        print_test("GAIA API Test", False, str(e))
+        return False
+# Main test runner
+def main():
+    print_header("GAIA Agent Local Test Suite")
+    # Track overall results
+    results = {
+        "Environment": False,
+        "LLM": False,
+        "Web Search": False,
+        "Tools": False,
+        "Answer Extraction": False,
+        "Agent": False,
+        "API": False
+    }
+    # Run tests
+    env_ok, available, missing = test_environment()
+    results["Environment"] = env_ok
+    if not env_ok:
+        print(f"\n{Colors.FAIL}No API keys found! Please set at least one of:{Colors.ENDC}")
+        for m in missing:
+            print(f"  - {m}")
+        print("\nExample:")
+        print("  export GROQ_API_KEY='your-key-here'")
+        return
+    # Test LLM
+    llm_ok, llm = test_llm_setup()
+    results["LLM"] = llm_ok
+    # Test other components
+    results["Web Search"] = test_web_search()
+    results["Tools"] = test_tools()
+    results["Answer Extraction"] = test_answer_extraction()
+    # Only test agent if LLM works
+    if llm_ok:
+        results["Agent"] = test_gaia_agent(llm)
+    # Test API connection
+    results["API"] = test_gaia_api()
+    # Summary
+    print_header("Test Summary")
+    passed = sum(1 for v in results.values() if v)
+    total = len(results)
+    for component, status in results.items():
+        print_test(component, status)
+    print(f"\n{Colors.BOLD}Overall: {passed}/{total} components working{Colors.ENDC}")
+    if passed == total:
+        print(f"{Colors.OKGREEN}✨ All tests passed! Your agent is ready for GAIA evaluation.{Colors.ENDC}")
+    elif passed >= total - 2:
+        print(f"{Colors.WARNING}⚠️  Most components working. Check failed components above.{Colors.ENDC}")
+    else:
+        print(f"{Colors.FAIL}❌ Several components failing. Fix issues before running GAIA evaluation.{Colors.ENDC}")
+    # Recommendations
+    if not results["Web Search"]:
+        print(f"\n{Colors.WARNING}Tip: Web search is important for GAIA. Check your GOOGLE_API_KEY.{Colors.ENDC}")
+    if not results["Agent"]:
+        print(f"\n{Colors.WARNING}Tip: Agent not working. Check LLM setup and tool integration.{Colors.ENDC}")
+if __name__ == "__main__":
+    main()
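
Reviewer note: the cases in `test_answer_extraction` pin down the normalization the suite expects from `extract_final_answer`: strip `$`, `%` and thousands separators from a single number, drop a leading article, and leave comma-separated lists untouched. The real implementation lives in `app.py` and is not shown in this file, so the block below is only a minimal sketch of rules that would satisfy those nine cases. The name `extract_final_answer_sketch` and all three regexes are assumptions for illustration, not the author's code.

```python
import re

# Everything after the "FINAL ANSWER:" marker, up to the end of that line.
_MARKER = re.compile(r"FINAL ANSWER:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)
# A single number, optionally wrapped in "$" / "%" and using "," as thousands separator.
_NUMERIC = re.compile(r"^\$?\d[\d,]*(?:\.\d+)?%?$")
# A leading English article followed by whitespace.
_ARTICLE = re.compile(r"^(?:the|a|an)\s+", re.IGNORECASE)


def extract_final_answer_sketch(text: str) -> str:
    """Hypothetical stand-in for app.extract_final_answer (assumed behaviour)."""
    match = _MARKER.search(text)
    if not match:
        return ""  # no marker: the suite expects an empty string
    answer = match.group(1).strip()
    if _NUMERIC.match(answer):
        # Single number: drop separators and unit symbols.
        return answer.replace(",", "").strip("$%")
    # Lists and free text: only remove a leading article.
    return _ARTICLE.sub("", answer)


if __name__ == "__main__":
    for raw in ("FINAL ANSWER: $1,234.56", "FINAL ANSWER: The Eiffel Tower",
                "FINAL ANSWER: 1, 2, 3, 4, 5"):
        print(repr(extract_final_answer_sketch(raw)))
```

Checking "is this one number?" before stripping commas is what keeps `1, 2, 3, 4, 5` intact while `3,456` still collapses to `3456`.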

test_google_search.py ADDED

@@ -0,0 +1,143 @@

+#!/usr/bin/env python3
+"""
+Quick test for Google Search functionality
+Run this to verify your Google API key and CSE ID are working
+"""
+import os
+import sys
+import requests
+import logging
+# Set up logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+def test_google_search():
+    """Test Google Custom Search API"""
+    print("🔍 Testing Google Search Configuration\n")
+    # Check for API key
+    api_key = os.getenv("GOOGLE_API_KEY")
+    if not api_key:
+        print("❌ GOOGLE_API_KEY not found in environment")
+        print("   Set it with: export GOOGLE_API_KEY=your_key_here")
+        return False
+    print("✅ Google API key found")
+    # CSE ID from the environment, falling back to the hardcoded default
+    cse_id = os.getenv("GOOGLE_CSE_ID", "746382dd3c2bd4135")
+    print(f"✅ Using CSE ID: {cse_id}")
+    # Test query
+    test_query = "GAIA benchmark AI"
+    print(f"\nTesting search for: '{test_query}'")
+    # Make API call
+    url = "https://www.googleapis.com/customsearch/v1"
+    params = {
+        "key": api_key,
+        "cx": cse_id,
+        "q": test_query,
+        "num": 3
+    }
+    try:
+        print("Calling Google API...")
+        response = requests.get(url, params=params, timeout=10)
+        print(f"Response status: {response.status_code}")
+        if response.status_code == 200:
+            data = response.json()
+            # Check search info
+            search_info = data.get("searchInformation", {})
+            total_results = search_info.get("totalResults", "0")
+            search_time = search_info.get("searchTime", "0")
+            print("\n✅ Search successful!")
+            print(f"   Total results: {total_results}")
+            print(f"   Search time: {search_time}s")
+            # Show results
+            items = data.get("items", [])
+            if items:
+                print(f"\nFound {len(items)} results:")
+                for i, item in enumerate(items, 1):
+                    print(f"\n{i}. {item.get('title', 'No title')}")
+                    print(f"   {item.get('snippet', 'No snippet')[:100]}...")
+                    print(f"   {item.get('link', 'No link')}")
+            else:
+                print("\n⚠️  No results returned (but API is working)")
+            # Echo the request metadata (these fields describe the query, not quota usage)
+            if "queries" in data:
+                queries = data["queries"]["request"][0]
+                print("\n📊 Request metadata:")
+                print(f"   Results returned: {queries.get('count', 'unknown')}")
+                print(f"   Total results: {queries.get('totalResults', 'unknown')}")
+            return True
+        else:
+            # Error response
+            print(f"\n❌ API Error (HTTP {response.status_code})")
+            try:
+                error_data = response.json()
+                error = error_data.get("error", {})
+                print(f"   Code: {error.get('code', 'unknown')}")
+                print(f"   Message: {error.get('message', 'unknown')}")
+                # Common errors
+                if response.status_code == 403:
+                    print("\n🔧 Possible fixes:")
+                    print("   1. Check your API key is correct")
+                    print("   2. Enable 'Custom Search API' in Google Cloud Console")
+                    print("   3. Check your quota hasn't been exceeded")
+                elif response.status_code == 400:
+                    print("\n🔧 Possible fixes:")
+                    print("   1. Check your CSE ID is correct")
+                    print("   2. Verify your search engine is set up properly")
+            except ValueError:  # error body was not valid JSON
+                print(f"   Raw response: {response.text[:200]}")
+            return False
+    except requests.exceptions.Timeout:
+        print("\n❌ Request timed out")
+        return False
+    except requests.exceptions.ConnectionError:
+        print("\n❌ Connection error - check your internet")
+        return False
+    except Exception as e:
+        print(f"\n❌ Unexpected error: {type(e).__name__}: {e}")
+        return False
+def main():
+    """Run the test"""
+    print("="*60)
+    print("Google Custom Search API Test")
+    print("="*60)
+    success = test_google_search()
+    print("\n" + "="*60)
+    if success:
+        print("✅ Google Search is working correctly!")
+        print("Your GAIA agent should be able to search the web.")
+    else:
+        print("❌ Google Search is not working")
+        print("Fix the issues above before running the GAIA agent.")
+        print("\nThe agent will fall back to DuckDuckGo if available.")
+    print("="*60)
+    return 0 if success else 1
+if __name__ == "__main__":
+    sys.exit(main())
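
Reviewer note: the script above only exercises its result-printing path when a live key is available. One way to check the parsing without spending API quota is to run the same field lookups (`items`, `title`, `snippet`, `link`) against a canned payload. The sketch below assumes a hypothetical helper, `format_search_items`, and a hand-written `SAMPLE_RESPONSE` whose values are made up; neither exists in the repository.

```python
from typing import Any

# Hand-written payload using the fields the script reads; the values are invented.
SAMPLE_RESPONSE: dict[str, Any] = {
    "searchInformation": {"totalResults": "2"},
    "items": [
        {
            "title": "First hit",
            "snippet": "Some   text\nwith breaks",
            "link": "https://example.com/a",
        },
        {"title": "Second hit", "link": "https://example.com/b"},  # snippet missing
    ],
}


def format_search_items(data: dict[str, Any], limit: int = 3) -> str:
    """Render up to `limit` items as numbered title/snippet/link blocks."""
    blocks = []
    for i, item in enumerate(data.get("items", [])[:limit], 1):
        title = item.get("title", "No title")
        # Collapse runs of whitespace so each snippet stays on one line.
        snippet = " ".join(item.get("snippet", "No snippet").split())
        link = item.get("link", "No link")
        blocks.append(f"{i}. {title}\n{snippet}\nSource: {link}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(format_search_items(SAMPLE_RESPONSE))
```

Because the helper takes a plain dict, the same function could be fed `response.json()` from a live call, which keeps the network code and the formatting code testable separately.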

test_local.py DELETED

@@ -1,216 +0,0 @@
-"""
-Test GAIA Agent Locally
-Complete testing script for your GAIA RAG agent
-"""
-import os
-import json
-import asyncio
-from app import GAIAAgent
-def test_gaia_agent():
-    """Test the GAIA agent with sample questions"""
-    print("🧪 Testing GAIA RAG Agent\n")
-    # Check API keys
-    api_keys = {
-        "Claude": os.getenv("ANTHROPIC_API_KEY") or os.getenv("CLAUDE_API_KEY"),
-        "Groq": os.getenv("GROQ_API_KEY"),
-        "Together": os.getenv("TOGETHER_API_KEY"),
-        "HuggingFace": os.getenv("HF_TOKEN"),
-        "OpenAI": os.getenv("OPENAI_API_KEY")
-    }
-    available = [name for name, key in api_keys.items() if key]
-    if not available:
-        print("❌ No API keys found!")
-        print("Set one of these environment variables:")
-        print("  export GROQ_API_KEY=your_key")
-        print("  export ANTHROPIC_API_KEY=your_key")
-        print("  export TOGETHER_API_KEY=your_key")
-        print("  export HF_TOKEN=your_key")
-        return
-    print(f"✅ Available LLMs: {', '.join(available)}\n")
-    # GAIA-style test questions
-    test_questions = [
-        {"task_id": "test_001", "question": "What is 25 * 17?"},
-        {"task_id": "test_002", "question": "What is the opposite of left?"},
-        {"task_id": "test_003", "question": "How many planets are in our solar system?"},
-        {"task_id": "test_004", "question": "Is Paris the capital of France?"},
-        {"task_id": "test_005", "question": "What is 15% of 1000?"},
-        {"task_id": "test_006", "question": "List the primary colors"},
-        {"task_id": "test_007", "question": "What is the square root of 144?"},
-        {"task_id": "test_008", "question": "How many days are in a week?"}
-    ]
-    # Initialize agent
-    try:
-        print("Initializing GAIA agent...")
-        agent = GAIAAgent()
-        print("✅ Agent ready!\n")
-    except Exception as e:
-        print(f"❌ Failed to create agent: {e}")
-        return
-    # Test each question
-    answers_for_submission = []
-    correct_count = 0
-    print("Running test questions:\n")
-    print("-" * 60)
-    for item in test_questions:
-        task_id = item["task_id"]
-        question = item["question"]
-        print(f"Q: {question}")
-        try:
-            # Get answer
-            answer = agent(question)
-            # Format for submission
-            answers_for_submission.append({
-                "task_id": task_id,
-                "submitted_answer": answer
-            })
-            print(f"A: {answer}")
-            # Check against expected answers
-            expected = get_expected_answer(question)
-            if expected and answer == expected:
-                print("✅ Correct!")
-                correct_count += 1
-            elif expected:
-                print(f"❌ Expected: {expected}")
-            print("-" * 60)
-        except Exception as e:
-            print(f"Error: {e}")
-            answers_for_submission.append({
-                "task_id": task_id,
-                "submitted_answer": ""
-            })
-            print("-" * 60)
-    # Show submission format
-    print("\n" + "="*60)
-    print("SUBMISSION FORMAT (what gets sent to GAIA):")
-    print(json.dumps(answers_for_submission, indent=2))
-    # Save to file
-    with open("test_submission.json", "w") as f:
-        json.dump(answers_for_submission, f, indent=2)
-    print("\n✅ Saved to test_submission.json")
-    # Summary
-    print(f"\nTest Results: {correct_count}/{len(test_questions)} correct")
-    print(f"Expected score: {correct_count/len(test_questions)*100:.1f}%")
-def get_expected_answer(question):
-    """Get expected answer for test questions"""
-    expected = {
-        "What is 25 * 17?": "425",
-        "What is the opposite of left?": "right",
-        "How many planets are in our solar system?": "8",
-        "Is Paris the capital of France?": "yes",
-        "What is 15% of 1000?": "150",
-        "List the primary colors": "red, blue, yellow",
-        "What is the square root of 144?": "12",
-        "How many days are in a week?": "7"
-    }
-    return expected.get(question)
-def test_tools_only():
-    """Test individual tools"""
-    print("\n🔧 Testing Individual Tools\n")
-    from tools import calculate, search_web, analyze_file, get_weather
-    # Test calculator
-    print("Calculator Tests:")
-    test_calcs = [
-        ("10 + 10", "20"),
-        ("sqrt(144)", "12"),
-        ("15% of 1000", "150"),
-        ("25 * 17", "425")
-    ]
-    for expr, expected in test_calcs:
-        result = calculate(expr)
-        status = "✅" if result == expected else "❌"
-        print(f"  {status} {expr} = {result} (expected: {expected})")
-    # Test file analyzer
-    print("\nFile Analyzer Test:")
-    csv_data = "product,price,quantity\nApple,1.50,100\nBanana,0.80,150"
-    result = analyze_file(csv_data, "csv")
-    print(result)
-    # Test weather
-    print("\nWeather Test:")
-    result = get_weather("New York")
-    print(result)
-    # Test web search (if available)
-    print("\nWeb Search Test:")
-    try:
-        result = search_web("capital of France")
-        print(f"Found: {result[:200]}...")
-    except Exception as e:
-        print(f"Web search not available: {e}")
-def test_answer_extraction():
-    """Test GAIA-compliant answer extraction"""
-    print("\n📝 Testing Answer Extraction\n")
-    from app import extract_final_answer
-    test_cases = [
-        ("I calculated it.\n\nFINAL ANSWER: 425", "425"),
-        ("The answer is:\n\nFINAL ANSWER: $1,500", "1500"),
-        ("After analysis:\n\nFINAL ANSWER: yes", "yes"),
-        ("The result:\n\nFINAL ANSWER: red, blue, yellow", "red, blue, yellow"),
-        ("FINAL ANSWER: The Paris", "Paris"),
-        ("FINAL ANSWER: 25%", "25")
-    ]
-    print("Testing GAIA answer extraction:")
-    for response, expected in test_cases:
-        extracted = extract_final_answer(response)
-        status = "✅" if extracted == expected else "❌"
-        print(f"{status} '{response[:30]}...' → '{extracted}' (expected: '{expected}')")
-def main():
-    """Run all tests"""
-    print("="*60)
-    print("GAIA RAG Agent - Complete Testing Suite")
-    print("="*60)
-    # Test components
-    test_answer_extraction()
-    test_tools_only()
-    # Test full agent
-    print("\n" + "="*60)
-    test_gaia_agent()
-    print("\n✅ Testing complete!")
-    print("\nNext steps:")
-    print("1. Review test_submission.json")
-    print("2. Fix any failing tests")
-    print("3. Deploy to HuggingFace Space")
-    print("4. Run the real GAIA evaluation")
-if __name__ == "__main__":
-    main()

tools.py CHANGED

@@ -4,55 +4,243 @@ Includes web search, calculator, file analyzer, weather, and persona RAG
 """
 import os
 import logging
 import math
 import re
 from typing import List, Optional
 from llama_index.core.tools import FunctionTool, QueryEngineTool
 logger = logging.getLogger(__name__)
 # ==========================================
-# Core Tool Functions
 # ==========================================
 def search_web(query: str) -> str:
     """
-    Search the web for current information using DuckDuckGo.
-    Returns concise, relevant results.
     """
-    logger.info(f"Searching web for: {query}")
     try:
-        from duckduckgo_search import DDGS
-        with DDGS() as ddgs:
-            results = list(ddgs.text(query, max_results=3))
-            if not results:
-                return "No search results found."
-            # Format results concisely for GAIA
-            formatted_results = []
-            for i, result in enumerate(results, 1):
-                title = result.get('title', '')
-                body = result.get('body', '')
-                url = result.get('href', '')
-                # Clean and truncate body
-                clean_body = ' '.join(body.split())[:200]
-                formatted_results.append(f"{i}. {title}\n{clean_body}\nSource: {url}")
-            return "\n\n".join(formatted_results)
     except ImportError:
         logger.error("duckduckgo_search not installed")
-        return "Web search unavailable - package not installed"
     except Exception as e:
-        logger.error(f"Search error: {e}")
-        return f"Search failed: {str(e)}"
 def calculate(expression: str) -> str:
     """
@@ -80,6 +268,14 @@ def calculate(expression: str) -> str:
                 result = (percentage / 100) * number
                 return str(int(result) if result.is_integer() else round(result, 6))
         # Handle word numbers
         word_to_num = {
             'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
@@ -94,12 +290,12 @@ def calculate(expression: str) -> str:
         for word, num in word_to_num.items():
             expr = re.sub(rf'\b{word}\b', num, expr, flags=re.IGNORECASE)
-        # Replace math words
         math_replacements = {
             r'\bplus\b': '+', r'\bminus\b': '-', r'\btimes\b': '*',
             r'\bmultiplied by\b': '*', r'\bdivided by\b': '/', r'\bover\b': '/',
             r'\bsquared\b': '**2', r'\bcubed\b': '**3',
-            r'\bto the power of\b': '**', r'\bsquare root of\b': 'sqrt'
         }
         for pattern, replacement in math_replacements.items():
@@ -132,7 +328,8 @@ def calculate(expression: str) -> str:
     except Exception as e:
         logger.error(f"Calculation error: {e}")
         return "0"
 def analyze_file(content: str, file_type: str = "text") -> str:
     """
     Analyze file contents, especially CSV files.
@@ -341,6 +538,10 @@ def create_simple_persona_engine(llm):
 # Tool Creation
 # ==========================================
 def get_gaia_tools(llm=None):
     """
     Get all tools needed for GAIA evaluation.
@@ -355,12 +556,18 @@ def get_gaia_tools(llm=None):
         FunctionTool.from_defaults(
             fn=search_web,
             name="web_search",
-            description="Search the web for current information, facts, news, or any data not in the knowledge base. Use for questions requiring up-to-date information."
         ),
         FunctionTool.from_defaults(
             fn=calculate,
             name="calculator",
-            description="Perform mathematical calculations including arithmetic, percentages, and advanced math functions. ALWAYS use this for ANY mathematical computation."
         ),
         FunctionTool.from_defaults(
             fn=analyze_file,
@@ -370,7 +577,7 @@ def get_gaia_tools(llm=None):
         FunctionTool.from_defaults(
             fn=get_weather,
             name="weather",
-            description="Get current weather information for any location."
         )
     ]

 """
 import os
+import requests
 import logging
 import math
 import re
 from typing import List, Optional
 from llama_index.core.tools import FunctionTool, QueryEngineTool
+# Set up better logging
 logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
 # ==========================================
+# Web Search Functions
 # ==========================================
 def search_web(query: str) -> str:
     """
+    Search the web for current information, verification, or when explicitly needed.
+    Prioritizes Google Search, then DuckDuckGo as fallback.
     """
+    logger.info(f"Web search requested for: {query}")
+    # Try Google Custom Search first
+    google_result = _search_google(query)
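+    # "No Google search results found..." is also a miss, so fall through to DuckDuckGo for it
+    if google_result and not google_result.startswith(("Google search", "No Google search")):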
+        logger.info("Google search successful")
+        return google_result
+    # Fallback to DuckDuckGo
+    logger.info("Trying DuckDuckGo as fallback...")
+    ddg_result = _search_duckduckgo(query)
+    # "No DuckDuckGo results found" is a miss too, not a usable result
+    if ddg_result and not ddg_result.startswith(("DuckDuckGo", "No DuckDuckGo")):
+        return ddg_result
+    # If all searches fail
+    logger.warning("All web search methods failed")
+    return "Web search unavailable. Please answer based on knowledge up to January 2025."
+def _search_google(query: str) -> str:
+    """Search using Google Custom Search API"""
+    api_key = os.getenv("GOOGLE_API_KEY")
+    # Read the CSE ID from the environment, falling back to the hardcoded default
+    cx = os.getenv("GOOGLE_CSE_ID", "746382dd3c2bd4135")  # Your custom search engine ID
+    if not api_key:
+        logger.info("Google API key not found")
+        return "Google search not configured - no API key"
     try:
+        url = "https://www.googleapis.com/customsearch/v1"
+        params = {
+            "key": api_key,
+            "cx": cx,
+            "q": query,
+            "num": 5  # Get more results for better coverage
+        }
+        logger.info(f"Calling Google Search API for: {query}")
+        logger.debug(f"Using CSE ID: {cx}")
+        response = requests.get(url, params=params, timeout=10)
+        # Log response status for debugging
+        logger.info(f"Google API response status: {response.status_code}")
+        if response.status_code != 200:
+            error_data = response.json() if response.text else {}
+            error_msg = error_data.get('error', {}).get('message', 'Unknown error')
+            logger.error(f"Google API error: {error_msg}")
+            if response.status_code == 403:
+                return "Google search quota exceeded or API key invalid"
+            elif response.status_code == 400:
+                return f"Google search configuration error: {error_msg}"
+            else:
+                return f"Google search error (HTTP {response.status_code}): {error_msg}"
+        response.raise_for_status()
+        data = response.json()
+        items = data.get("items", [])
+        # Check if search returned results
+        total_results = data.get("searchInformation", {}).get("totalResults", "0")
+        logger.info(f"Google found {total_results} total results, returning {len(items)}")
+        if not items:
+            logger.warning("No Google search results found")
+            return "No Google search results found for this query"
+        # Format results with more context
+        formatted_results = []
+        for i, item in enumerate(items[:3], 1):
+            title = item.get("title", "")
+            snippet = item.get("snippet", "")
+            link = item.get("link", "")
+            # Clean up snippet
+            snippet = ' '.join(snippet.split())
+            formatted_results.append(f"{i}. {title}\n{snippet}\nSource: {link}")
+        return "\n\n".join(formatted_results)
+    except requests.exceptions.HTTPError as e:
+        logger.error(f"Google API HTTP error: {e}")
+        return f"Google search HTTP error: {e.response.status_code}"
+    except requests.exceptions.Timeout:
+        logger.error("Google API timeout")
+        return "Google search timeout - try again"
+    except requests.exceptions.ConnectionError:
+        logger.error("Google API connection error")
+        return "Google search connection error"
+    except Exception as e:
+        logger.error(f"Google search unexpected error: {type(e).__name__}: {e}")
+        return f"Google search failed: {str(e)[:100]}"
+def _search_duckduckgo(query: str) -> str:
+    """Search using DuckDuckGo with robust error handling"""
+    try:
+        from duckduckgo_search import DDGS
+        logger.info(f"Trying DuckDuckGo search for: {query}")
+        # Try with timeout and different methods
+        try:
+            with DDGS(timeout=10) as ddgs:
+                results = []
+                # Try instant answers first (often more reliable)
+                try:
+                    instant = ddgs.answers(query)
+                    if instant:
+                        for answer in instant[:1]:  # Just take first answer
+                            if answer.get('text'):
+                                results.append({
+                                    'title': 'Quick Answer',
+                                    'body': answer['text'],
+                                    'href': answer.get('url', 'DuckDuckGo Instant Answer')
+                                })
+                except Exception:
+                    pass  # instant answers are best-effort; fall through to text search
+                # Then try text search
+                try:
+                    # Try lite backend first (more reliable in HF Spaces)
+                    text_results = list(ddgs.text(query, max_results=3, backend="lite"))
+                    results.extend(text_results)
+                except Exception:
+                    # Fallback to API backend
+                    try:
+                        text_results = list(ddgs.text(query, max_results=3, backend="api"))
+                        results.extend(text_results)
+                    except Exception as e:
+                        logger.warning(f"DuckDuckGo text search failed on both backends: {e}")
+                if not results:
+                    logger.warning("No DuckDuckGo results found")
+                    return "No DuckDuckGo results found"
+                # Format results
+                formatted_results = []
+                for i, result in enumerate(results[:3], 1):
+                    title = result.get('title', '')
+                    body = result.get('body', '')
+                    url = result.get('href', '')
+                    # Clean body text
+                    clean_body = ' '.join(body.split())
+                    if len(clean_body) > 200:
+                        clean_body = clean_body[:200] + "..."
+                    formatted_results.append(f"{i}. {title}\n{clean_body}\nSource: {url}")
+                logger.info(f"DuckDuckGo returned {len(results)} results")
+                return "\n\n".join(formatted_results)
+        except Exception as e:
+            logger.warning(f"DuckDuckGo DDGS method failed: {e}")
+            # Fallback to direct API call (doesn't require auth)
+            import requests
+            response = requests.get(
+                "https://api.duckduckgo.com/",
+                params={
+                    "q": query,
+                    "format": "json",
+                    "no_html": "1",
+                    "skip_disambig": "1"
+                },
+                timeout=5
+            )
+            if response.status_code == 200:
+                data = response.json()
+                results = []
+                # Get instant answer
+                if data.get("AbstractText"):
+                    results.append(
+                        f"1. Quick Answer\n{data['AbstractText']}\n"
+                        f"Source: {data.get('AbstractURL', 'DuckDuckGo')}"
+                    )
+                # Get definition if available
+                if data.get("Definition"):
+                    results.append(
+                        f"{len(results)+1}. Definition\n{data['Definition']}\n"
+                        f"Source: {data.get('DefinitionURL', 'DuckDuckGo')}"
+                    )
+                # Get answer if available
+                if data.get("Answer"):
+                    results.append(
+                        f"{len(results)+1}. Answer\n{data['Answer']}\n"
+                        "Source: DuckDuckGo Instant Answer"
+                    )
+                if results:
+                    return "\n\n".join(results)
+                else:
+                    return "DuckDuckGo API returned no results"
+            else:
+                return f"DuckDuckGo API error: HTTP {response.status_code}"
     except ImportError:
         logger.error("duckduckgo_search not installed")
+        return "DuckDuckGo search unavailable - package not installed"
     except Exception as e:
+        logger.error(f"DuckDuckGo search error: {e}")
+        return f"DuckDuckGo search failed: {str(e)[:100]}"
+# ==========================================
+# Core Tool Functions
+# ==========================================
 def calculate(expression: str) -> str:
     """
                 result = (percentage / 100) * number
                 return str(int(result) if result.is_integer() else round(result, 6))
+        # Handle square root BEFORE other replacements
+        if 'square root' in expr.lower():
+            match = re.search(r'square root of\s*(\d+(?:\.\d+)?)', expr, re.IGNORECASE)
+            if match:
+                number = float(match.group(1))
+                result = math.sqrt(number)
+                return str(int(result) if result.is_integer() else round(result, 6))
         # Handle word numbers
         word_to_num = {
             'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
         for word, num in word_to_num.items():
             expr = re.sub(rf'\b{word}\b', num, expr, flags=re.IGNORECASE)
+        # Replace math words (but NOT square root anymore since we handled it)
         math_replacements = {
             r'\bplus\b': '+', r'\bminus\b': '-', r'\btimes\b': '*',
             r'\bmultiplied by\b': '*', r'\bdivided by\b': '/', r'\bover\b': '/',
             r'\bsquared\b': '**2', r'\bcubed\b': '**3',
+            r'\bto the power of\b': '**'
         }
         for pattern, replacement in math_replacements.items():
     except Exception as e:
         logger.error(f"Calculation error: {e}")
         return "0"
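Review note: the new square-root branch in `calculate` can be exercised in isolation. The sketch below reuses the regex from the diff in a standalone helper; the name `sqrt_from_phrase` is hypothetical, and rounding non-integer results to 6 decimals is an assumption that mirrors the percentage branch above it.

```python
import math
import re


def sqrt_from_phrase(expr):
    """Return the square root named in `expr` as a string, or None if absent."""
    match = re.search(r'square root of\s*(\d+(?:\.\d+)?)', expr, re.IGNORECASE)
    if not match:
        return None
    result = math.sqrt(float(match.group(1)))
    # Whole results are printed without a trailing ".0" (assumed exact-match friendly)
    if result.is_integer():
        return str(int(result))
    # Assumption: round like the percentage branch does
    return str(round(result, 6))


whole = sqrt_from_phrase("What is the square root of 144?")
fractional = sqrt_from_phrase("Square Root of 2")
missing = sqrt_from_phrase("2 plus 2")
```

Because the pattern only accepts unsigned digits, negative inputs never reach `math.sqrt`, so no `ValueError` handling is needed in this branch.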
 def analyze_file(content: str, file_type: str = "text") -> str:
     """
     Analyze file contents, especially CSV files.
 # Tool Creation
 # ==========================================
+def get_my_tools(llm=None):
+    """Get all tools for the GAIA agent (alias maintained for compatibility)"""
+    return get_gaia_tools(llm)
 def get_gaia_tools(llm=None):
     """
     Get all tools needed for GAIA evaluation.
         FunctionTool.from_defaults(
             fn=search_web,
             name="web_search",
+            description="""Use ONLY for:
+        1. Current events after January 2025
+        2. Real-time data (stock prices, weather, sports scores)
+        3. When question explicitly asks to "search" or "look up"
+        4. To verify facts you're uncertain about
+        Do NOT use for general knowledge, historical facts, or math."""
         ),
         FunctionTool.from_defaults(
             fn=calculate,
             name="calculator",
+            description="ALWAYS use for ANY math calculation, including simple arithmetic like 2+2. Required for all numbers."
         ),
         FunctionTool.from_defaults(
             fn=analyze_file,
         FunctionTool.from_defaults(
             fn=get_weather,
             name="weather",
+            description="Get current weather information for any location. Use when asked about weather conditions."
         )
     ]