--- title: GAIA Benchmark Agent emoji: 🕵🏻‍♂️ colorFrom: indigo colorTo: indigo sdk: gradio sdk_version: 6.2.0 app_file: app.py pinned: false hf_oauth: true hf_oauth_expiration_minutes: 480 --- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference # GAIA Benchmark Agent A LangGraph-based AI agent designed to solve questions from the GAIA (General AI Assistants) benchmark. This agent uses Google's Gemini model with custom tools for web search, file processing, and multimodal analysis to answer complex questions requiring reasoning and information gathering. ## Features - **LangGraph Architecture**: Implements a state-graph agent workflow with tool calling capabilities - **Multimodal Capabilities**: - Image analysis (PNG, JPG, JPEG, GIF, WebP, BMP) - YouTube video analysis and transcript extraction - Audio transcription (MP3) - PDF and Excel file processing - **Web Research Tools**: - DuckDuckGo web search - Wikipedia integration - ArXiv academic paper search - Web page content extraction - **Mathematical Operations**: Basic arithmetic and modulus operations - **Gradio Interface**: User-friendly web UI for testing and evaluation - **Automated Evaluation**: Fetches questions from API, processes them, and submits answers - **Observability**: Built-in integration with Langfuse for tracking traces and metrics ## Project Structure ``` GAIA_Benchmark_Agent/ ├── app.py # Main application entry point ├── agents.py # LangGraph agent implementation ├── custom_tools.py # Tool definitions for web search, files, etc. ├── system_prompt.py # Agent system prompt and instructions ├── gradioapp.py # Gradio UI components ├── requirements.txt # Python dependencies └── files/ └── metadata.jsonl # Ground truth data for local testing ``` ## Installation 1. Clone the repository: ```bash git clone https://github.com/yourusername/GAIA_Benchmark_Agent.git cd GAIA_Benchmark_Agent ``` 2. Install dependencies: ```bash pip install -r requirements.txt ``` 3. Set up environment variables: ```bash export GOOGLE_API_KEY="your_google_api_key" export HUGGINGFACEHUB_API_TOKEN="your_hf_token" # Optional. not yet used # Langfuse Observability (Optional) export LANGFUSE_PUBLIC_KEY="pk-lf-..." export LANGFUSE_SECRET_KEY="sk-lf-..." export LANGFUSE_HOST="https://cloud.langfuse.com" # Optional ``` ## Requirements - Python 3.8+ - Google API Key (for Gemini model) - ffmpeg (optional, for audio processing) ### Key Dependencies - `langchain-core`, `langgraph` - Agent framework - `langchain-google-genai` - Google Gemini integration - `gradio` - Web UI - `requests`, `beautifulsoup4` - Web scraping - `pypdf`, `pandas` - File processing - `youtube-transcript-api` - YouTube integration - `ddgs` - DuckDuckGo search ## Usage ### Running the Gradio Interface Launch the web interface for interactive testing: ```bash python app.py ``` This will start a Gradio app where you can: - Log in with your Hugging Face account - Run evaluation on all questions - Test individual questions - View results and scores ### Running Local Tests Test the agent on specific questions without the web interface: ```bash python app.py --test ``` Edit the question indices in [app.py:196](app.py#L196) to customize which questions to test. ### Using the Agent Programmatically ```python from agents import MyGAIAAgents # Initialize agent (automatically uses ACTIVE_AGENT from config) agent = MyGAIAAgents() # Ask a question answer = agent("What is the capital of France?") print(answer) # Ask a question with a file reference answer = agent( "What data is in this spreadsheet?", file_name="data.xlsx" ) print(answer) ``` ## How It Works ### Agent Architecture The agent is built using LangGraph with the following workflow: 1. **Initialize**: Loads the question and system prompt 2. **Assistant Node**: Calls the LLM (Gemini) to decide on tool usage 3. **Tool Node**: Executes requested tools (search, file reading, etc.) 4. **Iteration**: Loops between assistant and tools until answer is found 5. **Termination**: Returns final answer or hits step limit (25 steps max) ### Available Tools **Search & Research:** - `websearch` - DuckDuckGo web search - `wiki_search` - Wikipedia articles - `arvix_search` - Academic papers - `get_webpage_content` - Extract webpage text - `get_youtube_transcript` - YouTube video transcripts - `analyze_youtube_video` - AI analysis of YouTube videos **File Processing:** - `read_excel_file` - Read Excel spreadsheets - `read_python_script` - Read Python source code - `parse_audio_file` - Transcribe MP3 files - `analyze_image` - AI vision analysis of images **Utilities:** - Math operations: `add`, `subtract`, `multiply`, `divide`, `power`, `modulus` - `string_reverse` - Reverse encoded/gibberish text - `get_current_time_in_timezone` - Get time in any timezone ### System Prompt The agent follows strict output formatting rules defined in [system_prompt.py](system_prompt.py): - Returns only the final answer (no conversational filler) - No markdown formatting or JSON structures - Uses tools instead of guessing - Handles encoded/reversed text - Verifies answers before output ## Configuration ### Change Agent Type Edit the `ACTIVE_AGENT` variable in [config.py:32](config.py#L32): ```python # Valid values: "LangGraph", "ReActLangGraph", "LLamaIndex", "SMOL" ACTIVE_AGENT = "LangGraph" # Currently only LangGraph is implemented ``` The `MyGAIAAgents` wrapper class will automatically instantiate the correct agent based on this configuration. ### Adjust Step Limits Modify the maximum iteration count in [agents.py:169](agents.py#L169): ```python if step_count >= 25: # Change this value # ... ``` ### Customize Tools Add or modify tools in [custom_tools.py](custom_tools.py) using the `@tool` decorator: ```python from langchain_core.tools import tool @tool def my_custom_tool(param: str) -> str: """Tool description for the LLM.""" # Implementation return result ``` ## API Integration The agent integrates with the GAIA benchmark API: - **Questions Endpoint**: `https://agents-course-unit4-scoring.hf.space/questions` - **Submit Endpoint**: `https://agents-course-unit4-scoring.hf.space/submit` Questions may include file references which are automatically fetched from: - Local `files/` directory (if available) - Remote API endpoint (fallback) ## Testing ### Local Ground Truth Verification The app includes local verification against ground truth data in `files/metadata.jsonl`. This allows you to test your agent's performance before submitting to the evaluation server. ### Test Mode Run specific questions in test mode by modifying [app.py:196](app.py#L196): ```python my_questions = [ { "question": my_questions_data[i]["question"], "file_name": my_questions_data[i].get("file_name") } for i in (0, 5, 17) if i < len(my_questions_data) # Customize indices ] ``` ## Performance Considerations - **Timeout**: Agent has 180-second timeout per question - **Step Limit**: Maximum 25 reasoning steps to prevent infinite loops - **Tool Timeouts**: Individual tools have their own timeout settings - **Cost**: Uses Google Gemini API (gemini-2.5-flash model) ## Deployment ### Hugging Face Spaces This project is designed to run on Hugging Face Spaces: 1. Create a new Space on Hugging Face 2. Set SDK to Gradio (version 6.2.0+) 3. Add environment variables: `GOOGLE_API_KEY`, `SPACE_ID`, `SPACE_HOST` 4. Enable OAuth authentication The app will automatically detect the Hugging Face environment and configure URLs accordingly. ### Local Deployment Simply run `python app.py` locally. The app will detect it's not in a Hugging Face Space and adjust behavior accordingly. ## Troubleshooting ### Common Issues **"GOOGLE_API_KEY not found"** - Set the environment variable: `export GOOGLE_API_KEY="your_key"` **Audio parsing fails** - Install ffmpeg: `apt-get install ffmpeg` (Linux) or `brew install ffmpeg` (macOS) **Tool timeouts** - Adjust timeout values in respective tool functions in [custom_tools.py](custom_tools.py) **Agent exceeds step limit** - Increase limit in [agents.py:169](agents.py#L169) or optimize tool usage in system prompt ## Contributing Contributions are welcome! Areas for improvement: - Add more tools (database access, code execution, etc.) - Move the Benchmark from 50% to 100% - Improve error handling and retry logic - Try with smaller LLMs - Make it work with Ollama ## License This project is open-source and available under the MIT License. ## Acknowledgments - Built for the GAIA (General AI Assistants) benchmark - Uses Google's Gemini model via LangChain - LangGraph framework by LangChain - Gradio for web interface ## Contact For questions, issues, or suggestions, please open an issue on GitHub.