title: GAIA Benchmark Agent
emoji: 🕵🏻‍♂️
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
GAIA Benchmark Agent
A LangGraph-based AI agent designed to solve questions from the GAIA (General AI Assistants) benchmark. This agent uses Google's Gemini model with custom tools for web search, file processing, and multimodal analysis to answer complex questions requiring reasoning and information gathering.
Features
- LangGraph Architecture: Implements a state-graph agent workflow with tool calling capabilities
- Multimodal Capabilities:
- Image analysis (PNG, JPG, JPEG, GIF, WebP, BMP)
- YouTube video analysis and transcript extraction
- Audio transcription (MP3)
- PDF and Excel file processing
- Web Research Tools:
- DuckDuckGo web search
- Wikipedia integration
- ArXiv academic paper search
- Web page content extraction
- Mathematical Operations: Basic arithmetic, power, and modulus operations
- Gradio Interface: User-friendly web UI for testing and evaluation
- Automated Evaluation: Fetches questions from API, processes them, and submits answers
- Observability: Built-in integration with Langfuse for tracking traces and metrics
Project Structure
GAIA_Benchmark_Agent/
├── app.py # Main application entry point
├── agents.py # LangGraph agent implementation
├── custom_tools.py # Tool definitions for web search, files, etc.
├── system_prompt.py # Agent system prompt and instructions
├── gradioapp.py # Gradio UI components
├── requirements.txt # Python dependencies
└── files/
    └── metadata.jsonl # Ground truth data for local testing
Installation
- Clone the repository:
git clone https://github.com/yourusername/GAIA_Benchmark_Agent.git
cd GAIA_Benchmark_Agent
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
export GOOGLE_API_KEY="your_google_api_key"
export HUGGINGFACEHUB_API_TOKEN="your_hf_token" # Optional; not yet used
# Langfuse Observability (Optional)
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # Optional
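Before launching the app, it can help to confirm which of these variables are actually set. The following is a minimal sketch, not part of the project: the `check_environment` helper and the split into required and optional names are assumptions based on the list above.

```python
import os

# Variable names come from the setup step above; the required/optional
# split is an assumption based on the "Optional" comments.
REQUIRED = ("GOOGLE_API_KEY",)
OPTIONAL = (
    "HUGGINGFACEHUB_API_TOKEN",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_HOST",
)


def check_environment(env=None):
    """Return (missing_required, missing_optional) lists of variable names."""
    env = os.environ if env is None else env
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional


# Example with an explicit mapping instead of the real environment.
missing_required, missing_optional = check_environment({"GOOGLE_API_KEY": "dummy"})
print("Missing required:", missing_required)
print("Missing optional:", missing_optional)
```

Passing a plain dictionary keeps the check easy to try without touching the real environment; calling `check_environment()` with no argument reads `os.environ`.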
Requirements
- Python 3.8+
- Google API Key (for Gemini model)
- ffmpeg (optional, for audio processing)
Key Dependencies
- langchain-core, langgraph - Agent framework
- langchain-google-genai - Google Gemini integration
- gradio - Web UI
- requests, beautifulsoup4 - Web scraping
- pypdf, pandas - File processing
- youtube-transcript-api - YouTube integration
- ddgs - DuckDuckGo search
Usage
Running the Gradio Interface
Launch the web interface for interactive testing:
python app.py
This will start a Gradio app where you can:
- Log in with your Hugging Face account
- Run evaluation on all questions
- Test individual questions
- View results and scores
Running Local Tests
Test the agent on specific questions without the web interface:
python app.py --test
Edit the question indices in app.py:196 to customize which questions to test.
Using the Agent Programmatically
from agents import MyGAIAAgents
# Initialize agent (automatically uses ACTIVE_AGENT from config)
agent = MyGAIAAgents()
# Ask a question
answer = agent("What is the capital of France?")
print(answer)
# Ask a question with a file reference
answer = agent(
    "What data is in this spreadsheet?",
    file_name="data.xlsx"
)
print(answer)
How It Works
Agent Architecture
The agent is built using LangGraph with the following workflow:
- Initialize: Loads the question and system prompt
- Assistant Node: Calls the LLM (Gemini) to decide on tool usage
- Tool Node: Executes requested tools (search, file reading, etc.)
- Iteration: Loops between assistant and tools until answer is found
- Termination: Returns final answer or hits step limit (25 steps max)
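The workflow above can be illustrated with a plain-Python stand-in. This is a sketch of the control flow only, not the LangGraph implementation in agents.py: `run_agent`, `scripted_assistant`, and the message tuples are hypothetical, and the scripted assistant replaces the Gemini call.

```python
MAX_STEPS = 25  # step limit quoted in this README


def run_agent(question, assistant, tools, max_steps=MAX_STEPS):
    """Alternate between the assistant and the tools until an answer appears.

    `assistant` returns ("final", answer) or ("tool", (tool_name, tool_input)).
    """
    # Initialize: system prompt placeholder plus the user's question.
    messages = [("system", "system prompt placeholder"), ("user", question)]
    for _ in range(max_steps):
        kind, payload = assistant(messages)  # assistant node
        if kind == "final":
            return payload  # termination with an answer
        tool_name, tool_input = payload
        result = tools[tool_name](tool_input)  # tool node
        messages.append(("tool", f"{tool_name} -> {result}"))
    return "Step limit reached without a final answer."


def scripted_assistant(messages):
    """Stand-in for the LLM: request one tool call, then answer with its output."""
    tool_results = [text for role, text in messages if role == "tool"]
    if not tool_results:
        return "tool", ("reverse", "sirap")
    return "final", tool_results[-1].split(" -> ")[1]


answer = run_agent("Decode 'sirap'", scripted_assistant, {"reverse": lambda s: s[::-1]})
print(answer)
```

The loop bound plays the role of the step limit: an assistant that never produces a final answer is cut off after `max_steps` iterations.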
Available Tools
Search & Research:
- websearch - DuckDuckGo web search
- wiki_search - Wikipedia articles
- arvix_search - Academic papers
- get_webpage_content - Extract webpage text
- get_youtube_transcript - YouTube video transcripts
- analyze_youtube_video - AI analysis of YouTube videos
File Processing:
- read_excel_file - Read Excel spreadsheets
- read_python_script - Read Python source code
- parse_audio_file - Transcribe MP3 files
- analyze_image - AI vision analysis of images
Utilities:
- Math operations: add, subtract, multiply, divide, power, modulus
- string_reverse - Reverse encoded/gibberish text
- get_current_time_in_timezone - Get time in any timezone
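To make the utility tools concrete, here is a rough sketch of what three of them might look like as plain functions. The names match the list above, but the bodies are assumptions; the versions in custom_tools.py are registered with the @tool decorator and may differ in signature and error handling.

```python
def divide(a: float, b: float) -> float:
    """Divide a by b, rejecting division by zero with a readable error."""
    if b == 0:
        raise ValueError("Cannot divide by zero.")
    return a / b


def modulus(a: int, b: int) -> int:
    """Return the remainder of a divided by b."""
    return a % b


def string_reverse(text: str) -> str:
    """Reverse a string, which decodes questions that were written backwards."""
    return text[::-1]


print(string_reverse(".tfel"))
print(modulus(17, 5))
```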
System Prompt
The agent follows strict output formatting rules defined in system_prompt.py:
- Returns only the final answer (no conversational filler)
- No markdown formatting or JSON structures
- Uses tools instead of guessing
- Handles encoded/reversed text
- Verifies answers before output
Configuration
Change Agent Type
Edit the ACTIVE_AGENT variable in config.py:32:
# Valid values: "LangGraph", "ReActLangGraph", "LLamaIndex", "SMOL"
ACTIVE_AGENT = "LangGraph" # Currently only LangGraph is implemented
The MyGAIAAgents wrapper class will automatically instantiate the correct agent based on this configuration.
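One way such a wrapper could dispatch on the configured name is a small registry. The sketch below is an assumption about the pattern, not the code in agents.py: `AgentWrapperSketch`, `LangGraphAgentStub`, and `AGENT_REGISTRY` are hypothetical names, and the stub stands in for the real agent.

```python
ACTIVE_AGENT = "LangGraph"  # mirrors the configuration value shown above
VALID_AGENTS = ("LangGraph", "ReActLangGraph", "LLamaIndex", "SMOL")


class LangGraphAgentStub:
    """Hypothetical stand-in for the real LangGraph agent."""

    def __call__(self, question, file_name=None):
        return f"[LangGraph stub] {question}"


# Only implemented agent types are registered.
AGENT_REGISTRY = {"LangGraph": LangGraphAgentStub}


class AgentWrapperSketch:
    """Instantiate the agent class selected by the configuration value."""

    def __init__(self, active_agent=ACTIVE_AGENT):
        if active_agent not in VALID_AGENTS:
            raise ValueError(f"Unknown agent type: {active_agent!r}")
        if active_agent not in AGENT_REGISTRY:
            raise NotImplementedError(f"{active_agent} is not implemented yet")
        self._agent = AGENT_REGISTRY[active_agent]()

    def __call__(self, question, file_name=None):
        return self._agent(question, file_name=file_name)


agent = AgentWrapperSketch()
print(agent("What is the capital of France?"))
```

Separating the valid names from the registered classes lets a not-yet-implemented type fail with a clearer error than a misspelled one.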
Adjust Step Limits
Modify the maximum iteration count in agents.py:169:
if step_count >= 25: # Change this value
# ...
Customize Tools
Add or modify tools in custom_tools.py using the @tool decorator:
from langchain_core.tools import tool

@tool
def my_custom_tool(param: str) -> str:
    """Tool description for the LLM."""
    # Placeholder implementation: replace with the tool's real logic
    result = param.strip()
    return result
API Integration
The agent integrates with the GAIA benchmark API:
- Questions Endpoint: https://agents-course-unit4-scoring.hf.space/questions
- Submit Endpoint: https://agents-course-unit4-scoring.hf.space/submit
Questions may include file references which are automatically fetched from:
- Local files/ directory (if available)
- Remote API endpoint (fallback)
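The local-first lookup with a remote fallback can be sketched as follows. This is an illustration under assumptions: `resolve_file` is a hypothetical helper, and the remote download is passed in as a callable because the exact file endpoint is not stated here.

```python
import tempfile
from pathlib import Path


def resolve_file(file_name, local_dir, fetch_remote):
    """Return the file's bytes from local_dir if present, else from fetch_remote."""
    local_path = Path(local_dir) / file_name
    if local_path.is_file():
        return local_path.read_bytes()
    # Fallback: delegate to the caller-supplied remote fetcher.
    return fetch_remote(file_name)


def fake_remote(file_name):
    """Stand-in for an HTTP download; a real fetcher would call the API."""
    return b"remote copy"


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "data.txt").write_bytes(b"local copy")
    local_hit = resolve_file("data.txt", tmp, fake_remote)
    remote_hit = resolve_file("missing.txt", tmp, fake_remote)

print(local_hit, remote_hit)
```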
Testing
Local Ground Truth Verification
The app includes local verification against ground truth data in files/metadata.jsonl. This allows you to test your agent's performance before submitting to the evaluation server.
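A local check of this kind might look like the sketch below. The field names `task_id` and `Final answer` are assumptions about the layout of metadata.jsonl, and the case-insensitive exact match is a simplification that may not agree with the evaluation server's scoring; `load_ground_truth` and `score_answers` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def load_ground_truth(path):
    """Map task IDs to expected answers from a JSON Lines file."""
    truth = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            truth[record["task_id"]] = record["Final answer"]  # assumed field names
    return truth


def score_answers(answers, truth):
    """Return the percentage of tasks whose answer matches, ignoring case and outer spaces."""
    if not truth:
        return 0.0
    correct = sum(
        1
        for task_id, expected in truth.items()
        if answers.get(task_id, "").strip().lower() == expected.strip().lower()
    )
    return 100.0 * correct / len(truth)


# Tiny demonstration with a temporary file instead of files/metadata.jsonl.
records = [
    {"task_id": "t1", "Final answer": "Paris"},
    {"task_id": "t2", "Final answer": "42"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in records), encoding="utf-8")
    truth = load_ground_truth(path)

accuracy = score_answers({"t1": " paris ", "t2": "41"}, truth)
print(f"Accuracy: {accuracy:.1f}%")
```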
Test Mode
Run specific questions in test mode by modifying app.py:196:
my_questions = [
    {
        "question": my_questions_data[i]["question"],
        "file_name": my_questions_data[i].get("file_name")
    }
    for i in (0, 5, 17) if i < len(my_questions_data)  # Customize indices
]
Performance Considerations
- Timeout: Agent has 180-second timeout per question
- Step Limit: Maximum 25 reasoning steps to prevent infinite loops
- Tool Timeouts: Individual tools have their own timeout settings
- Cost: Uses Google Gemini API (gemini-2.5-flash model)
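A per-question timeout like the one listed above can be enforced with the standard library. This is a minimal sketch, assuming a thread-based approach; the project may implement its timeout differently, and `answer_with_timeout` is a hypothetical helper.

```python
import concurrent.futures
import time

QUESTION_TIMEOUT_SECONDS = 180  # per-question timeout quoted above


def answer_with_timeout(agent, question, timeout=QUESTION_TIMEOUT_SECONDS):
    """Run agent(question) in a worker thread and give up after `timeout` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(agent, question)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return "TIMEOUT"
    finally:
        # Do not block on the worker; it keeps running until it finishes.
        executor.shutdown(wait=False)


def slow_agent(question):
    """Stand-in agent that takes longer than the demo timeout."""
    time.sleep(0.5)
    return "late answer"


fast_result = answer_with_timeout(lambda q: q.upper(), "hi", timeout=1)
slow_result = answer_with_timeout(slow_agent, "hi", timeout=0.05)
print(fast_result, slow_result)
```

A thread cannot be cancelled once it has started, so this approach only stops waiting for the answer; the underlying call continues in the background until it returns.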
Deployment
Hugging Face Spaces
This project is designed to run on Hugging Face Spaces:
- Create a new Space on Hugging Face
- Set SDK to Gradio (version 6.2.0+)
- Add environment variables: GOOGLE_API_KEY, SPACE_ID, SPACE_HOST
- Enable OAuth authentication
The app will automatically detect the Hugging Face environment and configure URLs accordingly.
Local Deployment
Simply run python app.py locally. The app will detect it's not in a Hugging Face Space and adjust behavior accordingly.
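Environment detection of this kind typically keys off the variables that Spaces provides. The sketch below is an assumption about how such a check could look, not the logic in app.py: `describe_runtime` is a hypothetical helper that treats the presence of SPACE_ID as the signal.

```python
import os


def describe_runtime(env=None):
    """Report whether the app appears to run on Hugging Face Spaces."""
    env = os.environ if env is None else env
    space_id = env.get("SPACE_ID")  # e.g. "user/space-name" on Spaces
    return {
        "on_spaces": bool(space_id),
        "space_host": env.get("SPACE_HOST"),  # passed through unchanged
        "code_url": (
            f"https://huggingface.co/spaces/{space_id}/tree/main" if space_id else None
        ),
    }


local_info = describe_runtime({})
spaces_info = describe_runtime({"SPACE_ID": "user/demo"})
print(local_info["on_spaces"], spaces_info["code_url"])
```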
Troubleshooting
Common Issues
"GOOGLE_API_KEY not found"
- Set the environment variable:
export GOOGLE_API_KEY="your_key"
Audio parsing fails
- Install ffmpeg: apt-get install ffmpeg (Linux) or brew install ffmpeg (macOS)
Tool timeouts
- Adjust timeout values in respective tool functions in custom_tools.py
Agent exceeds step limit
- Increase limit in agents.py:169 or optimize tool usage in system prompt
Contributing
Contributions are welcome! Areas for improvement:
- Add more tools (database access, code execution, etc.)
- Raise the benchmark score from 50% to 100%
- Improve error handling and retry logic
- Try with smaller LLMs
- Make it work with Ollama
License
This project is open-source and available under the MIT License.
Acknowledgments
- Built for the GAIA (General AI Assistants) benchmark
- Uses Google's Gemini model via LangChain
- LangGraph framework by LangChain
- Gradio for web interface
Contact
For questions, issues, or suggestions, please open an issue on GitHub.