---
title: GAIA Benchmark Agent
emoji: πŸ•΅πŸ»β€β™‚οΈ
colorFrom: indigo
colorTo: indigo
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
hf_oauth: true
hf_oauth_expiration_minutes: 480
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

GAIA Benchmark Agent

A LangGraph-based AI agent designed to solve questions from the GAIA (General AI Assistants) benchmark. This agent uses Google's Gemini model with custom tools for web search, file processing, and multimodal analysis to answer complex questions requiring reasoning and information gathering.

Features

  • LangGraph Architecture: Implements a state-graph agent workflow with tool calling capabilities
  • Multimodal Capabilities:
    • Image analysis (PNG, JPG, JPEG, GIF, WebP, BMP)
    • YouTube video analysis and transcript extraction
    • Audio transcription (MP3)
    • PDF and Excel file processing
  • Web Research Tools:
    • DuckDuckGo web search
    • Wikipedia integration
    • ArXiv academic paper search
    • Web page content extraction
  • Mathematical Operations: Basic arithmetic and modulus operations
  • Gradio Interface: User-friendly web UI for testing and evaluation
  • Automated Evaluation: Fetches questions from API, processes them, and submits answers
  • Observability: Built-in integration with Langfuse for tracking traces and metrics

Project Structure

GAIA_Benchmark_Agent/
β”œβ”€β”€ app.py              # Main application entry point
β”œβ”€β”€ agents.py           # LangGraph agent implementation
β”œβ”€β”€ custom_tools.py     # Tool definitions for web search, files, etc.
β”œβ”€β”€ system_prompt.py    # Agent system prompt and instructions
β”œβ”€β”€ gradioapp.py        # Gradio UI components
β”œβ”€β”€ requirements.txt    # Python dependencies
└── files/
    └── metadata.jsonl  # Ground truth data for local testing

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/GAIA_Benchmark_Agent.git
cd GAIA_Benchmark_Agent
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables:
export GOOGLE_API_KEY="your_google_api_key"
export HUGGINGFACEHUB_API_TOKEN="your_hf_token"  # Optional; not yet used

# Langfuse Observability (Optional)
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # Optional

Requirements

  • Python 3.8+
  • Google API Key (for Gemini model)
  • ffmpeg (optional, for audio processing)

Key Dependencies

  • langchain-core, langgraph - Agent framework
  • langchain-google-genai - Google Gemini integration
  • gradio - Web UI
  • requests, beautifulsoup4 - Web scraping
  • pypdf, pandas - File processing
  • youtube-transcript-api - YouTube integration
  • ddgs - DuckDuckGo search

Usage

Running the Gradio Interface

Launch the web interface for interactive testing:

python app.py

This will start a Gradio app where you can:

  • Log in with your Hugging Face account
  • Run evaluation on all questions
  • Test individual questions
  • View results and scores

Running Local Tests

Test the agent on specific questions without the web interface:

python app.py --test

Edit the question indices in app.py:196 to customize which questions to test.

Using the Agent Programmatically

from agents import MyGAIAAgents

# Initialize agent (automatically uses ACTIVE_AGENT from config)
agent = MyGAIAAgents()

# Ask a question
answer = agent("What is the capital of France?")
print(answer)

# Ask a question with a file reference
answer = agent(
    "What data is in this spreadsheet?",
    file_name="data.xlsx"
)
print(answer)

How It Works

Agent Architecture

The agent is built using LangGraph with the following workflow:

  1. Initialize: Loads the question and system prompt
  2. Assistant Node: Calls the LLM (Gemini) to decide on tool usage
  3. Tool Node: Executes requested tools (search, file reading, etc.)
  4. Iteration: Loops between assistant and tools until answer is found
  5. Termination: Returns final answer or hits step limit (25 steps max)
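The loop above can be sketched in plain Python. This is a simplified stand-in for the actual LangGraph graph, not the project's code; `llm` and `tools` here are hypothetical callables.

```python
# Simplified stand-in for the assistant/tool loop described above.
# `llm` and `tools` are hypothetical callables, not the project's real objects.
MAX_STEPS = 25  # mirrors the step limit in agents.py

def run_agent(question, llm, tools):
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_STEPS):
        reply = llm(messages)               # assistant node: LLM decides
        messages.append(reply)
        if not reply.get("tool_calls"):     # termination: no tool requested
            return reply["content"]
        for call in reply["tool_calls"]:    # tool node: execute each request
            result = tools[call["name"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})
    return "AGENT ERROR: step limit reached"
```

In the real agent, LangGraph's conditional edges play the role of the `if not reply.get("tool_calls")` check, routing between the assistant and tool nodes.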

Available Tools

Search & Research:

  • websearch - DuckDuckGo web search
  • wiki_search - Wikipedia articles
  • arvix_search - ArXiv academic papers
  • get_webpage_content - Extract webpage text
  • get_youtube_transcript - YouTube video transcripts
  • analyze_youtube_video - AI analysis of YouTube videos

File Processing:

  • read_excel_file - Read Excel spreadsheets
  • read_python_script - Read Python source code
  • parse_audio_file - Transcribe MP3 files
  • analyze_image - AI vision analysis of images

Utilities:

  • Math operations: add, subtract, multiply, divide, power, modulus
  • string_reverse - Reverse encoded/gibberish text
  • get_current_time_in_timezone - Get time in any timezone
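
As an illustration, the `string_reverse` utility can be as small as a one-line slice; the real tool in custom_tools.py may differ in details.

```python
# Illustrative sketch of the string_reverse utility described above.
def string_reverse(text: str) -> str:
    """Reverse a string, decoding the reversed text some GAIA questions use."""
    return text[::-1]
```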

System Prompt

The agent follows strict output formatting rules defined in system_prompt.py:

  • Returns only the final answer (no conversational filler)
  • No markdown formatting or JSON structures
  • Uses tools instead of guessing
  • Handles encoded/reversed text
  • Verifies answers before output
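
The rules above translate into prompt text along these lines; the wording here is illustrative, not the actual contents of system_prompt.py.

```python
# Illustrative example of rules like those in system_prompt.py; the exact
# wording of the real prompt is not reproduced here.
SYSTEM_PROMPT = (
    "You are a precise research assistant answering GAIA benchmark questions.\n"
    "Report ONLY the final answer: no explanations, no markdown, no JSON.\n"
    "If a question contains reversed or encoded text, decode it first.\n"
    "Use your tools instead of guessing, and verify the answer before replying."
)
```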

Configuration

Change Agent Type

Edit the ACTIVE_AGENT variable in config.py:32:

# Valid values: "LangGraph", "ReActLangGraph", "LLamaIndex", "SMOL"
ACTIVE_AGENT = "LangGraph"  # Currently only LangGraph is implemented

The MyGAIAAgents wrapper class will automatically instantiate the correct agent based on this configuration.

Adjust Step Limits

Modify the maximum iteration count in agents.py:169:

if step_count >= 25:  # Change this value
    # ...

Customize Tools

Add or modify tools in custom_tools.py using the @tool decorator:

from langchain_core.tools import tool

@tool
def my_custom_tool(param: str) -> str:
    """Tool description for the LLM."""
    # Implementation goes here; echo the input as a placeholder.
    return f"Processed: {param}"

API Integration

The agent integrates with the GAIA benchmark API:

  • Questions Endpoint: https://agents-course-unit4-scoring.hf.space/questions
  • Submit Endpoint: https://agents-course-unit4-scoring.hf.space/submit

Questions may include file references which are automatically fetched from:

  • Local files/ directory (if available)
  • Remote API endpoint (fallback)
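
The local-first lookup can be sketched as below. The remote path `/files/{file_name}` is an assumption about the scoring API, not a documented endpoint; check the actual fetch logic in app.py.

```python
# Local-first file lookup described above; the remote endpoint path is an
# assumption, not documented API.
import os
from urllib.request import urlopen

API_BASE = "https://agents-course-unit4-scoring.hf.space"

def fetch_task_file(file_name: str, local_dir: str = "files") -> bytes:
    """Return the file's bytes from the local files/ dir, else the remote API."""
    local_path = os.path.join(local_dir, file_name)
    if os.path.exists(local_path):          # local files/ directory first
        with open(local_path, "rb") as f:
            return f.read()
    with urlopen(f"{API_BASE}/files/{file_name}") as resp:  # remote fallback
        return resp.read()
```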

Testing

Local Ground Truth Verification

The app includes local verification against ground truth data in files/metadata.jsonl. This allows you to test your agent's performance before submitting to the evaluation server.
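
A minimal version of that verification might look like the following. The `"Final answer"` key name is an assumption based on the GAIA metadata format; adjust it to match the actual files/metadata.jsonl.

```python
# Sketch of scoring predictions against files/metadata.jsonl; the
# "Final answer" key name is an assumption about the metadata schema.
import json

def load_ground_truth(path="files/metadata.jsonl"):
    """Parse one JSON object per non-empty line of the JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def score(predictions, ground_truth, key="Final answer"):
    """Fraction of predictions matching ground truth (case-insensitive)."""
    correct = sum(
        p.strip().lower() == gt[key].strip().lower()
        for p, gt in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)
```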

Test Mode

Run specific questions in test mode by modifying app.py:196:

my_questions = [
    {
        "question": my_questions_data[i]["question"],
        "file_name": my_questions_data[i].get("file_name")
    }
    for i in (0, 5, 17) if i < len(my_questions_data)  # Customize indices
]

Performance Considerations

  • Timeout: Agent has 180-second timeout per question
  • Step Limit: Maximum 25 reasoning steps to prevent infinite loops
  • Tool Timeouts: Individual tools have their own timeout settings
  • Cost: Uses Google Gemini API (gemini-2.5-flash model)
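
One way to enforce a 180-second per-question budget is a worker-thread wrapper like the one below; this is a sketch, not the app's actual timeout mechanism.

```python
# Sketch of a per-question timeout guard using a worker thread.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def answer_with_timeout(agent, question, timeout=180):
    """Run agent(question), returning an error string if it exceeds timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(agent, question)
        try:
            return future.result(timeout=timeout)
        except FutureTimeout:
            return "AGENT ERROR: question timed out"
```

Note that a timed-out worker thread keeps running in the background until its call returns; the wrapper only stops waiting for it.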

Deployment

Hugging Face Spaces

This project is designed to run on Hugging Face Spaces:

  1. Create a new Space on Hugging Face
  2. Set SDK to Gradio (version 6.2.0+)
  3. Add environment variables: GOOGLE_API_KEY, SPACE_ID, SPACE_HOST
  4. Enable OAuth authentication

The app will automatically detect the Hugging Face environment and configure URLs accordingly.

Local Deployment

Simply run python app.py locally. The app will detect it's not in a Hugging Face Space and adjust behavior accordingly.

Troubleshooting

Common Issues

"GOOGLE_API_KEY not found"

  • Set the environment variable: export GOOGLE_API_KEY="your_key"

Audio parsing fails

  • Install ffmpeg: apt-get install ffmpeg (Linux) or brew install ffmpeg (macOS)

Tool timeouts

  • Individual tools enforce their own timeout settings; retry the question, or adjust the timeouts in custom_tools.py for slow pages or large downloads
Agent exceeds step limit

  • Increase limit in agents.py:169 or optimize tool usage in system prompt

Contributing

Contributions are welcome! Areas for improvement:

  • Add more tools (database access, code execution, etc.)
  • Raise the benchmark score from ~50% toward 100%
  • Improve error handling and retry logic
  • Try with smaller LLMs
  • Make it work with Ollama

License

This project is open-source and available under the MIT License.

Acknowledgments

  • Built for the GAIA (General AI Assistants) benchmark
  • Uses Google's Gemini model via LangChain
  • LangGraph framework by LangChain
  • Gradio for web interface

Contact

For questions, issues, or suggestions, please open an issue on GitHub.