Spaces: panos-span / gaia-benchmark-agent

metadata

title: Gaia Benchmark Agent
emoji: 🏆
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: false
hf_oauth: true

GAIA Benchmark Agent with smolagents

An AI agent for the GAIA benchmark, built with smolagents and the Qwen2.5-32B-Instruct model.

🚀 Features

Qwen 32B Model: Qwen/Qwen2.5-32B-Instruct drives the agent's reasoning
Comprehensive Tools: Web search, Wikipedia, calculations, file processing
Parallel Processing: Efficient multi-question handling
GAIA Optimized: Prompt and answer formatting tuned for the benchmark's exact-match scoring
Secure Execution: Sandboxed code execution environment

🎯 Performance Target

Goal: At least 30% accuracy on GAIA Level 1 questions
Approach: Multi-tool reasoning with precise answer formatting
Evaluation: Exact string matching compliance
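
Because evaluation relies on exact string matching, stray prefixes, quotes, or trailing punctuation in the agent's output can turn a correct answer into a miss. The following is a minimal sketch of answer clean-up, not the benchmark's official scorer. The "FINAL ANSWER:" marker and the helper names `clean_answer` and `is_exact_match` are assumptions made for illustration.

```python
import re


def clean_answer(raw: str) -> str:
    """Reduce a model response to a bare answer string."""
    text = raw.strip()
    # Assumed convention: the answer follows the last "FINAL ANSWER:" marker.
    text = re.split(r"final answer\s*:", text, flags=re.IGNORECASE)[-1].strip()
    # Drop surrounding quotes, then a single trailing period.
    text = text.strip("\"'")
    if text.endswith("."):
        text = text[:-1]
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def is_exact_match(prediction: str, expected: str) -> bool:
    """Compare a cleaned prediction against the expected answer."""
    return clean_answer(prediction) == expected.strip()
```

For example, `is_exact_match("The capital is Paris.\nFINAL ANSWER: Paris.", "Paris")` returns `True`, while the same response without clean-up would not match.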

🛠️ Setup Instructions

1. Environment Variables

Set the following in your Space settings:

HF_TOKEN: Your Hugging Face API token (required)
TAVILY_API_KEY: Tavily search API key (optional)
SERPER_API_KEY: Serper search API key (optional)
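
One way to read these variables at startup is sketched below. It fails early when the required token is missing and treats the search keys as optional. The function name `load_settings` and the returned dictionary layout are hypothetical; the actual app.py may handle this differently.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("HF_TOKEN",)
OPTIONAL = ("TAVILY_API_KEY", "SERPER_API_KEY")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect API keys from the environment (or from a mapping, for testing)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variable(s): " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED}
    # Optional keys become None when unset or empty, so callers can skip that tool.
    for name in OPTIONAL:
        settings[name] = env.get(name) or None
    return settings
```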

2. API Keys

Hugging Face Token

Go to your Hugging Face account settings and open Access Tokens
Create a new token with Read permissions
Add to Space secrets as HF_TOKEN

Optional Search APIs

Tavily: Get API Key
Serper: Get API Key

📁 File Structure

├── app.py              # Main Gradio application
├── agent.py            # smolagents implementation
├── requirements.txt    # Python dependencies
├── system_prompt.txt   # Agent instructions
├── .env.example        # Environment template
├── .gitignore          # Git ignore rules
└── README.md          # This file

🔧 Usage

Login: Authenticate with Hugging Face
Test: Try single questions to verify agent functionality
Evaluate: Run full GAIA benchmark evaluation
Submit: Results are submitted to the leaderboard automatically

🧠 Agent Architecture

Framework: smolagents CodeAgent
Model: Qwen/Qwen2.5-32B-Instruct
Tools: 8+ specialized tools for different task types
Processing: Parallel question handling for efficiency
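
The pieces above could be wired together roughly as follows. This is a sketch under stated assumptions, not the contents of agent.py: it assumes a recent smolagents release that exposes `InferenceClientModel` (older releases name the equivalent class `HfApiModel`), and the helper names `build_agent` and `answer_all` are hypothetical. smolagents is imported lazily so the parallel helper can be used and tested without it.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MODEL_ID = "Qwen/Qwen2.5-32B-Instruct"  # model named in this README


def build_agent(tools=None):
    """Create a CodeAgent backed by the Qwen model (requires smolagents)."""
    # Assumption: the installed smolagents version provides InferenceClientModel.
    from smolagents import CodeAgent, InferenceClientModel

    model = InferenceClientModel(model_id=MODEL_ID, token=os.environ["HF_TOKEN"])
    return CodeAgent(tools=list(tools or []), model=model)


def answer_all(
    questions: List[str],
    answer_fn: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Answer questions concurrently; results keep the input order."""

    def safe(question: str) -> str:
        # One failing question should not abort the whole batch.
        try:
            return answer_fn(question)
        except Exception as exc:
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, questions))
```

A call such as `answer_all(questions, lambda q: str(build_agent().run(q)))` builds a fresh agent per question, which avoids sharing agent state between threads at the cost of some setup time.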

📊 Tool Capabilities

Web Search: Current information and recent events
Wikipedia: Factual and historical information
Mathematics: Complex calculations and conversions
File Processing: Analysis of various file formats
Unit Conversion: Length, weight, temperature, currency
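
As an illustration of what a unit conversion tool might look like, here is a small sketch covering length and temperature. The function names and the set of supported units are assumptions; currency is left out because it needs live exchange rates. In smolagents, a function like this would typically be registered through the library's `tool` decorator, which is omitted here to keep the example dependency-free.

```python
# Conversion factors to meters for a few common length units.
LENGTH_IN_METERS = {
    "mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0,
    "in": 0.0254, "ft": 0.3048, "mi": 1609.344,
}


def convert_length(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a length by going through meters."""
    try:
        meters = value * LENGTH_IN_METERS[from_unit]
        return meters / LENGTH_IN_METERS[to_unit]
    except KeyError as exc:
        raise ValueError(f"Unsupported length unit: {exc.args[0]}") from None


def convert_temperature(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between Celsius (C), Fahrenheit (F), and Kelvin (K)."""
    to_celsius = {
        "C": lambda v: v,
        "F": lambda v: (v - 32) * 5 / 9,
        "K": lambda v: v - 273.15,
    }
    from_celsius = {
        "C": lambda v: v,
        "F": lambda v: v * 9 / 5 + 32,
        "K": lambda v: v + 273.15,
    }
    if from_unit not in to_celsius or to_unit not in from_celsius:
        raise ValueError(f"Unsupported temperature unit: {from_unit} or {to_unit}")
    return from_celsius[to_unit](to_celsius[from_unit](value))
```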

🔒 Security Features

Sandboxed code execution
Authorized import restrictions
API timeout handling
Error recovery mechanisms
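
To make the idea of authorized import restrictions concrete, the sketch below statically checks a code snippet against an allowlist before it would be run. This only illustrates the concept: it is not a sandbox and not how smolagents enforces its own restrictions, and the function name `find_unauthorized_imports` is hypothetical.

```python
import ast
from typing import Iterable, List


def find_unauthorized_imports(source: str, allowed: Iterable[str]) -> List[str]:
    """Return imported module names whose top-level package is not allowed."""
    allowed_roots = set(allowed)
    blocked = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no module name and are reported as "".
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] not in allowed_roots:
                blocked.add(name)
    return sorted(blocked)
```

A caller could refuse to execute any snippet for which this returns a non-empty list. Dynamic imports via `__import__` or `importlib` are not caught by a check like this, which is one reason a real sandbox is still needed.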

📈 Performance Monitoring

The agent provides detailed logging and performance metrics:

Processing time per question
Tool usage statistics
Error tracking and recovery
Success rate monitoring
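
A minimal sketch of how these four metrics could be collected in one place is shown below. The class name `RunMetrics` and its fields are hypothetical and are not taken from the Space's code.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class RunMetrics:
    """Accumulates per-question timing, tool usage, errors, and correctness."""

    durations: List[float] = field(default_factory=list)
    tool_calls: Counter = field(default_factory=Counter)
    errors: List[str] = field(default_factory=list)
    correct: int = 0

    def record(
        self,
        seconds: float,
        tools: Iterable[str] = (),
        error: Optional[str] = None,
        is_correct: bool = False,
    ) -> None:
        """Record the outcome of one question."""
        self.durations.append(seconds)
        self.tool_calls.update(tools)
        if error is not None:
            self.errors.append(error)
        if is_correct:
            self.correct += 1

    def summary(self) -> dict:
        """Aggregate the recorded values into a report."""
        total = len(self.durations)
        return {
            "questions": total,
            "avg_seconds": sum(self.durations) / total if total else 0.0,
            "success_rate": self.correct / total if total else 0.0,
            "errors": len(self.errors),
            "tool_usage": dict(self.tool_calls),
        }
```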

🔗 Related Links

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference