title: Gaia Benchmark Agent
emoji: π
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: false
hf_oauth: true
GAIA Benchmark Agent with smolagents
An AI agent for the GAIA benchmark, built with smolagents and the Qwen2.5-32B-Instruct model.
Features
- Qwen 32B Model: Qwen/Qwen2.5-32B-Instruct drives the agent's reasoning
- Comprehensive Tools: Web search, Wikipedia, calculations, file processing
- Parallel Processing: Efficient multi-question handling
- GAIA Optimized: Answers are formatted for the benchmark's exact-match scoring
- Secure Execution: Sandboxed code execution environment
Performance Target
- Goal: 30%+ accuracy on GAIA Level 1 questions
- Approach: Multi-tool reasoning with precise answer formatting
- Evaluation: Answers are scored by exact string match, so output formatting must comply
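Because scoring is by exact string match, small formatting differences such as stray whitespace or a trailing period can turn a correct answer into a miss. The sketch below shows one possible clean-up step. The function name `normalize_answer`, the "FINAL ANSWER:" prefix handling, and the specific rules are assumptions for illustration; they are not taken from agent.py or from the benchmark's scorer.

```python
import re


def normalize_answer(raw: str) -> str:
    """Hypothetical clean-up applied before an exact-match comparison."""
    text = raw.strip()
    # Drop a leading "FINAL ANSWER:" marker if the model emitted one (assumed convention).
    text = re.sub(r"^final answer\s*:\s*", "", text, flags=re.IGNORECASE)
    # Collapse internal runs of whitespace to a single space.
    text = re.sub(r"\s+", " ", text)
    # Remove one trailing period; decimals such as "3.14" do not end in "." and are kept.
    if text.endswith("."):
        text = text[:-1].rstrip()
    return text
```

Whether a rule like the trailing-period removal helps or hurts depends on the expected answers, so any such step is worth checking against known question/answer pairs first.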
Setup Instructions
1. Environment Variables
Set the following in your Space settings:
- HF_TOKEN: Your Hugging Face API token (required)
- TAVILY_API_KEY: Tavily search API key (optional)
- SERPER_API_KEY: Serper search API key (optional)
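As an illustration of how the application might read these variables at startup, here is a minimal sketch. The helper name `load_api_keys` and the choice to raise `RuntimeError` when the token is missing are assumptions, not the behavior of app.py.

```python
import os


def load_api_keys(env=None) -> dict:
    """Read the three variables named above; only HF_TOKEN is mandatory."""
    env = os.environ if env is None else env
    hf_token = env.get("HF_TOKEN")
    if not hf_token:
        raise RuntimeError("HF_TOKEN is not set; add it in the Space settings.")
    return {
        "HF_TOKEN": hf_token,
        # Optional search keys: None means that search backend is unavailable.
        "TAVILY_API_KEY": env.get("TAVILY_API_KEY") or None,
        "SERPER_API_KEY": env.get("SERPER_API_KEY") or None,
    }
```

Failing early on a missing required token gives a clearer error than a failed model call later in the run.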
2. API Keys
Hugging Face Token
- Go to HF Settings
- Create a new token with Read permissions
- Add to Space secrets as HF_TOKEN
Optional Search APIs
- Tavily: Get API Key
- Serper: Get API Key
File Structure
├── app.py # Main Gradio application
├── agent.py # smolagents implementation
├── requirements.txt # Python dependencies
├── system_prompt.txt # Agent instructions
├── .env.example # Environment template
├── .gitignore # Git ignore rules
└── README.md # This file
Usage
- Login: Authenticate with Hugging Face
- Test: Try single questions to verify agent functionality
- Evaluate: Run full GAIA benchmark evaluation
- Submit: Automatic submission to leaderboard
Agent Architecture
- Framework: smolagents CodeAgent
- Model: Qwen/Qwen2.5-32B-Instruct
- Tools: 8+ specialized tools for different task types
- Processing: Parallel question handling for efficiency
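The README does not say how questions are processed in parallel. One common approach for I/O-bound model and search calls is a thread pool, sketched below. The function `answer_all`, the `max_workers` default, and the "ERROR:" fallback string are hypothetical; `answer_fn` stands in for whatever callable runs the agent on a single question.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_all(questions, answer_fn, max_workers=4):
    """Run answer_fn over every question concurrently, preserving input order."""

    def safe(question):
        try:
            return answer_fn(question)
        except Exception as exc:
            # One failing question should not abort the whole batch.
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input questions.
        return list(pool.map(safe, questions))
```

Keeping results in input order makes it straightforward to pair each answer with its question ID at submission time.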
Tool Capabilities
- Web Search: Current information and recent events
- Wikipedia: Factual and historical information
- Mathematics: Complex calculations and conversions
- File Processing: Analysis of various file formats
- Unit Conversion: Length, weight, temperature, currency
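To make the unit-conversion capability concrete, here is a small sketch covering length and temperature only. The function names, the unit abbreviations, and the conversion table are assumptions for illustration; the actual tool in agent.py may differ, and registering the functions as smolagents tools is left out.

```python
# Hypothetical table: how many meters one of each unit represents.
LENGTH_TO_METERS = {"m": 1.0, "km": 1000.0, "cm": 0.01, "mi": 1609.344, "ft": 0.3048}


def convert_length(value: float, from_unit: str, to_unit: str) -> float:
    """Convert via meters as the common base unit."""
    return value * LENGTH_TO_METERS[from_unit] / LENGTH_TO_METERS[to_unit]


def convert_temperature(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between "C", "F" and "K" by normalizing to Celsius first."""
    if from_unit == "C":
        celsius = value
    elif from_unit == "F":
        celsius = (value - 32) * 5 / 9
    elif from_unit == "K":
        celsius = value - 273.15
    else:
        raise ValueError(f"unknown temperature unit: {from_unit}")

    if to_unit == "C":
        return celsius
    if to_unit == "F":
        return celsius * 9 / 5 + 32
    if to_unit == "K":
        return celsius + 273.15
    raise ValueError(f"unknown temperature unit: {to_unit}")
```

Currency is deliberately omitted here because it needs live exchange rates rather than fixed factors.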
Security Features
- Sandboxed code execution
- Authorized import restrictions
- API timeout handling
- Error recovery mechanisms
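The idea behind authorized import restrictions can be illustrated with a standalone static check. smolagents' CodeAgent accepts an `additional_authorized_imports` argument for this purpose; the sketch below is not how the library implements it, and the allowlist contents and the function name `find_disallowed_imports` are hypothetical.

```python
import ast

# Hypothetical allowlist of top-level modules that generated code may import.
ALLOWED_IMPORTS = {"math", "statistics", "datetime", "re"}


def find_disallowed_imports(source: str, allowed=ALLOWED_IMPORTS) -> list:
    """Return top-level module names imported by `source` that are not allowed."""
    blocked = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have module=None; they show up as "" and are blocked.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in allowed and root not in blocked:
                blocked.append(root)
    return blocked
```

A static check like this only inspects import statements. It does not replace a sandbox, since code can reach modules through other means such as `__import__`.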
Performance Monitoring
The agent provides detailed logging and performance metrics:
- Processing time per question
- Tool usage statistics
- Error tracking and recovery
- Success rate monitoring
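A minimal sketch of how the four metrics listed above could be collected in one place follows. The class `RunMetrics`, the helper `timed`, and the tool names in the usage example are hypothetical, and "success" here means only that a question completed without an error, not that the answer was correct.

```python
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Hypothetical container for per-run statistics."""
    durations: list = field(default_factory=list)       # seconds per question
    tool_calls: Counter = field(default_factory=Counter)  # tool name -> call count
    successes: int = 0
    failures: int = 0

    def record(self, seconds: float, tools_used: list, ok: bool) -> None:
        self.durations.append(seconds)
        self.tool_calls.update(tools_used)
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

For example, after `m.record(1.5, ["web_search", "wikipedia"], True)` and `m.record(0.5, ["web_search"], False)`, `m.success_rate` is 0.5 and `m.tool_calls["web_search"]` is 2.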
Related Links
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference