metadata
title: Gaia Benchmark Agent
emoji: 🏆
colorFrom: red
colorTo: gray
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: false
hf_oauth: true

GAIA Benchmark Agent with smolagents

An AI agent for the GAIA benchmark, built with smolagents and the Qwen/Qwen2.5-32B-Instruct model.

🚀 Features

  • Qwen 32B Model: Qwen/Qwen2.5-32B-Instruct handles the multi-step reasoning
  • Comprehensive Tools: Web search, Wikipedia, calculations, file processing
  • Parallel Processing: Efficient multi-question handling
  • GAIA Optimized: Answer formatting tuned for the benchmark's exact-match scoring
  • Secure Execution: Sandboxed code execution environment

🎯 Performance Target

  • Goal: 30%+ accuracy on GAIA Level 1 questions
  • Approach: Multi-tool reasoning with precise answer formatting
  • Evaluation: Answers must match the expected string exactly
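
Because scoring relies on exact string matching, small formatting differences can turn a correct answer into a miss. The following is a minimal sketch of answer cleanup before comparison. The helper names `normalize_answer` and `is_exact_match` are hypothetical, the "FINAL ANSWER:" prefix is an assumed output convention, and the official GAIA scorer applies its own rules, which may differ.

```python
def normalize_answer(raw: str) -> str:
    """Hypothetical helper: tidy a model answer before exact matching."""
    text = raw.strip()
    # Assumed convention: the model may prepend "FINAL ANSWER:" to its reply.
    prefix = "FINAL ANSWER:"
    if text.upper().startswith(prefix):
        text = text[len(prefix):].strip()
    return text


def is_exact_match(predicted: str, expected: str) -> bool:
    """Case-sensitive comparison after whitespace and prefix cleanup."""
    return normalize_answer(predicted) == expected.strip()
```

The comparison stays case-sensitive on purpose, so a casing mistake surfaces during local testing rather than at submission time.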

πŸ› οΈ Setup Instructions

1. Environment Variables

Set the following in your Space settings:

  • HF_TOKEN: Your Hugging Face API token (required)
  • TAVILY_API_KEY: Tavily search API key (optional)
  • SERPER_API_KEY: Serper search API key (optional)
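
As an illustration of how the application might read these variables at startup, here is a minimal sketch. The `load_settings` function and the keys of the returned dictionary are hypothetical; only the three environment variable names come from the list above.

```python
import os


def load_settings(env=os.environ) -> dict:
    """Read the variables listed above; only HF_TOKEN is required."""
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it in the Space settings.")
    return {
        "hf_token": token,
        # Optional keys resolve to None when absent or empty.
        "tavily_api_key": env.get("TAVILY_API_KEY") or None,
        "serper_api_key": env.get("SERPER_API_KEY") or None,
    }
```

Failing fast on a missing `HF_TOKEN` gives a clear message at startup instead of an authentication error on the first model call.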

2. API Keys

Hugging Face Token

  1. Go to HF Settings
  2. Create a new token with Read permissions
  3. Add to Space secrets as HF_TOKEN

Optional Search APIs

πŸ“ File Structure

├── app.py              # Main Gradio application
├── agent.py            # smolagents implementation
├── requirements.txt    # Python dependencies
├── system_prompt.txt   # Agent instructions
├── .env.example        # Environment template
├── .gitignore          # Git ignore rules
└── README.md           # This file

🔧 Usage

  1. Login: Authenticate with Hugging Face
  2. Test: Try single questions to verify agent functionality
  3. Evaluate: Run full GAIA benchmark evaluation
  4. Submit: Automatic submission to leaderboard

🧠 Agent Architecture

  • Framework: smolagents CodeAgent
  • Model: Qwen/Qwen2.5-32B-Instruct
  • Tools: 8+ specialized tools for different task types
  • Processing: Parallel question handling for efficiency
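
The parallel question handling can be sketched with a thread pool. This is an illustration under stated assumptions, not the code in agent.py: `answer_question` is a hypothetical stand-in for the real agent call, and the default of four workers is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_question(question: str) -> str:
    """Hypothetical placeholder for the real agent call."""
    return f"answer to: {question}"


def answer_all(questions: list[str], max_workers: int = 4) -> list[str]:
    """Answer questions concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(answer_question, questions))
```

Threads suit this workload because each question spends most of its time waiting on network calls to the model and search APIs.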

📊 Tool Capabilities

  • Web Search: Current information and recent events
  • Wikipedia: Factual and historical information
  • Mathematics: Complex calculations and conversions
  • File Processing: Analysis of various file formats
  • Unit Conversion: Length, weight, temperature, currency
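
To make the unit conversion capability concrete, here is a minimal sketch covering length and temperature. The function names and the set of supported units are assumptions for illustration; currency is left out because it needs live exchange rates. A function like this could be exposed to the agent through the smolagents `tool` decorator, which is not shown here.

```python
# Factors that convert each supported length unit to metres.
LENGTH_TO_METRES = {
    "m": 1.0, "km": 1000.0, "cm": 0.01,
    "in": 0.0254, "ft": 0.3048, "mi": 1609.344,
}


def convert_length(value: float, from_unit: str, to_unit: str) -> float:
    """Convert a length by going through metres."""
    for unit in (from_unit, to_unit):
        if unit not in LENGTH_TO_METRES:
            raise ValueError(f"Unsupported length unit: {unit}")
    return value * LENGTH_TO_METRES[from_unit] / LENGTH_TO_METRES[to_unit]


def convert_temperature(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between "C", "F", and "K" by going through Celsius."""
    to_celsius = {
        "C": lambda v: v,
        "F": lambda v: (v - 32.0) * 5.0 / 9.0,
        "K": lambda v: v - 273.15,
    }
    from_celsius = {
        "C": lambda c: c,
        "F": lambda c: c * 9.0 / 5.0 + 32.0,
        "K": lambda c: c + 273.15,
    }
    if from_unit not in to_celsius or to_unit not in from_celsius:
        raise ValueError(f"Unsupported temperature unit: {from_unit} or {to_unit}")
    return from_celsius[to_unit](to_celsius[from_unit](value))
```

Routing every conversion through one base unit keeps the table small: adding a unit takes one entry rather than one entry per unit pair.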

🔒 Security Features

  • Sandboxed code execution
  • Authorized import restrictions
  • API timeout handling
  • Error recovery mechanisms
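
The idea behind authorized import restrictions can be illustrated with a static check over generated code. This is a simplified sketch, not the mechanism smolagents uses internally: the allow-list contents and the `find_unauthorized_imports` helper are assumptions, and a static scan alone does not amount to a sandbox, since it cannot catch dynamic imports.

```python
import ast

# Assumed allow-list for illustration only.
AUTHORIZED_IMPORTS = {"math", "statistics", "datetime"}


def find_unauthorized_imports(source: str, allowed=AUTHORIZED_IMPORTS) -> list[str]:
    """Return top-level module names imported by `source` that are not allowed."""
    blocked = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no module name and are treated as blocked.
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root not in allowed and root not in blocked:
                blocked.append(root)
    return blocked
```

A caller would refuse to run any snippet for which the returned list is non-empty.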

📈 Performance Monitoring

The agent provides detailed logging and performance metrics:

  • Processing time per question
  • Tool usage statistics
  • Error tracking and recovery
  • Success rate monitoring
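
One way to collect these four metrics is a small accumulator wrapped around each agent call. The sketch below is an assumption about how this could look, not the logging code in this repository; `RunMetrics` and `timed_answer` are hypothetical names.

```python
import time
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Hypothetical container for the metrics listed above."""
    durations: list = field(default_factory=list)  # seconds per question
    tool_calls: Counter = field(default_factory=Counter)
    successes: int = 0
    errors: int = 0

    def record(self, seconds: float, tools_used: list, ok: bool) -> None:
        self.durations.append(seconds)
        self.tool_calls.update(tools_used)
        if ok:
            self.successes += 1
        else:
            self.errors += 1

    @property
    def success_rate(self) -> float:
        total = self.successes + self.errors
        return self.successes / total if total else 0.0


def timed_answer(metrics, answer_fn, question, tools_used=()):
    """Run answer_fn, recording duration and outcome; returns None on error."""
    start = time.perf_counter()
    try:
        result, ok = answer_fn(question), True
    except Exception:
        result, ok = None, False
    metrics.record(time.perf_counter() - start, list(tools_used), ok)
    return result
```

Here "success" means only that the call completed without raising; whether the answer is correct is decided separately by the benchmark's scoring.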

🔗 Related Links

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference