# GAIA Multi-Agent Evaluation System

A multi-agent system built with **LangGraph** and **LangChain** to tackle the [GAIA benchmark](https://huggingface.co/spaces/gaia-benchmark/leaderboard), a set of real-world questions that test AI assistants on reasoning, tool use, and multimodal understanding.

## How It Works

A **supervisor agent** analyzes each incoming question and delegates it to one of four specialized sub-agents:

| Agent | Responsibility | Tools |
|---|---|---|
| **Web Research** | Factual lookups, current events, YouTube video analysis | Tavily Search, Wikipedia, Gemini 2.5 Pro Video |
| **Code Execution** | Python programming, algorithms, data processing | Python REPL |
| **File Processing** | Excel, CSV, PDF, audio, image analysis | GAIA File Downloader, Pandas, Whisper, GPT-5-mini Vision |
| **Math/Reasoning** | Arithmetic, algebra, calculus, statistics | Calculator, Python REPL |

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams and data flow.

## Project Structure

```
├── app.py                  # Gradio UI + submission logic
├── agent.py                # GAIAAgent class (supervisor wrapper)
├── prompts.py              # Shared GAIA answer format prompt
├── agents/
│   ├── supervisor.py       # LangGraph supervisor graph
│   ├── web_research.py     # Web search + video agent
│   ├── code_agent.py       # Code execution agent
│   ├── file_agent.py       # File processing agent
│   └── math_agent.py       # Math/reasoning agent
├── tools/
│   ├── search_tools.py     # Tavily + Wikipedia
│   ├── video_tools.py      # Gemini YouTube video analysis
│   ├── code_tools.py       # Python REPL
│   ├── file_tools.py       # File download, Excel, audio, image, PDF
│   └── math_tools.py       # Calculator + Python REPL
├── requirements.txt
└── test_agent.py           # Local testing script
```

## Setup

### Environment Variables

Set these in a local `.env` file:

| Variable | Purpose |
|---|---|
| `OPENAI_API_KEY` | GPT-5-mini for reasoning, vision, and Whisper transcription |
| `TAVILY_API_KEY` | Web search via Tavily |
| `GOOGLE_API_KEY` | Gemini 2.5 Pro for YouTube video analysis |
| `HF_TOKEN` | HuggingFace token for downloading GAIA dataset files |

### Local Development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python test_agent.py   # test on a random GAIA question
python app.py          # launch Gradio UI
```

## Scoring

The GAIA benchmark uses **exact match** scoring. The agent uses the official GAIA answer format prompt: it reasons through each question before producing a concise `FINAL ANSWER` (a number, a few words, or a comma-separated list) with no articles, abbreviations, or units unless specified.
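
Because scoring is exact match, stray whitespace or reasoning text left around the answer can cost a point. The snippet below is a minimal sketch of how a response might be reduced to its final answer and compared locally before submission. The helper names (`extract_final_answer`, `normalize`, `is_exact_match`) are hypothetical and not taken from this repository. It assumes the marker appears as `FINAL ANSWER:`, and its normalization is a simplification that may differ from the official GAIA scorer.

```python
import re

# Hypothetical helpers for illustration only; they are not part of this
# repository and do not reproduce the official GAIA scorer.
# Assumption: the model writes its answer after a "FINAL ANSWER:" marker.
_MARKER = re.compile(r"FINAL ANSWER:?\s*", re.IGNORECASE)


def extract_final_answer(response: str) -> str:
    """Return the text after the last marker, or the whole response if absent."""
    matches = list(_MARKER.finditer(response))
    if not matches:
        return response.strip()
    return response[matches[-1].end():].strip()


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and tidy spacing around list commas."""
    parts = [" ".join(part.split()) for part in answer.lower().split(",")]
    return ", ".join(parts)


def is_exact_match(predicted: str, expected: str) -> bool:
    """Compare two answers after the simplified normalization above."""
    return normalize(predicted) == normalize(expected)


if __name__ == "__main__":
    reply = "Counting the rows gives three colors.\nFINAL ANSWER: Red,  Green, blue"
    answer = extract_final_answer(reply)
    print(answer)
    print(is_exact_match(answer, "red, green, blue"))
```

Taking the last marker rather than the first guards against a response that mentions the phrase while reasoning and then states the answer at the end.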