# GAIA Agent Development Plan

This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.

**I. Understanding the Task & Data:**

1. **Analyze `common_questions.json`:**
    * **Structure:** Each entry has `task_id`, `Question`, `Level`, `Final answer`, and sometimes `file_name`.
    * **Question Types:** Identify patterns:
        * Direct information retrieval (e.g., "How many studio albums...").
        * Web search required (e.g., "On June 6, 2023, an article...").
        * File-based questions (audio, images, code - indicated by `file_name`).
        * Logic/reasoning puzzles (e.g., the table-based commutativity question, the reversed sentence).
        * Multi-step questions.
    * **Answer Format:** Observe the format of `Final answer` for each type. Note the guidance in `docs/submission_instructions.md` regarding formatting (numbers, few words, comma-separated lists).
    * **File Dependencies:** List all unique `file_name` extensions to understand which file-processing capabilities are needed (e.g., `.mp3`, `.png`, `.py`, `.xlsx`); see the inspection snippet after this list.
2. **Review Project Context:**
    * **Agent Interface:** The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method).
    * **Evaluation:** Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
    * **Model:** The agent will use an LLM (such as the Llama 3 model mentioned in `docs/log.md`).
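
As a concrete starting point for step I.1, a short inspection script can tabulate the question levels and attached file extensions. This is a minimal sketch only: it assumes `common_questions.json` is a JSON array in the working directory and uses the field names listed above.

```python
import json
from collections import Counter
from pathlib import Path

# Assumed location of the benchmark questions; adjust to the repo layout.
questions = json.loads(Path("common_questions.json").read_text())

levels = Counter(q["Level"] for q in questions)
extensions = Counter(
    Path(q["file_name"]).suffix for q in questions if q.get("file_name")
)

print("Questions per level:", dict(levels))
print("Attached file extensions:", dict(extensions))
```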
**II. Agent Architecture Design (Conceptual):**

1. **Core Agent Loop (`MyAgent.answer` or `MyAgent.__call__`):**
    * **Input:** Question string (and `task_id`/`file_name` if passed separately or parsed from a richer input object).
    * **Step 1: Question Analysis & Planning:**
        * Use the LLM to understand the question.
        * Determine the type of question (web search, file processing, direct knowledge, etc.).
        * Identify if any tools are needed.
        * Formulate a high-level plan (e.g., "Search web for X, then extract Y from the page").
    * **Step 2: Tool Selection & Execution (if needed):**
        * Based on the plan, select and invoke appropriate tools.
        * Pass necessary parameters to tools (e.g., search query, file path).
        * Collect tool outputs.
    * **Step 3: Information Synthesis & Answer Generation:**
        * Use the LLM to process tool outputs and any retrieved information.
        * Generate the final answer string.
    * **Step 4: Answer Formatting:**
        * Ensure the answer conforms to the expected format (using guidance from the `common_questions.json` examples and `docs/submission_instructions.md`). This might involve specific post-processing rules or prompting the LLM for a specific format.
    * **Output:** Return the formatted answer string. (A sketch of this loop appears at the end of this section.)
2. **Key Modules/Components:**
    * **LLM Interaction Module:**
        * Handles communication with the chosen LLM (e.g., GPT4All Llama 3).
        * Manages prompt construction (system prompts, user prompts, few-shot examples if useful).
        * Parses LLM responses.
    * **Tool Library:** A collection of functions/classes that the agent can call (see the interface sketch at the end of this section).
        * `WebSearchTool`:
            * Input: Search query.
            * Action: Uses a search engine API (or simulates browsing if necessary, though a direct API is better).
            * Output: List of search results (titles, snippets, URLs) or page content.
        * `FileReaderTool`:
            * Input: File path (derived from `file_name` and `task_id` to locate/fetch the file).
            * Action: Reads content based on file type:
                * Text files (`.py`): read as a string.
                * Spreadsheets (`.xlsx`): parse relevant data (requires a library like `pandas` or `openpyxl`).
                * Audio files (`.mp3`): transcribe to text (requires a speech-to-text model/API).
                * Image files (`.png`): describe image content or extract text (requires a vision model/API or OCR).
            * Output: Processed content (text, structured data).
        * `CodeInterpreterTool` (for `.py` files such as in task `f918266a-b3e0-4914-865d-4faa564f1aef`):
            * Input: Python code string.
            * Action: Executes the code in a sandboxed environment.
            * Output: Captured stdout/stderr or the final expression value.
        * *(Potentially)* `KnowledgeBaseTool`: if there is a way to pre-process or index relevant documents/FAQs for faster lookups (though most GAIA questions imply dynamic information retrieval).
    * **File Management/Access:**
        * Mechanism to locate/download files associated with `task_id` and `file_name`. The API endpoint `GET /files/{task_id}` from `docs/API.md` is relevant here (see the download helper at the end of this section). For local testing with `common_questions.json`, ensure these files are available locally.
    * **Prompt Engineering Strategy:**
        * Develop a set of system prompts to guide the agent's behavior (e.g., "You are a helpful AI assistant designed to answer questions from the GAIA benchmark...").
        * Develop task-specific prompts or prompt templates for different question types or tool usage.
        * Incorporate answer formatting instructions into prompts.
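
A minimal sketch of the core loop from II.1 is shown below. Everything here is illustrative: `MyAgent`, the `self.llm` callable, and the keyword-based tool dispatch are placeholder assumptions rather than the project's actual implementation; the real agent still has to expose the `__call__(self, question: str) -> str` interface expected by `app.py`.

```python
# Example system prompt; the real wording should follow docs/submission_instructions.md.
SYSTEM_PROMPT = (
    "You are a helpful AI assistant answering GAIA benchmark questions. "
    "Reply with only the final answer: a number, a few words, or a "
    "comma-separated list, with no extra commentary."
)


class MyAgent:
    def __init__(self, llm, tools):
        self.llm = llm      # callable: prompt string -> completion string (assumed)
        self.tools = tools  # e.g. {"web_search": WebSearchTool(), ...}

    def __call__(self, question: str) -> str:
        # Step 1: ask the LLM to classify the question and propose a plan.
        plan = self.llm(f"{SYSTEM_PROMPT}\n\nClassify and plan:\n{question}")

        # Step 2: run whichever tools the plan mentions (naive keyword check;
        # a real agent would parse a structured plan instead).
        evidence = []
        for name, tool in self.tools.items():
            if name in plan:
                evidence.append(tool.run(question))

        # Step 3: synthesise an answer from the question plus tool outputs.
        context = "\n".join(evidence)
        answer = self.llm(
            f"{SYSTEM_PROMPT}\n\nQuestion: {question}\n\n"
            f"Evidence:\n{context}\n\nFinal answer:"
        )

        # Step 4: light post-processing toward the expected answer format.
        return answer.strip().strip(".")
```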
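
The tool interfaces in II.2 might be stubbed out roughly as follows. The class names mirror the plan above, but the bodies are placeholders: the web-search backend is deliberately left open, audio/image handling is not wired up, and a subprocess with a timeout is only a rough stand-in for real sandboxing.

```python
import subprocess
import sys
from pathlib import Path

import pandas as pd


class WebSearchTool:
    """Wraps whichever search backend is chosen; only the interface is fixed."""

    def run(self, query: str) -> str:
        # Placeholder: call a search API here and return titles/snippets/URLs.
        raise NotImplementedError("Plug in a search API (SerpAPI, DuckDuckGo, ...)")


class FileReaderTool:
    """Dispatches on file extension and returns text the LLM can consume."""

    def run(self, path: str) -> str:
        p = Path(path)
        if p.suffix in {".py", ".txt", ".csv"}:
            return p.read_text()
        if p.suffix == ".xlsx":
            return pd.read_excel(p).to_string()  # requires pandas + openpyxl
        if p.suffix in {".mp3", ".png"}:
            # Audio needs a speech-to-text model, images a vision model or OCR.
            raise NotImplementedError(f"No handler wired up yet for {p.suffix}")
        raise ValueError(f"Unsupported file type: {p.suffix}")


class CodeInterpreterTool:
    """Runs a Python file in a subprocess and captures its output."""

    def run(self, path: str) -> str:
        # NOTE: a subprocess with a timeout is not a real sandbox; use a
        # container or restricted environment for untrusted code.
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=30
        )
        return result.stdout if result.returncode == 0 else result.stderr
```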
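
For the file-access mechanism in II.2, a small download helper could look like the following. The `GET /files/{task_id}` route comes from `docs/API.md`, but the base URL here is a placeholder assumption that would come from configuration.

```python
from pathlib import Path

import requests

API_BASE = "https://example-scoring-server"  # placeholder; read from config in practice


def fetch_task_file(task_id: str, file_name: str, dest_dir: str = "files") -> str:
    """Download the attachment for a task and return its local path."""
    response = requests.get(f"{API_BASE}/files/{task_id}", timeout=30)
    response.raise_for_status()

    path = Path(dest_dir) / file_name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(response.content)
    return str(path)
```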
**III. Development & Testing Strategy:**

1. **Environment Setup:**
    * Install the necessary Python libraries for LLM interaction, web requests, and file processing (e.g., `requests`, `beautifulsoup4` for web scraping if needed, `pandas`, `Pillow` for images, speech-recognition libraries, etc.).
2. **Iterative Implementation:**
    * **Phase 1: Basic LLM Agent:** Start with an agent that only uses the LLM for direct-answer questions (no tools).
    * **Phase 2: Web Search Integration:** Implement the `WebSearchTool` and integrate it for questions requiring web lookups.
    * **Phase 3: File Handling:**
        * Implement `FileReaderTool` for one file type at a time (e.g., start with `.txt` or `.py`, then `.mp3`, `.png`, `.xlsx`).
        * Implement `CodeInterpreterTool`.
    * **Phase 4: Complex Reasoning & Multi-step:** Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
3. **Testing:**
    * Use `common_questions.json` as the primary test set.
    * Adapt the script from `docs/testing_recipe.md` to run your agent against these questions and compare outputs (see the harness sketch after this list).
    * Focus on one question type or `task_id` at a time for debugging.
    * Log the agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
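
A harness in the spirit of `docs/testing_recipe.md` might look like the sketch below. The `normalize` here is only a stand-in for the project's real function, and the agent is any callable with the `__call__(question) -> str` interface described above.

```python
import json
from pathlib import Path


def normalize(text: str) -> str:
    # Stand-in for the real normalize() in docs/testing_recipe.md.
    return " ".join(text.strip().lower().split())


def run_suite(agent, questions_path: str = "common_questions.json") -> None:
    questions = json.loads(Path(questions_path).read_text())
    correct = 0
    for q in questions:
        predicted = agent(q["Question"])
        expected = q["Final answer"]
        ok = normalize(predicted) == normalize(expected)
        correct += ok
        status = "OK" if ok else f"got {predicted!r}, want {expected!r}"
        print(f"{q['task_id']}: {status}")
    print(f"{correct}/{len(questions)} correct")
```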
**IV. Pre-computation/Pre-analysis (before coding):**

1. **Map Question Types to Tools:** For each question in `common_questions.json`, manually note which tool(s) would ideally be used. This helps prioritize tool development.
    * Example:
        * `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` (Mercedes Sosa albums): WebSearchTool
        * `cca530fc-4052-43b2-b130-b30968d8aa44` (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
        * `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (Pie ingredients): FileReaderTool (audio) + SpeechToText
        * `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): FileReaderTool (code) + CodeInterpreterTool
2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.