Yago Bolivar committed on
Commit 736bdeb · Parent(s): a3c3cd5
fix: clarify GAIA agent development plan and remove unnecessary lines from testing recipe
- docs/devplan.md +5 -6
- docs/testing_recipe.md +1 -6
- notes.md +3 -1
docs/devplan.md
CHANGED
@@ -1,5 +1,6 @@
 # GAIA Agent Development Plan
-
+This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.
+
 **I. Understanding the Task & Data:**
 
 1. **Analyze common_questions.json:**
@@ -16,7 +17,7 @@
 2. **Review Project Context:**
 * **Agent Interface:** The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method).
 * **Evaluation:** Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
-* **Model:** The agent will
+* **Model:** The agent will use an LLM (like the Llama 3 model mentioned in `docs/log.md`).
 
 **II. Agent Architecture Design (Conceptual):**
 
@@ -82,7 +83,7 @@
 * **Phase 4: Complex Reasoning & Multi-step:** Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
 3. **Testing:**
 * Use `common_questions.json` as the primary test set.
-* Adapt the script from `docs/testing_recipe.md`
+* Adapt the script from `docs/testing_recipe.md` to run your agent against these questions and compare outputs.
 * Focus on one question type or `task_id` at a time for debugging.
 * Log agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
 
@@ -94,6 +95,4 @@
 * `cca530fc-4052-43b2-b130-b30968d8aa44` (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
 * `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (Pie ingredients): FileReaderTool (audio) + SpeechToText
 * `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): FileReaderTool (code) + CodeInterpreterTool
-2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
-
-This structured approach should provide a solid foundation for developing the agent. The key will be modularity, robust tool implementation, and effective prompt engineering to guide the LLM.
+2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
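The `BasicAgent` contract referenced in the plan (an `__init__` plus a `__call__(self, question: str) -> str`) can be sketched as below. Only the method signatures come from the plan; the tool registry and the placeholder answer logic are hypothetical stand-ins, not the project's actual implementation.

```python
class BasicAgent:
    """Minimal shape of the agent expected by app.py:
    construct once, then call with a question string and get
    an answer string back."""

    def __init__(self):
        # A real implementation would initialize the LLM client and
        # register tools (FileReaderTool, CodeInterpreterTool, ...) here.
        self.tools: dict[str, object] = {}

    def __call__(self, question: str) -> str:
        # Placeholder logic: a real agent would plan, invoke tools,
        # and synthesize the final answer with the LLM.
        return f"Answer to: {question}"
```

Keeping the interface this narrow means the evaluation script never needs to know about the agent's internals; any planner or tool chain can be swapped in behind `__call__`.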
docs/testing_recipe.md
CHANGED
@@ -1,6 +1,3 @@
-Pensó durante 4 segundos
-
-
 Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.
 
 ---
@@ -113,6 +110,4 @@ python3 evaluate_agent.py question_set/common_questions.json
 ### 5 Interpreting results
 
 * **Exact-match accuracy** (>= 100 % means your agent reproduced all answers).
-* **Latency** helps you spot outliers in run time (e.g. long tool chains).
-
-That’s all you need to benchmark quickly. Happy testing!
+* **Latency** helps you spot outliers in run time (e.g. long tool chains).
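The exact-match comparison the recipe is built around can be sketched as below. This `normalize` (lowercase, trim, collapse whitespace, drop trailing punctuation) is an assumption about what the project's helper does, not a copy of it; adapt it to match the real function before trusting the scores.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalizer so that, e.g., '  Right. ' and 'right'
    compare equal: lowercase, strip, collapse inner whitespace,
    drop a trailing period."""
    text = text.strip().lower()
    text = re.sub(r"\s+", " ", text)
    return text.rstrip(".")


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that match their reference exactly
    after normalization -- the GAIA-style scoring the recipe targets."""
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```

For example, `exact_match_accuracy(["Right. ", "b"], ["right", "c"])` scores the first pair as a hit and the second as a miss, giving 0.5.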
notes.md
CHANGED
@@ -1,4 +1,6 @@
-#
+# NOTES
+## general notes
+- There are 5 questions that require the interpretation of a file
 
 ## utilities/ Python scripts:
 - random_questions.py: fetches random questions from the GAIA API