Update
CHANGELOG.md
CHANGED
|
@@ -25,9 +25,11 @@
|
|
| 25 |
- **app.py**
|
| 26 |
- Updated `check_api_keys()` - Added HF_TOKEN status display in Test & Debug tab
|
| 27 |
- UI now shows: "HF_TOKEN (HuggingFace): ✓ SET" or "✗ MISSING"
|
| 28 |
-
- Added `export_results_to_markdown(results_log, submission_status)` - Export evaluation results
|
| 29 |
-
|
| 30 |
-
|
| 31 |
- Updated run_button click handler - Now outputs 3 values (status, table, export_path)
|
| 32 |
|
| 33 |
- **src/tools/__init__.py** (Fixed earlier in session)
|
|
|
|
| 25 |
- **app.py**
|
| 26 |
- Updated `check_api_keys()` - Added HF_TOKEN status display in Test & Debug tab
|
| 27 |
- UI now shows: "HF_TOKEN (HuggingFace): ✓ SET" or "✗ MISSING"
|
| 28 |
+
- Added `export_results_to_markdown(results_log, submission_status)` - Export evaluation results with environment detection
|
| 29 |
+
- Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.md
|
| 30 |
+
- HF Spaces: Saves to ./exports/gaia_results_TIMESTAMP.md (fixes cloud deployment issue)
|
| 31 |
+
- Updated `run_and_submit_all()` - ALL return paths now export results
|
| 32 |
+
- Added gr.File download button - Users can directly download results (better UX than textbox)
|
| 33 |
- Updated run_button click handler - Now outputs 3 values (status, table, export_path)
|
| 34 |
|
| 35 |
- **src/tools/__init__.py** (Fixed earlier in session)
|
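The environment detection described in the changelog entries above can be sketched as a small pure function. This is a minimal illustration, not the code from the commit: `resolve_export_path` and `default_export_path` are hypothetical names, and only the `SPACE_ID` check, the `exports` and `~/Downloads` targets, and the `gaia_results_TIMESTAMP.md` naming come from the diff.

```python
import os
from datetime import datetime


def resolve_export_path(now: datetime, on_spaces: bool, base_dir: str, home_dir: str) -> str:
    """Return the target path for a results export (hypothetical helper)."""
    # Same timestamp format as the diff: YYYYMMDD_HHMMSS
    filename = f"gaia_results_{now.strftime('%Y%m%d_%H%M%S')}.md"
    if on_spaces:
        # HF Spaces: a directory under the working directory
        export_dir = os.path.join(base_dir, "exports")
    else:
        # Local run: the user's Downloads folder
        export_dir = os.path.join(home_dir, "Downloads")
    return os.path.join(export_dir, filename)


def default_export_path() -> str:
    """Resolve the path from the real environment (assumed wiring)."""
    return resolve_export_path(
        now=datetime.now(),
        on_spaces=bool(os.getenv("SPACE_ID")),
        base_dir=os.getcwd(),
        home_dir=os.path.expanduser("~"),
    )
```

Passing the clock and the directories in as arguments keeps the branch logic testable without touching the filesystem or environment variables.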
app.py
CHANGED
|
@@ -35,13 +35,26 @@ def check_api_keys():
|
|
| 35 |
|
| 36 |
|
| 37 |
def export_results_to_markdown(results_log: list, submission_status: str) -> str:
|
| 38 |
-
"""Export evaluation results to markdown file
|
| 39 |
from datetime import datetime
|
| 40 |
|
| 41 |
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 42 |
-
downloads_dir = os.path.expanduser("~/Downloads")
|
| 43 |
filename = f"gaia_results_{timestamp}.md"
|
| 44 |
-
|
| 45 |
|
| 46 |
with open(filepath, 'w') as f:
|
| 47 |
# Header
|
|
@@ -414,10 +427,9 @@ with gr.Blocks() as demo:
|
|
| 414 |
# Removed max_rows=10 from DataFrame constructor
|
| 415 |
results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
|
| 416 |
|
| 417 |
-
export_output = gr.
|
| 418 |
-
label="
|
| 419 |
-
|
| 420 |
-
interactive=False
|
| 421 |
)
|
| 422 |
|
| 423 |
run_button.click(fn=run_and_submit_all, outputs=[status_output, results_table, export_output])
|
|
|
|
| 35 |
|
| 36 |
|
| 37 |
def export_results_to_markdown(results_log: list, submission_status: str) -> str:
|
| 38 |
+
"""Export evaluation results to markdown file.
|
| 39 |
+
|
| 40 |
+
- Local: Saves to ~/Downloads
|
| 41 |
+
- HF Spaces: Saves to ./exports/ (for Gradio file download)
|
| 42 |
+
"""
|
| 43 |
from datetime import datetime
|
| 44 |
|
| 45 |
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
|
|
|
| 46 |
filename = f"gaia_results_{timestamp}.md"
|
| 47 |
+
|
| 48 |
+
# Detect environment: HF Spaces or local
|
| 49 |
+
if os.getenv("SPACE_ID"):
|
| 50 |
+
# HF Spaces: save to local exports directory for Gradio to serve
|
| 51 |
+
export_dir = os.path.join(os.getcwd(), "exports")
|
| 52 |
+
os.makedirs(export_dir, exist_ok=True)
|
| 53 |
+
filepath = os.path.join(export_dir, filename)
|
| 54 |
+
else:
|
| 55 |
+
# Local: save to Downloads folder
|
| 56 |
+
downloads_dir = os.path.expanduser("~/Downloads")
|
| 57 |
+
filepath = os.path.join(downloads_dir, filename)
|
| 58 |
|
| 59 |
with open(filepath, 'w') as f:
|
| 60 |
# Header
|
|
|
|
| 427 |
# Removed max_rows=10 from DataFrame constructor
|
| 428 |
results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
|
| 429 |
|
| 430 |
+
export_output = gr.File(
|
| 431 |
+
label="Download Results",
|
| 432 |
+
type="filepath"
|
|
|
|
| 433 |
)
|
| 434 |
|
| 435 |
run_button.click(fn=run_and_submit_all, outputs=[status_output, results_table, export_output])
|
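The changelog notes that every return path of `run_and_submit_all()` now exports results and that the click handler expects three outputs. One way to get that guarantee is a single exit point that always writes the file. The sketch below is an assumption-laden illustration, not the commit's implementation: `run_and_export`, `export_markdown`, `answer_fn`, and the fixed `results.md` filename are hypothetical, while the column names mirror the exported table.

```python
import os
import tempfile

COLUMNS = ("Task ID", "Question", "Submitted Answer")


def export_markdown(results_log, status, export_dir):
    """Write a small markdown report and return its path (hypothetical helper)."""
    os.makedirs(export_dir, exist_ok=True)
    path = os.path.join(export_dir, "results.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# GAIA Agent Evaluation Results\n\n")
        f.write(f"## Submission Status\n\n{status}\n\n")
        f.write("| " + " | ".join(COLUMNS) + " |\n")
        f.write("|" + "|".join("---" for _ in COLUMNS) + "|\n")
        for row in results_log:
            # Escape pipes so cell content cannot break the table
            cells = [str(row.get(c, "")).replace("|", "\\|") for c in COLUMNS]
            f.write("| " + " | ".join(cells) + " |\n")
    return path


def run_and_export(answer_fn, questions, export_dir):
    """Return (status, results_log, export_path) on success and on failure."""
    results_log = []
    try:
        for task_id, question in questions:
            results_log.append({
                "Task ID": task_id,
                "Question": question,
                "Submitted Answer": answer_fn(question),
            })
        status = "Run finished"
    except Exception as exc:  # the failure path still falls through to the export
        status = f"Run failed: {exc}"
    return status, results_log, export_markdown(results_log, status, export_dir)
```

Because the function always returns a three-element tuple, a handler wired to three output components receives a file path even when the run fails.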
dev/{dev_260103_16_huggingface_integration.md → dev_260103_16_huggingface_llm_integration.md}
RENAMED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
# [dev_260103_16] HuggingFace
|
| 2 |
|
| 3 |
**Date:** 2026-01-03
|
| 4 |
**Type:** Development
|
|
@@ -27,7 +27,7 @@
|
|
| 27 |
|
| 28 |
## Key Decisions
|
| 29 |
|
| 30 |
-
### **Decision 1: HuggingFace
|
| 31 |
|
| 32 |
**Why chosen:**
|
| 33 |
|
|
@@ -50,7 +50,7 @@
|
|
| 50 |
|
| 51 |
- ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
|
| 52 |
- ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
|
| 53 |
-
- ✅ Free on HuggingFace
|
| 54 |
- ✅ 72B parameters - sufficient intelligence for GAIA tasks
|
| 55 |
|
| 56 |
**Considered alternatives:**
|
|
@@ -102,7 +102,7 @@
|
|
| 102 |
|
| 103 |
## Outcome
|
| 104 |
|
| 105 |
-
Successfully integrated HuggingFace
|
| 106 |
|
| 107 |
**Deliverables:**
|
| 108 |
|
|
|
|
| 1 |
+
# [dev_260103_16] HuggingFace LLM API Integration
|
| 2 |
|
| 3 |
**Date:** 2026-01-03
|
| 4 |
**Type:** Development
|
|
|
|
| 27 |
|
| 28 |
## Key Decisions
|
| 29 |
|
| 30 |
+
### **Decision 1: HuggingFace LLM API over Ollama (local LLMs)**
|
| 31 |
|
| 32 |
**Why chosen:**
|
| 33 |
|
|
|
|
| 50 |
|
| 51 |
- ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
|
| 52 |
- ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
|
| 53 |
+
- ✅ Free on HuggingFace LLM API
|
| 54 |
- ✅ 72B parameters - sufficient intelligence for GAIA tasks
|
| 55 |
|
| 56 |
**Considered alternatives:**
|
|
|
|
| 102 |
|
| 103 |
## Outcome
|
| 104 |
|
| 105 |
+
Successfully integrated HuggingFace LLM API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.
|
| 106 |
|
| 107 |
**Deliverables:**
|
| 108 |
|
exports/gaia_results_20260104_005516.md
ADDED
|
@@ -0,0 +1,35 @@
|
|
| 1 |
+
# GAIA Agent Evaluation Results
|
| 2 |
+
|
| 3 |
+
**Generated:** 2026-01-04 00:55:16
|
| 4 |
+
|
| 5 |
+
## Submission Status
|
| 6 |
+
|
| 7 |
+
Submission Successful!
|
| 8 |
+
User: mangoobee
|
| 9 |
+
Overall Score: 0.0% (0/20 correct)
|
| 10 |
+
Message: Score calculated successfully: 0/20 total questions answered correctly (20 valid tasks attempted). Score did not improve previous record, leaderboard not updated.
|
| 11 |
+
|
| 12 |
+
## Questions and Answers
|
| 13 |
+
|
| 14 |
+
| Task ID | Question | Submitted Answer |
|
| 15 |
+
|---------|----------|------------------|
|
| 16 |
+
| 8e867cd7-cff9-4e6c-867a-ff5ddc2550be | How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can ... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 17 |
+
| a1e91b78-d3d8-4675-bb8d-62741b4b68a6 | In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird spec... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 18 |
+
| 2d83110e-a098-4ebb-9987-066c06fa42d0 | .rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 19 |
+
| cca530fc-4052-43b2-b130-b30968d8aa44 | Review the chess position provided in the image. It is black's turn. Provide the correct next mov... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 20 |
+
| 4fc2f1ae-8625-45b5-ab34-ad4433bc21f8 | Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 21 |
+
| 6f37996b-2ac7-44b0-8e68-6d28256631b4 | Given this table defining * on the set S = {a, b, c, d, e} \|*\|a\|b\|c\|d\|e\| \|---\|---\|---\... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 22 |
+
| 9d191bce-651d-4746-be2d-7ef8ecadb9c2 | Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec. What does Teal'c say in respon... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 23 |
+
| cabe07ed-9eca-40ea-8ead-410ef5e83f91 | What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry mate... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 24 |
+
| 3cef3a44-215e-4aed-8e3b-b1e3f08063b7 | I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler w... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 25 |
+
| 99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3 | Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 26 |
+
| 305ac316-eef6-4446-960a-92d80d542f82 | Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 27 |
+
| f918266a-b3e0-4914-865d-4faa564f1aef | What is the final numeric output from the attached Python code? | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 28 |
+
| 3f57289b-8c60-48be-bd80-01f8099ca449 | How many at bats did the Yankee with the most walks in the 1977 regular season have that same sea... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 29 |
+
| 1f975693-876d-457b-a649-393859e79bf3 | Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 30 |
+
| 840bfca7-4f7b-481a-8794-c560c340185d | On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This art... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 31 |
+
| bda648d7-d618-4883-88f4-3466eabd860e | Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 32 |
+
| cf106601-ab4f-4af9-b045-5295fe67b37d | What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 33 |
+
| a0c07678-e491-4bbc-8f0b-07405144218f | Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 34 |
+
| 7bd855d8-463d-4ed5-93ca-5fe35145f733 | The attached Excel file contains the sales of menu items for a local fast-food chain. What were t... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 35 |
+
| 5a0c1adf-205e-4841-a666-7c3ef95def9d | What is the first name of the only Malko Competition recipient from the 20th Century (after 1977)... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
exports/gaia_results_20260104_005610.md
ADDED
|
@@ -0,0 +1,35 @@
|
|
| 1 |
+
# GAIA Agent Evaluation Results
|
| 2 |
+
|
| 3 |
+
**Generated:** 2026-01-04 00:56:10
|
| 4 |
+
|
| 5 |
+
## Submission Status
|
| 6 |
+
|
| 7 |
+
Submission Successful!
|
| 8 |
+
User: mangoobee
|
| 9 |
+
Overall Score: 0.0% (0/20 correct)
|
| 10 |
+
Message: Score calculated successfully: 0/20 total questions answered correctly (20 valid tasks attempted). Score did not improve previous record, leaderboard not updated.
|
| 11 |
+
|
| 12 |
+
## Questions and Answers
|
| 13 |
+
|
| 14 |
+
| Task ID | Question | Submitted Answer |
|
| 15 |
+
|---------|----------|------------------|
|
| 16 |
+
| 8e867cd7-cff9-4e6c-867a-ff5ddc2550be | How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can ... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 17 |
+
| a1e91b78-d3d8-4675-bb8d-62741b4b68a6 | In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird spec... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 18 |
+
| 2d83110e-a098-4ebb-9987-066c06fa42d0 | .rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 19 |
+
| cca530fc-4052-43b2-b130-b30968d8aa44 | Review the chess position provided in the image. It is black's turn. Provide the correct next mov... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 20 |
+
| 4fc2f1ae-8625-45b5-ab34-ad4433bc21f8 | Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 21 |
+
| 6f37996b-2ac7-44b0-8e68-6d28256631b4 | Given this table defining * on the set S = {a, b, c, d, e} \|*\|a\|b\|c\|d\|e\| \|---\|---\|---\... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 22 |
+
| 9d191bce-651d-4746-be2d-7ef8ecadb9c2 | Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec. What does Teal'c say in respon... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 23 |
+
| cabe07ed-9eca-40ea-8ead-410ef5e83f91 | What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry mate... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 24 |
+
| 3cef3a44-215e-4aed-8e3b-b1e3f08063b7 | I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler w... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 25 |
+
| 99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3 | Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 26 |
+
| 305ac316-eef6-4446-960a-92d80d542f82 | Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 27 |
+
| f918266a-b3e0-4914-865d-4faa564f1aef | What is the final numeric output from the attached Python code? | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 28 |
+
| 3f57289b-8c60-48be-bd80-01f8099ca449 | How many at bats did the Yankee with the most walks in the 1977 regular season have that same sea... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 29 |
+
| 1f975693-876d-457b-a649-393859e79bf3 | Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 30 |
+
| 840bfca7-4f7b-481a-8794-c560c340185d | On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This art... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 31 |
+
| bda648d7-d618-4883-88f4-3466eabd860e | Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 32 |
+
| cf106601-ab4f-4af9-b045-5295fe67b37d | What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 33 |
+
| a0c07678-e491-4bbc-8f0b-07405144218f | Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 34 |
+
| 7bd855d8-463d-4ed5-93ca-5fe35145f733 | The attached Excel file contains the sales of menu items for a local fast-food chain. What were t... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|
| 35 |
+
| 5a0c1adf-205e-4841-a666-7c3ef95def9d | What is the first name of the only Malko Competition recipient from the 20th Century (after 1977)... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
|