mangubee committed on
Commit 2fc4228 · 1 Parent(s): cba6286
CHANGELOG.md CHANGED
@@ -25,9 +25,11 @@
  - **app.py**
  - Updated `check_api_keys()` - Added HF_TOKEN status display in Test & Debug tab
  - UI now shows: "HF_TOKEN (HuggingFace): ✓ SET" or "✗ MISSING"
- - Added `export_results_to_markdown(results_log, submission_status)` - Export evaluation results to markdown file
- - Updated `run_and_submit_all()` - ALL return paths now export results to ~/Downloads/gaia_results_TIMESTAMP.md
- - Added export_output UI component - Displays exported file path to user
+ - Added `export_results_to_markdown(results_log, submission_status)` - Export evaluation results with environment detection
+ - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.md
+ - HF Spaces: Saves to ./exports/gaia_results_TIMESTAMP.md (fixes cloud deployment issue)
+ - Updated `run_and_submit_all()` - ALL return paths now export results
+ - Added gr.File download button - Users can directly download results (better UX than textbox)
  - Updated run_button click handler - Now outputs 3 values (status, table, export_path)

  - **src/tools/__init__.py** (Fixed earlier in session)
app.py CHANGED
@@ -35,13 +35,26 @@ def check_api_keys():


  def export_results_to_markdown(results_log: list, submission_status: str) -> str:
-     """Export evaluation results to markdown file in Downloads folder."""
+     """Export evaluation results to markdown file.
+
+     - Local: Saves to ~/Downloads
+     - HF Spaces: Saves to ./exports/ (for Gradio file download)
+     """
      from datetime import datetime

      timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-     downloads_dir = os.path.expanduser("~/Downloads")
      filename = f"gaia_results_{timestamp}.md"
-     filepath = os.path.join(downloads_dir, filename)
+
+     # Detect environment: HF Spaces or local
+     if os.getenv("SPACE_ID"):
+         # HF Spaces: save to local exports directory for Gradio to serve
+         export_dir = os.path.join(os.getcwd(), "exports")
+         os.makedirs(export_dir, exist_ok=True)
+         filepath = os.path.join(export_dir, filename)
+     else:
+         # Local: save to Downloads folder
+         downloads_dir = os.path.expanduser("~/Downloads")
+         filepath = os.path.join(downloads_dir, filename)

      with open(filepath, 'w') as f:
          # Header
@@ -414,10 +427,9 @@ with gr.Blocks() as demo:
  # Removed max_rows=10 from DataFrame constructor
  results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)

- export_output = gr.Textbox(
-     label="Exported Results",
-     placeholder="Results will be exported to markdown file in ~/Downloads",
-     interactive=False
+ export_output = gr.File(
+     label="Download Results",
+     type="filepath"
  )

  run_button.click(fn=run_and_submit_all, outputs=[status_output, results_table, export_output])
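
The environment check in `export_results_to_markdown()` can be isolated into a small path helper. The sketch below is an illustration, not code from this commit: `resolve_export_path` and its `now`, `env`, and `create` parameters are hypothetical names. It reuses the `SPACE_ID` check from the diff and, as an added assumption, creates the target directory in both branches, since `~/Downloads` may not exist on every machine.

```python
import os
from datetime import datetime
from pathlib import Path


def resolve_export_path(now=None, env=None, create=True):
    """Return the markdown export path for the current environment.

    Hypothetical helper (not part of app.py). `env` defaults to os.environ
    and `now` to the current time; both can be injected for testing.
    """
    env = os.environ if env is None else env
    now = now or datetime.now()
    filename = f"gaia_results_{now.strftime('%Y%m%d_%H%M%S')}.md"

    if env.get("SPACE_ID"):
        # HF Spaces: write under ./exports relative to the working directory
        export_dir = Path(os.getcwd()) / "exports"
    else:
        # Local: write to the user's Downloads folder
        export_dir = Path(os.path.expanduser("~/Downloads"))

    if create:
        # Assumption beyond the diff: ensure the directory exists in both cases
        export_dir.mkdir(parents=True, exist_ok=True)

    return export_dir / filename
```

With a helper like this, the `with open(filepath, 'w')` block could stay as it is, and the Spaces and local branches could be exercised in tests by passing a fake `env` mapping.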
dev/{dev_260103_16_huggingface_integration.md → dev_260103_16_huggingface_llm_integration.md} RENAMED
@@ -1,4 +1,4 @@
- # [dev_260103_16] HuggingFace Inference API Integration
+ # [dev_260103_16] HuggingFace LLM API Integration

  **Date:** 2026-01-03
  **Type:** Development
@@ -27,7 +27,7 @@

  ## Key Decisions

- ### **Decision 1: HuggingFace Inference API over Ollama (local LLMs)**
+ ### **Decision 1: HuggingFace LLM API over Ollama (local LLMs)**

  **Why chosen:**

@@ -50,7 +50,7 @@

  - ✅ Excellent function calling capabilities (OpenAI-compatible tools format)
  - ✅ Strong reasoning performance (competitive with GPT-4 on benchmarks)
- - ✅ Free on HuggingFace Inference API
+ - ✅ Free on HuggingFace LLM API
  - ✅ 72B parameters - sufficient intelligence for GAIA tasks

  **Considered alternatives:**
@@ -102,7 +102,7 @@

  ## Outcome

- Successfully integrated HuggingFace Inference API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.
+ Successfully integrated HuggingFace LLM API as free LLM fallback tier, completing Stage 4 MVP with robust multi-tier resilience.

  **Deliverables:**

exports/gaia_results_20260104_005516.md ADDED
@@ -0,0 +1,35 @@
+ # GAIA Agent Evaluation Results
+
+ **Generated:** 2026-01-04 00:55:16
+
+ ## Submission Status
+
+ Submission Successful!
+ User: mangoobee
+ Overall Score: 0.0% (0/20 correct)
+ Message: Score calculated successfully: 0/20 total questions answered correctly (20 valid tasks attempted). Score did not improve previous record, leaderboard not updated.
+
+ ## Questions and Answers
+
+ | Task ID | Question | Submitted Answer |
+ |---------|----------|------------------|
+ | 8e867cd7-cff9-4e6c-867a-ff5ddc2550be | How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can ... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | a1e91b78-d3d8-4675-bb8d-62741b4b68a6 | In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird spec... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 2d83110e-a098-4ebb-9987-066c06fa42d0 | .rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | cca530fc-4052-43b2-b130-b30968d8aa44 | Review the chess position provided in the image. It is black's turn. Provide the correct next mov... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 4fc2f1ae-8625-45b5-ab34-ad4433bc21f8 | Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 6f37996b-2ac7-44b0-8e68-6d28256631b4 | Given this table defining * on the set S = {a, b, c, d, e} \|*\|a\|b\|c\|d\|e\| \|---\|---\|---\... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 9d191bce-651d-4746-be2d-7ef8ecadb9c2 | Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec. What does Teal'c say in respon... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | cabe07ed-9eca-40ea-8ead-410ef5e83f91 | What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry mate... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 3cef3a44-215e-4aed-8e3b-b1e3f08063b7 | I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler w... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3 | Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 305ac316-eef6-4446-960a-92d80d542f82 | Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | f918266a-b3e0-4914-865d-4faa564f1aef | What is the final numeric output from the attached Python code? | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 3f57289b-8c60-48be-bd80-01f8099ca449 | How many at bats did the Yankee with the most walks in the 1977 regular season have that same sea... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 1f975693-876d-457b-a649-393859e79bf3 | Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 840bfca7-4f7b-481a-8794-c560c340185d | On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This art... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | bda648d7-d618-4883-88f4-3466eabd860e | Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | cf106601-ab4f-4af9-b045-5295fe67b37d | What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | a0c07678-e491-4bbc-8f0b-07405144218f | Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 7bd855d8-463d-4ed5-93ca-5fe35145f733 | The attached Excel file contains the sales of menu items for a local fast-food chain. What were t... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 5a0c1adf-205e-4841-a666-7c3ef95def9d | What is the first name of the only Malko Competition recipient from the 20th Century (after 1977)... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
exports/gaia_results_20260104_005610.md ADDED
@@ -0,0 +1,35 @@
+ # GAIA Agent Evaluation Results
+
+ **Generated:** 2026-01-04 00:56:10
+
+ ## Submission Status
+
+ Submission Successful!
+ User: mangoobee
+ Overall Score: 0.0% (0/20 correct)
+ Message: Score calculated successfully: 0/20 total questions answered correctly (20 valid tasks attempted). Score did not improve previous record, leaderboard not updated.
+
+ ## Questions and Answers
+
+ | Task ID | Question | Submitted Answer |
+ |---------|----------|------------------|
+ | 8e867cd7-cff9-4e6c-867a-ff5ddc2550be | How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can ... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | a1e91b78-d3d8-4675-bb8d-62741b4b68a6 | In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird spec... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 2d83110e-a098-4ebb-9987-066c06fa42d0 | .rewsna eht sa "tfel" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | cca530fc-4052-43b2-b130-b30968d8aa44 | Review the chess position provided in the image. It is black's turn. Provide the correct next mov... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 4fc2f1ae-8625-45b5-ab34-ad4433bc21f8 | Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 6f37996b-2ac7-44b0-8e68-6d28256631b4 | Given this table defining * on the set S = {a, b, c, d, e} \|*\|a\|b\|c\|d\|e\| \|---\|---\|---\... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 9d191bce-651d-4746-be2d-7ef8ecadb9c2 | Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec. What does Teal'c say in respon... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | cabe07ed-9eca-40ea-8ead-410ef5e83f91 | What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry mate... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 3cef3a44-215e-4aed-8e3b-b1e3f08063b7 | I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler w... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3 | Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 305ac316-eef6-4446-960a-92d80d542f82 | Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play i... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | f918266a-b3e0-4914-865d-4faa564f1aef | What is the final numeric output from the attached Python code? | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 3f57289b-8c60-48be-bd80-01f8099ca449 | How many at bats did the Yankee with the most walks in the 1977 regular season have that same sea... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 1f975693-876d-457b-a649-393859e79bf3 | Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study fo... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 840bfca7-4f7b-481a-8794-c560c340185d | On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This art... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | bda648d7-d618-4883-88f4-3466eabd860e | Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | cf106601-ab4f-4af9-b045-5295fe67b37d | What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | a0c07678-e491-4bbc-8f0b-07405144218f | Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 7bd855d8-463d-4ed5-93ca-5fe35145f733 | The attached Excel file contains the sales of menu items for a local fast-food chain. What were t... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
+ | 5a0c1adf-205e-4841-a666-7c3ef95def9d | What is the first name of the only Malko Competition recipient from the 20th Century (after 1977)... | ERROR: No evidence collected. Details: Planning error: Exception: Planning failed with all LLMs. ... |
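
The exported tables escape pipe characters as `\|` and shorten long cells with a trailing `...`. The row for task `6f37996b-2ac7-44b0-8e68-6d28256631b4` ends in `\...`, which suggests the cell may have been truncated after escaping, cutting an escape sequence in half. The code that builds these rows is not shown in this commit, so the following is only a minimal sketch of one way to format a cell: `format_cell` and its `max_len` default of 100 are hypothetical, and truncating before escaping is a design choice of this sketch, not a description of `app.py`.

```python
def format_cell(text, max_len=100):
    """Prepare arbitrary text for use in a single markdown table cell.

    Hypothetical helper: the name and the max_len default are assumptions.
    """
    # Collapse newlines and repeated whitespace so the cell stays on one row
    flat = " ".join(str(text).split())

    # Truncate first, reserving three characters for the ellipsis
    if len(flat) > max_len:
        flat = flat[: max_len - 3] + "..."

    # Escape pipes last so a backslash is never separated from its pipe
    return flat.replace("|", "\\|")
```

Escaping after truncation means the rendered cell can be slightly longer than `max_len` when it contains pipes, which seems an acceptable trade for keeping every escape sequence intact.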