mangubee committed
Commit 9fb23b8 · 1 Parent(s): 0292109

Integrate benchmark dataset with results from HF as ground truth
CHANGELOG.md CHANGED
@@ -335,32 +335,111 @@
335
  - ✅ Correct answer parsing from submission response implemented
336
  - ⏳ Testing with real GAIA submission pending
337
 
338
- ### [BUGFIX: Useless "Correct?" Column Message - Remove When No Data]
339
 
340
- **Problem:** "Correct?" column shows "See summary: 2/20 correct" for every row when GAIA API doesn't provide per-question correctness data. This is useless and clutters the table.
341
 
342
- **Root Cause:** GAIA API response doesn't include per-question correctness in `result_data["results"]`, only summary stats (`correct_count`, `total_attempted`). Code fell through to else clause showing same message for all rows.
343
 
344
  **Modified Files:**
345
 
346
- - **app.py** (~5 lines modified)
347
- - Updated correct answer column logic (lines 406-410)
348
- - Removed fallback "See summary" message
349
- - Now only adds "Correct?" column if per-question correctness data available
350
- - If no per-question data, column is simply omitted from results table
351
 
352
  **Solution:**
353
 
354
- - When `correct_task_ids` is empty (no per-question data), don't add "Correct?" column at all
355
- - JSON export still includes `"correct": null` for proper data structure
356
- - User sees score summary in submission status message instead
 
357
 
358
  **Verification:**
359
 
360
- - ✅ No useless repetitive message in results table
361
- - ✅ Column only appears when API provides per-question correctness
362
- - Testing with real GAIA submission pending
363
 
364
  ### Created Files
365
 
366
  ### Deleted Files
 
335
  - ✅ Correct answer parsing from submission response implemented
336
  - ⏳ Testing with real GAIA submission pending
337
 
338
+ ### [BUGFIX: GAIA API Limitation - Per-Question Correctness Unavailable]
339
 
340
+ **Problem:** User reported "Correct?" column showing "null" in JSON export and missing from UI table. Investigation revealed GAIA API doesn't provide per-question correctness data.
341
 
342
+ **Root Cause:** GAIA API response structure only includes summary stats:
343
+
344
+ ```json
345
+ {
346
+ "username": "...",
347
+ "score": 5.0,
348
+ "correct_count": 1,
349
+ "total_attempted": 3,
350
+ "message": "...",
351
+ "timestamp": "..."
352
+ }
353
+ ```
354
+
355
+ No "results" array with per-question correctness exists. The API reports "1/3 correct" but not which specific questions were answered correctly.
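
A hedged illustration of the point above: the only correctness information the response carries is summary stats. The helper below is hypothetical (it is not in app.py) and assumes only the fields shown in the JSON example.

```python
# Hypothetical helper (not in app.py): format the only correctness info
# the GAIA API returns - summary stats, never per-question results.
def summarize_submission(result_data: dict) -> str:
    score = result_data.get("score", "N/A")
    correct = result_data.get("correct_count", "?")
    attempted = result_data.get("total_attempted", "?")
    return f"{score}% ({correct}/{attempted} correct)"

response = {"username": "...", "score": 5.0, "correct_count": 1,
            "total_attempted": 3, "message": "...", "timestamp": "..."}
print(summarize_submission(response))  # 5.0% (1/3 correct)
```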
356
 
357
  **Modified Files:**
358
 
359
+ - **.env** (~2 lines added)
360
+ - Added `DEBUG_QUESTION_LIMIT=3` - Limit questions for faster API response debugging (0 = process all)
361
+
362
+ - **app.py** (~40 lines modified)
363
+ - Removed useless `correct_task_ids` extraction logic (lines 452-457 deleted)
364
+ - Removed useless "Correct?" column addition logic (lines 460-465 deleted)
365
+ - Added clear comment documenting API limitation (lines 444-447)
366
+ - Updated `export_results_to_json()` - Removed extraction logic (lines 78-84 deleted)
367
+ - Simplified JSON export - Hardcoded `"correct": None` with explanatory comment (lines 106-107)
368
+ - Added `DEBUG_QUESTION_LIMIT` support for faster testing (lines 320-324)
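
A minimal sketch of the `DEBUG_QUESTION_LIMIT` behaviour described above, treating `0` as "process all questions"; the function name is illustrative, not the one used in app.py.

```python
import os

# Illustrative sketch: truncate the fetched question list when
# DEBUG_QUESTION_LIMIT is a positive integer; 0 (the default) keeps all.
def apply_debug_limit(questions: list) -> list:
    limit = int(os.getenv("DEBUG_QUESTION_LIMIT", "0"))
    return questions[:limit] if limit > 0 else questions

os.environ["DEBUG_QUESTION_LIMIT"] = "3"
print(len(apply_debug_limit(list(range(20)))))  # 3
```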
369
 
370
  **Solution:**
371
 
372
+ - UI table: No "Correct?" column (cleanly omitted, not showing useless data)
373
+ - JSON export: `"correct": null` for all questions (API doesn't provide this data)
374
+ - Metadata: Includes summary stats (`score_percent`, `correct_count`, `total_attempted`)
375
+ - User sees score summary in submission status message: "5.0% (1/3 correct)"
376
 
377
  **Verification:**
378
 
379
+ - ✅ Debug logging confirmed API response structure (no "results" field)
380
+ - ✅ Cleaned up ~30 lines of useless extraction code
381
+ - ✅ Clear comments document the limitation for future maintainers
382
+ - ✅ JSON export maintains data structure with explicit null values
383
+
384
+ ### [FEATURE: Ground Truth Comparison - GAIA Validation Dataset Integration]
385
+
386
+ **Problem:** GAIA API doesn't provide per-question correctness, making it impossible to debug which specific questions are failing. Need local ground truth comparison for development.
387
+
388
+ **Solution:** Integrate GAIA validation dataset from HuggingFace to compare submitted answers against ground truth locally.
389
+
390
+ **Modified Files:**
391
+
392
+ - **pyproject.toml / requirements.txt** (~2 packages added)
393
+ - Added `datasets>=4.4.2` - HuggingFace datasets library
394
+ - Added `huggingface-hub` - Dataset download and caching
395
+
396
+ - **src/utils/ground_truth.py** (NEW - ~120 lines)
397
+ - Created `GAIAGroundTruth` class - Loads validation dataset, provides ground truth answers
398
+ - `load_validation_set()` - Downloads GAIA validation set (2023_all split)
399
+ - `get_answer(task_id)` - Returns ground truth answer for a question
400
+ - `compare_answer(task_id, submitted_answer)` - Compares submitted vs ground truth (exact match)
401
+ - Singleton pattern with `get_ground_truth()` helper
402
+ - Caches dataset to `~/.cache/gaia_dataset` for fast reloading
403
+
404
+ - **src/utils/__init__.py** (NEW - ~7 lines)
405
+ - Package initialization for utils module
406
+
407
+ - **app.py** (~25 lines modified)
408
+ - Added import: `from src.utils.ground_truth import get_ground_truth` (line 15)
409
+ - Added ground truth loading after fetching questions (lines 357-362)
410
+ - Updated results collection to include ground truth comparison (lines 386-398)
411
+ - Calls `ground_truth.compare_answer()` for each result
412
+ - Adds "Correct?" column to results_log if ground truth available
413
+ - Shows "✅ Yes" or "❌ No" in UI table
414
+ - Updated JSON export to include ground truth correctness (lines 110-112)
415
+ - Converts "✅ Yes" → true, "❌ No" → false, missing → null
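
The marker-to-JSON conversion above can be expressed as a small pure function; this is a sketch (the real logic lives inline in `export_results_to_json`), and the function name is an assumption.

```python
# Sketch of the UI-marker -> JSON mapping used by the export:
# "✅ Yes" -> True, "❌ No" -> False, column absent -> None (null in JSON).
def correctness_from_row(row: dict):
    marker = row.get("Correct?")
    if marker == "✅ Yes":
        return True
    if marker == "❌ No":
        return False
    return None

print(correctness_from_row({"Correct?": "✅ Yes"}))  # True
```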
416
+
417
+ **Benefits:**
418
+
419
+ - ✅ **Local debugging:** See which specific questions are correct/incorrect without API dependency
420
+ - ⚠️ **Validation set only:** Works only on public validation questions (the test set's answers are private)
421
+ - ✅ **UI visibility:** "Correct?" column appears in results table when ground truth available
422
+ - ✅ **JSON export:** Per-question `"correct": true/false` for error analysis
423
+ - ✅ **Fast caching:** Dataset downloaded once, cached locally for reuse
424
+ - ✅ **Graceful fallback:** If dataset unavailable, system continues without ground truth
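
The lookup/compare API of `GAIAGroundTruth` described above can be sketched without the HuggingFace download. Names mirror the changelog, but the stand-in class and its whitespace-trimming normalization are assumptions; the real class lives in src/utils/ground_truth.py.

```python
# Minimal stand-in for GAIAGroundTruth's lookup/compare API. The real class
# downloads the GAIA validation split; here answers are injected directly.
class GroundTruthSketch:
    def __init__(self, answers: dict):
        self._answers = answers  # task_id -> "Final answer" string

    def get_answer(self, task_id: str):
        return self._answers.get(task_id)

    def compare_answer(self, task_id: str, submitted: str):
        truth = self.get_answer(task_id)
        if truth is None:
            return None  # no ground truth -> correctness unknown
        # Exact match after trimming whitespace (assumed normalization)
        return submitted.strip() == truth.strip()

gt = GroundTruthSketch({"task-1": "Paris"})
print(gt.compare_answer("task-1", "Paris "))  # True
```

Returning `None` (rather than `False`) for unknown task_ids is what lets the caller omit the "Correct?" column when no ground truth is available.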
425
+
426
+ **Dataset Structure:**
427
+
428
+ ```python
429
+ # GAIA validation dataset (2023_all split)
430
+ # Fields: task_id, Question, Level, Final answer, file_name, file_path, Annotator Metadata
431
+ # ~165 validation questions with ground truth answers
432
+ ```
433
+
434
+ **Verification:**
435
+
436
+ - ⏳ Testing with validation set questions pending
437
+ - ⏳ Verify exact match comparison works correctly
438
+ - ⏳ Check performance with dataset caching
439
 
440
  ### Created Files
441
 
442
+ - src/utils/ground_truth.py
443
+ - src/utils/__init__.py
444
+
445
  ### Deleted Files
app.py CHANGED
@@ -11,10 +11,12 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
11
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
12
  from src.agent import GAIAAgent
13
 
14
  # Configure logging
15
  logging.basicConfig(
16
- level=logging.INFO,
17
- format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
18
  )
19
  logger = logging.getLogger(__name__)
20
 
@@ -27,17 +29,27 @@ DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
27
  def check_api_keys():
28
  """Check which API keys are configured."""
29
  keys_status = {
30
- "GOOGLE_API_KEY (Gemini)": "✓ SET" if os.getenv("GOOGLE_API_KEY") else "✗ MISSING",
31
  "HF_TOKEN (HuggingFace)": "✓ SET" if os.getenv("HF_TOKEN") else "✗ MISSING",
32
- "ANTHROPIC_API_KEY (Claude)": "✓ SET" if os.getenv("ANTHROPIC_API_KEY") else "✗ MISSING",
33
- "TAVILY_API_KEY (Search)": "✓ SET" if os.getenv("TAVILY_API_KEY") else "✗ MISSING",
34
  "EXA_API_KEY (Search)": "✓ SET" if os.getenv("EXA_API_KEY") else "✗ MISSING",
35
  }
36
  return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
37
 
38
 
39
- def export_results_to_json(results_log: list, submission_status: str, execution_time: float = None,
40
- submission_response: dict = None) -> str:
41
  """Export evaluation results to JSON file for easy processing.
42
 
43
  - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.json
@@ -66,27 +78,21 @@ def export_results_to_json(results_log: list, submission_status: str, execution_
66
  downloads_dir = os.path.expanduser("~/Downloads")
67
  filepath = os.path.join(downloads_dir, filename)
68
 
69
- # Extract correctness info from submission response if available
70
- correct_task_ids = set()
71
- if submission_response and "results" in submission_response:
72
- # If API provides per-question results
73
- for item in submission_response.get("results", []):
74
- if item.get("correct"):
75
- correct_task_ids.add(item.get("task_id"))
76
-
77
  # Build JSON structure
78
  metadata = {
79
  "generated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
80
  "timestamp": timestamp,
81
- "total_questions": len(results_log)
82
  }
83
 
84
  # Add execution time if available
85
  if execution_time is not None:
86
  metadata["execution_time_seconds"] = round(execution_time, 2)
87
- metadata["execution_time_formatted"] = f"{int(execution_time // 60)}m {int(execution_time % 60)}s"
88
 
89
- # Add score info if available
90
  if submission_response:
91
  metadata["score_percent"] = submission_response.get("score")
92
  metadata["correct_count"] = submission_response.get("correct_count")
@@ -100,14 +106,17 @@ def export_results_to_json(results_log: list, submission_status: str, execution_
100
  "task_id": result.get("Task ID", "N/A"),
101
  "question": result.get("Question", "N/A"),
102
  "submitted_answer": result.get("Submitted Answer", "N/A"),
103
- "correct": result.get("Task ID") in correct_task_ids if correct_task_ids else None
104
  }
105
  for result in results_log
106
- ]
107
  }
108
 
109
  # Write JSON file with pretty formatting
110
- with open(filepath, 'w', encoding='utf-8') as f:
111
  json.dump(export_data, f, indent=2, ensure_ascii=False)
112
 
113
  logger.info(f"Results exported to: {filepath}")
@@ -122,39 +131,43 @@ def format_diagnostics(final_state: dict) -> str:
122
  diagnostics.append(f"**Question:** {final_state.get('question', 'N/A')}\n")
123
 
124
  # Plan
125
- plan = final_state.get('plan', 'No plan generated')
126
  diagnostics.append(f"**Plan:**\n{plan}\n")
127
 
128
  # Tool calls
129
- tool_calls = final_state.get('tool_calls', [])
130
  if tool_calls:
131
  diagnostics.append(f"**Tools Selected:** {len(tool_calls)} tool(s)")
132
  for idx, tc in enumerate(tool_calls, 1):
133
- tool_name = tc.get('tool', 'unknown')
134
- params = tc.get('params', {})
135
  diagnostics.append(f" {idx}. {tool_name}({params})")
136
  diagnostics.append("")
137
  else:
138
  diagnostics.append("**Tools Selected:** None\n")
139
 
140
  # Tool results
141
- tool_results = final_state.get('tool_results', [])
142
  if tool_results:
143
  diagnostics.append(f"**Tool Execution Results:** {len(tool_results)} result(s)")
144
  for idx, tr in enumerate(tool_results, 1):
145
- tool_name = tr.get('tool', 'unknown')
146
- status = tr.get('status', 'unknown')
147
- if status == 'success':
148
- result_preview = str(tr.get('result', ''))[:100] + "..." if len(str(tr.get('result', ''))) > 100 else str(tr.get('result', ''))
149
  diagnostics.append(f" {idx}. {tool_name}: ✓ SUCCESS")
150
  diagnostics.append(f" Result: {result_preview}")
151
  else:
152
- error = tr.get('error', 'Unknown error')
153
  diagnostics.append(f" {idx}. {tool_name}: ✗ FAILED - {error}")
154
  diagnostics.append("")
155
 
156
  # Evidence
157
- evidence = final_state.get('evidence', [])
158
  if evidence:
159
  diagnostics.append(f"**Evidence Collected:** {len(evidence)} item(s)")
160
  for idx, ev in enumerate(evidence, 1):
@@ -165,7 +178,7 @@ def format_diagnostics(final_state: dict) -> str:
165
  diagnostics.append("**Evidence Collected:** None\n")
166
 
167
  # Errors
168
- errors = final_state.get('errors', [])
169
  if errors:
170
  diagnostics.append(f"**Errors:** {len(errors)} error(s)")
171
  for idx, err in enumerate(errors, 1):
@@ -173,7 +186,7 @@ def format_diagnostics(final_state: dict) -> str:
173
  diagnostics.append("")
174
 
175
  # Answer
176
- answer = final_state.get('answer', 'No answer generated')
177
  diagnostics.append(f"**Final Answer:** {answer}")
178
 
179
  return "\n".join(diagnostics)
@@ -189,7 +202,9 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
189
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
190
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
191
 
192
- logger.info(f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}")
193
 
194
  # Initialize agent
195
  agent = GAIAAgent()
@@ -242,31 +257,33 @@ def process_single_question(agent, item, index, total):
242
  "task_id": task_id,
243
  "question": question_text,
244
  "answer": "ERROR: Missing task_id or question",
245
- "error": True
246
  }
247
 
248
  try:
249
- logger.info(f"[{index+1}/{total}] Processing {task_id[:8]}...")
250
  submitted_answer = agent(question_text)
251
- logger.info(f"[{index+1}/{total}] Completed {task_id[:8]}")
252
 
253
  return {
254
  "task_id": task_id,
255
  "question": question_text,
256
  "answer": submitted_answer,
257
- "error": False
258
  }
259
  except Exception as e:
260
- logger.error(f"[{index+1}/{total}] Error {task_id[:8]}: {e}")
261
  return {
262
  "task_id": task_id,
263
  "question": question_text,
264
  "answer": f"ERROR: {str(e)}",
265
- "error": True
266
  }
267
 
268
 
269
- def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAuthProfile | None = None):
270
  """
271
  Fetches all questions, runs the BasicAgent on them, submits all answers,
272
  and displays the results.
@@ -291,7 +308,9 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
291
  # Set LLM provider from UI selection (overrides .env)
292
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
293
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
294
- logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}")
295
 
296
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
297
  try:
@@ -315,7 +334,17 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
315
  if not questions_data:
316
  print("Fetched questions list is empty.")
317
  return "Fetched questions list is empty or invalid format.", None, ""
318
- print(f"Fetched {len(questions_data)} questions.")
319
  except requests.exceptions.RequestException as e:
320
  print(f"Error fetching questions: {e}")
321
  return f"Error fetching questions: {e}", None, ""
@@ -327,17 +356,28 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
327
  print(f"An unexpected error occurred fetching questions: {e}")
328
  return f"An unexpected error occurred fetching questions: {e}", None, ""
329
330
  # 3. Run your Agent (Stage 6: Concurrent processing)
331
  max_workers = int(os.getenv("MAX_CONCURRENT_WORKERS", "5"))
332
  results_log = []
333
  answers_payload = []
334
 
335
- logger.info(f"Running agent on {len(questions_data)} questions with {max_workers} workers...")
336
 
337
  with ThreadPoolExecutor(max_workers=max_workers) as executor:
338
  # Submit all questions for concurrent processing
339
  future_to_index = {
340
- executor.submit(process_single_question, agent, item, idx, len(questions_data)): idx
341
  for idx, item in enumerate(questions_data)
342
  }
343
 
@@ -345,29 +385,41 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
345
  for future in as_completed(future_to_index):
346
  result = future.result()
347
 
348
  # Add to results log
349
- results_log.append({
350
  "Task ID": result["task_id"],
351
  "Question": result["question"],
352
  "Submitted Answer": result["answer"],
353
- })
354
 
355
  # Add to submission payload if no error
356
  if not result["error"]:
357
- answers_payload.append({
358
- "task_id": result["task_id"],
359
- "submitted_answer": result["answer"]
360
- })
361
 
362
  # Log progress
363
- logger.info(f"Progress: {len(results_log)}/{len(questions_data)} questions processed")
364
 
365
  if not answers_payload:
366
  print("Agent did not produce any answers to submit.")
367
  status_message = "Agent did not produce any answers to submit."
368
  results_df = pd.DataFrame(results_log)
369
  execution_time = time.time() - start_time
370
- export_path = export_results_to_json(results_log, status_message, execution_time)
371
  return status_message, results_df, export_path
372
 
373
  # 4. Prepare Submission
@@ -385,6 +437,7 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
385
  response = requests.post(submit_url, json=submission_data, timeout=60)
386
  response.raise_for_status()
387
  result_data = response.json()
 
388
  final_status = (
389
  f"Submission Successful!\n"
390
  f"User: {result_data.get('username')}\n"
@@ -394,24 +447,20 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
394
  )
395
  print("Submission successful.")
396
  execution_time = time.time() - start_time
397
- logger.info(f"Total execution time: {execution_time:.2f} seconds ({int(execution_time // 60)}m {int(execution_time % 60)}s)")
398
-
399
- # Extract correct task_ids from result_data if available
400
- correct_task_ids = set()
401
- if "results" in result_data:
402
- for item in result_data.get("results", []):
403
- if item.get("correct"):
404
- correct_task_ids.add(item.get("task_id"))
405
 
406
- # Add "Correct?" column to results (only if we have per-question correctness data)
407
- if correct_task_ids:
408
- for result in results_log:
409
- task_id = result.get("Task ID")
410
- result["Correct?"] = "✅ Yes" if task_id in correct_task_ids else "❌ No"
411
 
412
  results_df = pd.DataFrame(results_log)
413
  # Export to JSON with execution time and submission response
414
- export_path = export_results_to_json(results_log, final_status, execution_time, result_data)
415
  return final_status, results_df, export_path
416
  except requests.exceptions.HTTPError as e:
417
  error_detail = f"Server responded with status {e.response.status_code}."
@@ -424,28 +473,36 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
424
  print(status_message)
425
  execution_time = time.time() - start_time
426
  results_df = pd.DataFrame(results_log)
427
- export_path = export_results_to_json(results_log, status_message, execution_time)
428
  return status_message, results_df, export_path
429
  except requests.exceptions.Timeout:
430
  status_message = "Submission Failed: The request timed out."
431
  print(status_message)
432
  execution_time = time.time() - start_time
433
  results_df = pd.DataFrame(results_log)
434
- export_path = export_results_to_json(results_log, status_message, execution_time)
435
  return status_message, results_df, export_path
436
  except requests.exceptions.RequestException as e:
437
  status_message = f"Submission Failed: Network error - {e}"
438
  print(status_message)
439
  execution_time = time.time() - start_time
440
  results_df = pd.DataFrame(results_log)
441
- export_path = export_results_to_json(results_log, status_message, execution_time)
442
  return status_message, results_df, export_path
443
  except Exception as e:
444
  status_message = f"An unexpected error occurred during submission: {e}"
445
  print(status_message)
446
  execution_time = time.time() - start_time
447
  results_df = pd.DataFrame(results_log)
448
- export_path = export_results_to_json(results_log, status_message, execution_time)
449
  return status_message, results_df, export_path
450
 
451
 
@@ -476,7 +533,7 @@ with gr.Blocks() as demo:
476
  test_question_input = gr.Textbox(
477
  label="Enter Test Question",
478
  placeholder="e.g., What is the capital of France?",
479
- lines=3
480
  )
481
 
482
  with gr.Row():
@@ -484,12 +541,12 @@ with gr.Blocks() as demo:
484
  label="LLM Provider",
485
  choices=["Gemini", "HuggingFace", "Groq", "Claude"],
486
  value="Groq",
487
- info="Select which LLM to use for this test"
488
  )
489
  enable_fallback_checkbox = gr.Checkbox(
490
  label="Enable Fallback",
491
  value=False,
492
- info="If enabled, falls back to other providers on failure"
493
  )
494
 
495
  test_button = gr.Button("Run Test", variant="primary")
@@ -497,26 +554,24 @@ with gr.Blocks() as demo:
497
  with gr.Row():
498
  with gr.Column(scale=1):
499
  test_answer_output = gr.Textbox(
500
- label="Answer",
501
- lines=3,
502
- interactive=False
503
  )
504
  test_api_status = gr.Textbox(
505
- label="API Keys Status",
506
- lines=5,
507
- interactive=False
508
  )
509
  with gr.Column(scale=2):
510
  test_diagnostics_output = gr.Textbox(
511
- label="Execution Diagnostics",
512
- lines=20,
513
- interactive=False
514
  )
515
 
516
  test_button.click(
517
  fn=test_single_question,
518
- inputs=[test_question_input, llm_provider_dropdown, enable_fallback_checkbox],
519
- outputs=[test_answer_output, test_diagnostics_output, test_api_status]
520
  )
521
 
522
  # Tab 2: Full Evaluation (existing functionality)
@@ -543,12 +598,12 @@ with gr.Blocks() as demo:
543
  label="LLM Provider for Evaluation",
544
  choices=["Gemini", "HuggingFace", "Groq", "Claude"],
545
  value="Groq",
546
- info="Select which LLM to use for all questions"
547
  )
548
  eval_enable_fallback_checkbox = gr.Checkbox(
549
  label="Enable Fallback",
550
  value=True,
551
- info="Recommended: Enable fallback for production evaluation"
552
  )
553
 
554
  run_button = gr.Button("Run Evaluation & Submit All Answers")
@@ -559,15 +614,12 @@ with gr.Blocks() as demo:
559
  # Removed max_rows=10 from DataFrame constructor
560
  results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
561
 
562
- export_output = gr.File(
563
- label="Download Results",
564
- type="filepath"
565
- )
566
 
567
  run_button.click(
568
  fn=run_and_submit_all,
569
  inputs=[eval_llm_provider_dropdown, eval_enable_fallback_checkbox],
570
- outputs=[status_output, results_table, export_output]
571
  )
572
 
573
  if __name__ == "__main__":
 
11
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
12
  from src.agent import GAIAAgent
13
 
14
+ # Import ground truth comparison
15
+ from src.utils.ground_truth import get_ground_truth
16
+
17
  # Configure logging
18
  logging.basicConfig(
19
+ level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
 
20
  )
21
  logger = logging.getLogger(__name__)
22
 
 
29
  def check_api_keys():
30
  """Check which API keys are configured."""
31
  keys_status = {
32
+ "GOOGLE_API_KEY (Gemini)": "✓ SET"
33
+ if os.getenv("GOOGLE_API_KEY")
34
+ else "✗ MISSING",
35
  "HF_TOKEN (HuggingFace)": "✓ SET" if os.getenv("HF_TOKEN") else "✗ MISSING",
36
+ "ANTHROPIC_API_KEY (Claude)": "✓ SET"
37
+ if os.getenv("ANTHROPIC_API_KEY")
38
+ else "✗ MISSING",
39
+ "TAVILY_API_KEY (Search)": "✓ SET"
40
+ if os.getenv("TAVILY_API_KEY")
41
+ else "✗ MISSING",
42
  "EXA_API_KEY (Search)": "✓ SET" if os.getenv("EXA_API_KEY") else "✗ MISSING",
43
  }
44
  return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
45
 
46
 
47
+ def export_results_to_json(
48
+ results_log: list,
49
+ submission_status: str,
50
+ execution_time: float = None,
51
+ submission_response: dict = None,
52
+ ) -> str:
53
  """Export evaluation results to JSON file for easy processing.
54
 
55
  - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.json
 
78
  downloads_dir = os.path.expanduser("~/Downloads")
79
  filepath = os.path.join(downloads_dir, filename)
80
 
81
  # Build JSON structure
82
  metadata = {
83
  "generated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
84
  "timestamp": timestamp,
85
+ "total_questions": len(results_log),
86
  }
87
 
88
  # Add execution time if available
89
  if execution_time is not None:
90
  metadata["execution_time_seconds"] = round(execution_time, 2)
91
+ metadata["execution_time_formatted"] = (
92
+ f"{int(execution_time // 60)}m {int(execution_time % 60)}s"
93
+ )
94
 
95
+ # Add score info if available (summary stats only - no per-question correctness)
96
  if submission_response:
97
  metadata["score_percent"] = submission_response.get("score")
98
  metadata["correct_count"] = submission_response.get("correct_count")
 
106
  "task_id": result.get("Task ID", "N/A"),
107
  "question": result.get("Question", "N/A"),
108
  "submitted_answer": result.get("Submitted Answer", "N/A"),
109
+ # Use ground truth comparison if available, otherwise null
110
+ "correct": True if result.get("Correct?") == "✅ Yes"
111
+ else False if result.get("Correct?") == "❌ No"
112
+ else None,
113
  }
114
  for result in results_log
115
+ ],
116
  }
117
 
118
  # Write JSON file with pretty formatting
119
+ with open(filepath, "w", encoding="utf-8") as f:
120
  json.dump(export_data, f, indent=2, ensure_ascii=False)
121
 
122
  logger.info(f"Results exported to: {filepath}")
 
131
  diagnostics.append(f"**Question:** {final_state.get('question', 'N/A')}\n")
132
 
133
  # Plan
134
+ plan = final_state.get("plan", "No plan generated")
135
  diagnostics.append(f"**Plan:**\n{plan}\n")
136
 
137
  # Tool calls
138
+ tool_calls = final_state.get("tool_calls", [])
139
  if tool_calls:
140
  diagnostics.append(f"**Tools Selected:** {len(tool_calls)} tool(s)")
141
  for idx, tc in enumerate(tool_calls, 1):
142
+ tool_name = tc.get("tool", "unknown")
143
+ params = tc.get("params", {})
144
  diagnostics.append(f" {idx}. {tool_name}({params})")
145
  diagnostics.append("")
146
  else:
147
  diagnostics.append("**Tools Selected:** None\n")
148
 
149
  # Tool results
150
+ tool_results = final_state.get("tool_results", [])
151
  if tool_results:
152
  diagnostics.append(f"**Tool Execution Results:** {len(tool_results)} result(s)")
153
  for idx, tr in enumerate(tool_results, 1):
154
+ tool_name = tr.get("tool", "unknown")
155
+ status = tr.get("status", "unknown")
156
+ if status == "success":
157
+ result_preview = (
158
+ str(tr.get("result", ""))[:100] + "..."
159
+ if len(str(tr.get("result", ""))) > 100
160
+ else str(tr.get("result", ""))
161
+ )
162
  diagnostics.append(f" {idx}. {tool_name}: ✓ SUCCESS")
163
  diagnostics.append(f" Result: {result_preview}")
164
  else:
165
+ error = tr.get("error", "Unknown error")
166
  diagnostics.append(f" {idx}. {tool_name}: ✗ FAILED - {error}")
167
  diagnostics.append("")
168
 
169
  # Evidence
170
+ evidence = final_state.get("evidence", [])
171
  if evidence:
172
  diagnostics.append(f"**Evidence Collected:** {len(evidence)} item(s)")
173
  for idx, ev in enumerate(evidence, 1):
 
178
  diagnostics.append("**Evidence Collected:** None\n")
179
 
180
  # Errors
181
+ errors = final_state.get("errors", [])
182
  if errors:
183
  diagnostics.append(f"**Errors:** {len(errors)} error(s)")
184
  for idx, err in enumerate(errors, 1):
 
186
  diagnostics.append("")
187
 
188
  # Answer
189
+ answer = final_state.get("answer", "No answer generated")
190
  diagnostics.append(f"**Final Answer:** {answer}")
191
 
192
  return "\n".join(diagnostics)
 
202
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
203
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
204
 
205
+ logger.info(
206
+ f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
207
+ )
208
 
209
  # Initialize agent
210
  agent = GAIAAgent()
 
257
  "task_id": task_id,
258
  "question": question_text,
259
  "answer": "ERROR: Missing task_id or question",
260
+ "error": True,
261
  }
262
 
263
  try:
264
+ logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
265
  submitted_answer = agent(question_text)
266
+ logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")
267
 
268
  return {
269
  "task_id": task_id,
270
  "question": question_text,
271
  "answer": submitted_answer,
272
+ "error": False,
273
  }
274
  except Exception as e:
275
+ logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
276
  return {
277
  "task_id": task_id,
278
  "question": question_text,
279
  "answer": f"ERROR: {str(e)}",
280
+ "error": True,
281
  }
282
 
283
 
284
+ def run_and_submit_all(
285
+ llm_provider: str, enable_fallback: bool, profile: gr.OAuthProfile | None = None
286
+ ):
287
  """
288
  Fetches all questions, runs the BasicAgent on them, submits all answers,
289
  and displays the results.
 
308
  # Set LLM provider from UI selection (overrides .env)
309
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
310
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
311
+ logger.info(
312
+ f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
313
+ )
314
 
315
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
316
  try:
 
334
  if not questions_data:
335
  print("Fetched questions list is empty.")
336
  return "Fetched questions list is empty or invalid format.", None, ""
337
+
338
+ # Apply debug limit if configured
339
+ debug_limit = int(os.getenv("DEBUG_QUESTION_LIMIT", "0"))
340
+ if debug_limit > 0:
341
+ questions_data = questions_data[:debug_limit]
342
+ logger.warning(f"DEBUG MODE: Limited to first {debug_limit} questions")
+ print(
+     f"DEBUG MODE: Processing only {debug_limit} questions (set DEBUG_QUESTION_LIMIT=0 to disable)"
+ )
+
+ print(f"Processing {len(questions_data)} questions.")
  except requests.exceptions.RequestException as e:
      print(f"Error fetching questions: {e}")
      return f"Error fetching questions: {e}", None, ""

      print(f"An unexpected error occurred fetching questions: {e}")
      return f"An unexpected error occurred fetching questions: {e}", None, ""

+ # 2.5. Load ground truth for local comparison (validation set only)
+ ground_truth = get_ground_truth()
+ if ground_truth.load_validation_set():
+     logger.info("Ground truth loaded - per-question correctness will be available")
+ else:
+     logger.warning("Ground truth not loaded - per-question correctness unavailable")
+
  # 3. Run your Agent (Stage 6: Concurrent processing)
  max_workers = int(os.getenv("MAX_CONCURRENT_WORKERS", "5"))
  results_log = []
  answers_payload = []

+ logger.info(
+     f"Running agent on {len(questions_data)} questions with {max_workers} workers..."
+ )

  with ThreadPoolExecutor(max_workers=max_workers) as executor:
      # Submit all questions for concurrent processing
      future_to_index = {
+         executor.submit(
+             process_single_question, agent, item, idx, len(questions_data)
+         ): idx
          for idx, item in enumerate(questions_data)
      }

      for future in as_completed(future_to_index):
          result = future.result()

+         # Compare with ground truth if available
+         is_correct = ground_truth.compare_answer(result["task_id"], result["answer"])
+
          # Add to results log
+         result_entry = {
              "Task ID": result["task_id"],
              "Question": result["question"],
              "Submitted Answer": result["answer"],
+         }
+
+         # Add "Correct?" column if ground truth available
+         if is_correct is not None:
+             result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"
+
+         results_log.append(result_entry)

          # Add to submission payload if no error
          if not result["error"]:
+             answers_payload.append(
+                 {"task_id": result["task_id"], "submitted_answer": result["answer"]}
+             )

          # Log progress
+         logger.info(
+             f"Progress: {len(results_log)}/{len(questions_data)} questions processed"
+         )

  if not answers_payload:
      print("Agent did not produce any answers to submit.")
      status_message = "Agent did not produce any answers to submit."
      results_df = pd.DataFrame(results_log)
      execution_time = time.time() - start_time
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path

  # 4. Prepare Submission

  response = requests.post(submit_url, json=submission_data, timeout=60)
  response.raise_for_status()
  result_data = response.json()
+
  final_status = (
      f"Submission Successful!\n"
      f"User: {result_data.get('username')}\n"

  )
  print("Submission successful.")
  execution_time = time.time() - start_time
+ logger.info(
+     f"Total execution time: {execution_time:.2f} seconds ({int(execution_time // 60)}m {int(execution_time % 60)}s)"
+ )
+ # NOTE: The GAIA API does NOT provide per-question correctness data.
+ # API response structure: {username, score, correct_count, total_attempted, message, timestamp}
+ # No "results" array exists - the API returns only summary stats, not which specific questions are correct.
+ # Per-question "Correct?" values in the UI table and the "correct" field in the JSON export
+ # therefore come from the local ground-truth comparison; when ground truth is unavailable,
+ # the column is omitted and the export shows "correct": null.
  results_df = pd.DataFrame(results_log)
  # Export to JSON with execution time and submission response
+ export_path = export_results_to_json(
+     results_log, final_status, execution_time, result_data
+ )
  return final_status, results_df, export_path
  except requests.exceptions.HTTPError as e:
      error_detail = f"Server responded with status {e.response.status_code}."

      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path
  except requests.exceptions.Timeout:
      status_message = "Submission Failed: The request timed out."
      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path
  except requests.exceptions.RequestException as e:
      status_message = f"Submission Failed: Network error - {e}"
      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path
  except Exception as e:
      status_message = f"An unexpected error occurred during submission: {e}"
      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path

  test_question_input = gr.Textbox(
      label="Enter Test Question",
      placeholder="e.g., What is the capital of France?",
+     lines=3,
  )

  with gr.Row():

      label="LLM Provider",
      choices=["Gemini", "HuggingFace", "Groq", "Claude"],
      value="Groq",
+     info="Select which LLM to use for this test",
  )
  enable_fallback_checkbox = gr.Checkbox(
      label="Enable Fallback",
      value=False,
+     info="If enabled, falls back to other providers on failure",
  )

  test_button = gr.Button("Run Test", variant="primary")

  with gr.Row():
      with gr.Column(scale=1):
          test_answer_output = gr.Textbox(
+             label="Answer", lines=3, interactive=False
          )
          test_api_status = gr.Textbox(
+             label="API Keys Status", lines=5, interactive=False
          )
      with gr.Column(scale=2):
          test_diagnostics_output = gr.Textbox(
+             label="Execution Diagnostics", lines=20, interactive=False
          )

  test_button.click(
      fn=test_single_question,
+     inputs=[
+         test_question_input,
+         llm_provider_dropdown,
+         enable_fallback_checkbox,
+     ],
+     outputs=[test_answer_output, test_diagnostics_output, test_api_status],
  )

  # Tab 2: Full Evaluation (existing functionality)

      label="LLM Provider for Evaluation",
      choices=["Gemini", "HuggingFace", "Groq", "Claude"],
      value="Groq",
+     info="Select which LLM to use for all questions",
  )
  eval_enable_fallback_checkbox = gr.Checkbox(
      label="Enable Fallback",
      value=True,
+     info="Recommended: Enable fallback for production evaluation",
  )

  run_button = gr.Button("Run Evaluation & Submit All Answers")

  # Removed max_rows=10 from DataFrame constructor
  results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)

+ export_output = gr.File(label="Download Results", type="filepath")

  run_button.click(
      fn=run_and_submit_all,
      inputs=[eval_llm_provider_dropdown, eval_enable_fallback_checkbox],
+     outputs=[status_output, results_table, export_output],
  )

  if __name__ == "__main__":
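The per-row "Correct?" handling added to `run_and_submit_all` can be sketched in isolation as follows. The task IDs and correctness values here are stubbed stand-ins, not real GAIA data; the real code gets `is_correct` from `ground_truth.compare_answer`:

```python
# Minimal sketch of the conditional "Correct?" column logic.
# Stubbed results and correctness lookup (illustrative only).
results = [
    {"task_id": "t1", "answer": "right"},
    {"task_id": "t2", "answer": "3"},
]
correctness = {"t1": True}  # "t2" has no ground-truth entry

results_log = []
for r in results:
    entry = {"Task ID": r["task_id"], "Submitted Answer": r["answer"]}
    is_correct = correctness.get(r["task_id"])  # None when unavailable
    if is_correct is not None:
        # The column is added only for rows where ground truth exists,
        # so the table never shows a filler value.
        entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"
    results_log.append(entry)

print(results_log)
```

Because the column is simply absent when `is_correct` is `None`, a run without ground truth produces a table with no "Correct?" column at all.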
exports/gaia_results_20260104_214534.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 21:45:34",
+     "timestamp": "20260104_214534",
+     "total_questions": 3,
+     "execution_time_seconds": 14.57,
+     "execution_time_formatted": "0m 14s",
+     "score_percent": 5.0,
+     "correct_count": 1,
+     "total_attempted": 3
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 5.0% (1/3 correct)\nMessage: Score calculated successfully: 1/20 total questions answered correctly (3 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "right",
+       "correct": null
+     }
+   ]
+ }
exports/gaia_results_20260104_220404.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 22:04:04",
+     "timestamp": "20260104_220404",
+     "total_questions": 3,
+     "execution_time_seconds": 21.65,
+     "execution_time_formatted": "0m 21s",
+     "score_percent": 0.0,
+     "correct_count": 0,
+     "total_attempted": 3
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 0.0% (0/3 correct)\nMessage: Score calculated successfully: 0/20 total questions answered correctly (3 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "Unable to answer",
+       "correct": false
+     },
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": false
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "満足感",
+       "correct": false
+     }
+   ]
+ }
exports/gaia_results_20260104_220718.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 22:07:18",
+     "timestamp": "20260104_220718",
+     "total_questions": 3,
+     "execution_time_seconds": 19.42,
+     "execution_time_formatted": "0m 19s",
+     "score_percent": 5.0,
+     "correct_count": 1,
+     "total_attempted": 3
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 5.0% (1/3 correct)\nMessage: Score calculated successfully: 1/20 total questions answered correctly (3 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "3",
+       "correct": true
+     },
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": false
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "Unable to answer",
+       "correct": false
+     }
+   ]
+ }
output/gaia_results_20260104_213555.json ADDED
@@ -0,0 +1,135 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 21:35:55",
+     "timestamp": "20260104_213555",
+     "total_questions": 20,
+     "execution_time_seconds": 47.08,
+     "execution_time_formatted": "0m 47s",
+     "score_percent": 5.0,
+     "correct_count": 1,
+     "total_attempted": 20
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 5.0% (1/20 correct)\nMessage: Score calculated successfully: 1/20 total questions answered correctly (20 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "right",
+       "correct": null
+     },
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "4fc2f1ae-8625-45b5-ab34-ad4433bc21f8",
+       "question": "Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?",
+       "submitted_answer": "Cas Liber",
+       "correct": null
+     },
+     {
+       "task_id": "cca530fc-4052-43b2-b130-b30968d8aa44",
+       "question": "Review the chess position provided in the image. It is black's turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "9d191bce-651d-4746-be2d-7ef8ecadb9c2",
+       "question": "Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec.\n\nWhat does Teal'c say in response to the question \"Isn't that hot?\"",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "6f37996b-2ac7-44b0-8e68-6d28256631b4",
+       "question": "Given this table defining * on the set S = {a, b, c, d, e}\n\n|*|a|b|c|d|e|\n|---|---|---|---|---|---|\n|a|a|b|c|b|d|\n|b|b|c|a|e|c|\n|c|c|a|b|b|a|\n|d|b|e|b|e|d|\n|e|d|b|a|d|c|\n\nprovide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: FileNotFoundError: Text file not found: table.csv",
+       "correct": null
+     },
+     {
+       "task_id": "99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3",
+       "question": "Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for the crust, but I'm not sure about the filling. I got the recipe from my friend Aditi, but she left it as a voice memo and the speaker on my phone is buzzing so I can't quite make out what she's saying. Could you please listen to the recipe and list all of the ingredients that my friend described? I only want the ingredients for the filling, as I have everything I need to make my favorite pie crust. I've attached the recipe as Strawberry pie.mp3.\n\nIn your response, please only list the ingredients, not any measurements. So if the recipe calls for \"a pinch of salt\" or \"two cups of ripe strawberries\" the ingredients on the list would be \"salt\" and \"ripe strawberries\".\n\nPlease format your response as a comma separated list of ingredients. Also, please alphabetize the ingredients.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: .mp3. Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "f918266a-b3e0-4914-865d-4faa564f1aef",
+       "question": "What is the final numeric output from the attached Python code?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: . Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "3cef3a44-215e-4aed-8e3b-b1e3f08063b7",
+       "question": "I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler when it comes to categorizing things. I need to add different foods to different categories on the grocery list, but if I make a mistake, she won't buy anything inserted in the wrong category. Here's the list I have so far:\n\nmilk, eggs, flour, whole bean coffee, Oreos, sweet potatoes, fresh basil, plums, green beans, rice, corn, bell pepper, whole allspice, acorns, broccoli, celery, zucchini, lettuce, peanuts\n\nI need to make headings for the fruits and vegetables. Could you please create a list of just the vegetables from my list? If you could do that, then I can figure out how to categorize the rest of the list into the appropriate categories. But remember that my mom is a real stickler, so make sure that no botanical fruits end up on the vegetable list, or she won't get them when she's at the store. Please alphabetize the list of vegetables, and place each item in a comma separated list.",
+       "submitted_answer": "acorns, bell pepper, broccoli, celery, green beans, lettuce, zucchini",
+       "correct": null
+     },
+     {
+       "task_id": "cabe07ed-9eca-40ea-8ead-410ef5e83f91",
+       "question": "What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "305ac316-eef6-4446-960a-92d80d542f82",
+       "question": "Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play in Magda M.? Give only the first name.",
+       "submitted_answer": "Bartłomiej",
+       "correct": null
+     },
+     {
+       "task_id": "1f975693-876d-457b-a649-393859e79bf3",
+       "question": "Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study for my Calculus mid-term next week. My friend from class sent me an audio recording of Professor Willowbrook giving out the recommended reading for the test, but my headphones are broken :(\n\nCould you please listen to the recording for me and tell me the page numbers I'm supposed to go over? I've attached a file called Homework.mp3 that has the recording. Please provide just the page numbers as a comma-delimited list. And please provide the list in ascending order.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: .mp3. Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "3f57289b-8c60-48be-bd80-01f8099ca449",
+       "question": "How many at bats did the Yankee with the most walks in the 1977 regular season have that same season?",
+       "submitted_answer": "589",
+       "correct": null
+     },
+     {
+       "task_id": "cf106601-ab4f-4af9-b045-5295fe67b37d",
+       "question": "What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a number of athletes, return the first in alphabetical order. Give the IOC country code as your answer.",
+       "submitted_answer": "CUB, MON",
+       "correct": null
+     },
+     {
+       "task_id": "7bd855d8-463d-4ed5-93ca-5fe35145f733",
+       "question": "The attached Excel file contains the sales of menu items for a local fast-food chain. What were the total sales that the chain made from food (not including drinks)? Express your answer in USD with two decimal places.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: . Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "bda648d7-d618-4883-88f4-3466eabd860e",
+       "question": "Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually deposited? Just give me the city name without abbreviations.",
+       "submitted_answer": "St. Petersburg",
+       "correct": null
+     },
+     {
+       "task_id": "840bfca7-4f7b-481a-8794-c560c340185d",
+       "question": "On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "5a0c1adf-205e-4841-a666-7c3ef95def9d",
+       "question": "What is the first name of the only Malko Competition recipient from the 20th Century (after 1977) whose nationality on record is a country that no longer exists?",
+       "submitted_answer": "Jan",
+       "correct": null
+     },
+     {
+       "task_id": "a0c07678-e491-4bbc-8f0b-07405144218f",
+       "question": "Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give them to me in the form Pitcher Before, Pitcher After, use their last names only, in Roman characters.",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     }
+   ]
+ }
requirements.txt CHANGED
@@ -59,3 +59,5 @@ python-dotenv>=1.0.0 # Environment variable management
  pydantic>=2.0.0 # Data validation (for StateGraph)
  typing-extensions>=4.12.0 # Type hints support
  tenacity>=8.2.0 # Retry logic with exponential backoff
+ datasets==4.4.2
+ huggingface-hub==1.2.3
src/utils/__init__.py ADDED
@@ -0,0 +1,8 @@
+ """Utility modules for GAIA agent.
+
+ Author: @mangobee
+ """
+
+ from .ground_truth import get_ground_truth, GAIAGroundTruth
+
+ __all__ = ["get_ground_truth", "GAIAGroundTruth"]
src/utils/ground_truth.py ADDED
@@ -0,0 +1,122 @@
+ """Ground truth comparison using GAIA validation dataset.
+
+ Author: @mangobee
+
+ Since the GAIA API only returns summary stats (X/Y correct) without per-question
+ correctness, we load the public validation dataset to compare our answers locally.
+ This enables per-question debugging and error analysis.
+ """
+
+ import os
+ import logging
+ from typing import Dict, Optional
+
+ logger = logging.getLogger(__name__)
+
+ # ============================================================================
+ # CONFIG
+ # ============================================================================
+ CACHE_DIR = os.path.expanduser("~/.cache/gaia_dataset")
+ # ============================================================================
+
+
+ class GAIAGroundTruth:
+     """Load GAIA validation dataset and provide ground truth answers."""
+
+     def __init__(self):
+         """Initialize ground truth loader."""
+         self.ground_truth: Dict[str, str] = {}
+         self._loaded = False
+
+     def load_validation_set(self) -> bool:
+         """Load GAIA validation dataset from HuggingFace.
+
+         Returns:
+             bool: True if loaded successfully, False otherwise
+         """
+         if self._loaded:
+             return True
+
+         try:
+             from datasets import load_dataset
+
+             logger.info("Loading GAIA validation dataset...")
+
+             # Load validation set (public answers)
+             # Using 2023_all which includes all levels
+             dataset = load_dataset(
+                 "gaia-benchmark/GAIA",
+                 "2023_all",
+                 split="validation",
+                 cache_dir=CACHE_DIR,
+             )
+
+             # Build task_id -> final_answer mapping
+             for item in dataset:
+                 task_id = item.get("task_id")
+                 final_answer = item.get("Final answer")
+
+                 if task_id and final_answer:
+                     self.ground_truth[task_id] = str(final_answer).strip()
+
+             self._loaded = True
+             logger.info(f"Loaded {len(self.ground_truth)} ground truth answers")
+             return True
+
+         except Exception as e:
+             logger.error(f"Failed to load GAIA dataset: {e}")
+             return False
+
+     def get_answer(self, task_id: str) -> Optional[str]:
+         """Get ground truth answer for a task_id.
+
+         Args:
+             task_id: Question task ID
+
+         Returns:
+             Ground truth answer or None if not found
+         """
+         if not self._loaded:
+             self.load_validation_set()
+
+         return self.ground_truth.get(task_id)
+
+     def compare_answer(self, task_id: str, submitted_answer: str) -> Optional[bool]:
+         """Compare submitted answer against ground truth.
+
+         Args:
+             task_id: Question task ID
+             submitted_answer: Answer submitted by agent
+
+         Returns:
+             True if correct, False if incorrect, None if no ground truth available
+         """
+         ground_truth = self.get_answer(task_id)
+
+         if ground_truth is None:
+             return None
+
+         # Normalize both answers for comparison
+         submitted = str(submitted_answer).strip().lower()
+         expected = str(ground_truth).strip().lower()
+
+         # Exact match comparison
+         return submitted == expected
+
+
+ # Singleton instance
+ _ground_truth_instance = None
+
+
+ def get_ground_truth() -> GAIAGroundTruth:
+     """Get or create singleton ground truth instance.
+
+     Returns:
+         GAIAGroundTruth instance
+     """
+     global _ground_truth_instance
+
+     if _ground_truth_instance is None:
+         _ground_truth_instance = GAIAGroundTruth()
+
+     return _ground_truth_instance