mangubee committed
Commit 9fb23b8 · 1 Parent(s): 0292109

Integrate benchmark dataset with results from HF as ground truth
CHANGELOG.md CHANGED
@@ -335,32 +335,111 @@
335
  - ✅ Correct answer parsing from submission response implemented
336
  - ⏳ Testing with real GAIA submission pending
337
 
338
- ### [BUGFIX: Useless "Correct?" Column Message - Remove When No Data]
339
 
340
- **Problem:** "Correct?" column shows "See summary: 2/20 correct" for every row when GAIA API doesn't provide per-question correctness data. This is useless and clutters the table.
341
 
342
- **Root Cause:** GAIA API response doesn't include per-question correctness in `result_data["results"]`, only summary stats (`correct_count`, `total_attempted`). Code fell through to else clause showing same message for all rows.
343
 
344
  **Modified Files:**
345
 
346
- - **app.py** (~5 lines modified)
347
- - Updated correct answer column logic (lines 406-410)
348
- - Removed fallback "See summary" message
349
- - Now only adds "Correct?" column if per-question correctness data available
350
- - If no per-question data, column is simply omitted from results table
351
 
352
  **Solution:**
353
 
354
- - When `correct_task_ids` is empty (no per-question data), don't add "Correct?" column at all
355
- - JSON export still includes `"correct": null` for proper data structure
356
- - User sees score summary in submission status message instead
 
357
 
358
  **Verification:**
359
 
360
- - ✅ No useless repetitive message in results table
361
- - ✅ Column only appears when API provides per-question correctness
362
- - Testing with real GAIA submission pending
363
 
364
  ### Created Files
365
 
366
  ### Deleted Files
 
335
  - ✅ Correct answer parsing from submission response implemented
336
  - ⏳ Testing with real GAIA submission pending
337
 
338
+ ### [BUGFIX: GAIA API Limitation - Per-Question Correctness Unavailable]
339
 
340
+ **Problem:** User reported "Correct?" column showing "null" in JSON export and missing from UI table. Investigation revealed GAIA API doesn't provide per-question correctness data.
341
 
342
+ **Root Cause:** GAIA API response structure only includes summary stats:
343
+
344
+ ```json
345
+ {
346
+ "username": "...",
347
+ "score": 5.0,
348
+ "correct_count": 1,
349
+ "total_attempted": 3,
350
+ "message": "...",
351
+ "timestamp": "..."
352
+ }
353
+ ```
354
+
355
+ No "results" array with per-question correctness exists. The API reports "1/3 correct" but not which specific questions were answered correctly.
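
A hedged illustration of the point above: the only correctness information the response carries is summary stats. The helper below is hypothetical (it is not in app.py) and assumes only the fields shown in the JSON example.

```python
# Hypothetical helper (not in app.py): format the only correctness info
# the GAIA API returns - summary stats, never per-question results.
def summarize_submission(result_data: dict) -> str:
    score = result_data.get("score", "N/A")
    correct = result_data.get("correct_count", "?")
    attempted = result_data.get("total_attempted", "?")
    return f"{score}% ({correct}/{attempted} correct)"

response = {"username": "...", "score": 5.0, "correct_count": 1,
            "total_attempted": 3, "message": "...", "timestamp": "..."}
print(summarize_submission(response))  # 5.0% (1/3 correct)
```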
356
 
357
  **Modified Files:**
358
 
359
+ - **.env** (~2 lines added)
360
+ - Added `DEBUG_QUESTION_LIMIT=3` - Limit questions for faster API response debugging (0 = process all)
361
+
362
+ - **app.py** (~40 lines modified)
363
+ - Removed useless `correct_task_ids` extraction logic (lines 452-457 deleted)
364
+ - Removed useless "Correct?" column addition logic (lines 460-465 deleted)
365
+ - Added clear comment documenting API limitation (lines 444-447)
366
+ - Updated `export_results_to_json()` - Removed extraction logic (lines 78-84 deleted)
367
+ - Simplified JSON export - Hardcoded `"correct": None` with explanatory comment (lines 106-107)
368
+ - Added `DEBUG_QUESTION_LIMIT` support for faster testing (lines 320-324)
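
A minimal sketch of the `DEBUG_QUESTION_LIMIT` behaviour described above, treating `0` as "process all questions"; the function name is illustrative, not the one used in app.py.

```python
import os

# Illustrative sketch: truncate the fetched question list when
# DEBUG_QUESTION_LIMIT is a positive integer; 0 (the default) keeps all.
def apply_debug_limit(questions: list) -> list:
    limit = int(os.getenv("DEBUG_QUESTION_LIMIT", "0"))
    return questions[:limit] if limit > 0 else questions

os.environ["DEBUG_QUESTION_LIMIT"] = "3"
print(len(apply_debug_limit(list(range(20)))))  # 3
```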
369
 
370
  **Solution:**
371
 
372
+ - UI table: No "Correct?" column (cleanly omitted, not showing useless data)
373
+ - JSON export: `"correct": null` for all questions (API doesn't provide this data)
374
+ - Metadata: Includes summary stats (`score_percent`, `correct_count`, `total_attempted`)
375
+ - User sees score summary in submission status message: "5.0% (1/3 correct)"
376
 
377
  **Verification:**
378
 
379
+ - ✅ Debug logging confirmed API response structure (no "results" field)
380
+ - ✅ Cleaned up ~30 lines of useless extraction code
381
+ - ✅ Clear comments document the limitation for future maintainers
382
+ - ✅ JSON export maintains data structure with explicit null values
383
+
384
+ ### [FEATURE: Ground Truth Comparison - GAIA Validation Dataset Integration]
385
+
386
+ **Problem:** GAIA API doesn't provide per-question correctness, making it impossible to debug which specific questions are failing. Need local ground truth comparison for development.
387
+
388
+ **Solution:** Integrate GAIA validation dataset from HuggingFace to compare submitted answers against ground truth locally.
389
+
390
+ **Modified Files:**
391
+
392
+ - **pyproject.toml / requirements.txt** (~2 packages added)
393
+ - Added `datasets>=4.4.2` - HuggingFace datasets library
394
+ - Added `huggingface-hub` - Dataset download and caching
395
+
396
+ - **src/utils/ground_truth.py** (NEW - ~120 lines)
397
+ - Created `GAIAGroundTruth` class - Loads validation dataset, provides ground truth answers
398
+ - `load_validation_set()` - Downloads GAIA validation set (2023_all split)
399
+ - `get_answer(task_id)` - Returns ground truth answer for a question
400
+ - `compare_answer(task_id, submitted_answer)` - Compares submitted vs ground truth (exact match)
401
+ - Singleton pattern with `get_ground_truth()` helper
402
+ - Caches dataset to `~/.cache/gaia_dataset` for fast reloading
403
+
404
+ - **src/utils/__init__.py** (NEW - ~7 lines)
405
+ - Package initialization for utils module
406
+
407
+ - **app.py** (~25 lines modified)
408
+ - Added import: `from src.utils.ground_truth import get_ground_truth` (line 15)
409
+ - Added ground truth loading after fetching questions (lines 357-362)
410
+ - Updated results collection to include ground truth comparison (lines 386-398)
411
+ - Calls `ground_truth.compare_answer()` for each result
412
+ - Adds "Correct?" column to results_log if ground truth available
413
+ - Shows "✅ Yes" or "❌ No" in UI table
414
+ - Updated JSON export to include ground truth correctness (lines 110-112)
415
+ - Converts "✅ Yes" → true, "❌ No" → false, missing → null
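
The marker-to-JSON conversion above can be expressed as a small pure function; this is a sketch (the real logic lives inline in `export_results_to_json`), and the function name is an assumption.

```python
# Sketch of the UI-marker -> JSON mapping used by the export:
# "✅ Yes" -> True, "❌ No" -> False, column absent -> None (null in JSON).
def correctness_from_row(row: dict):
    marker = row.get("Correct?")
    if marker == "✅ Yes":
        return True
    if marker == "❌ No":
        return False
    return None

print(correctness_from_row({"Correct?": "✅ Yes"}))  # True
```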
416
+
417
+ **Benefits:**
418
+
419
+ - ✅ **Local debugging:** See which specific questions are correct/incorrect without API dependency
420
+ - ⚠️ **Validation set only:** Works only on public validation questions (the test set's answers are private)
421
+ - ✅ **UI visibility:** "Correct?" column appears in results table when ground truth available
422
+ - ✅ **JSON export:** Per-question `"correct": true/false` for error analysis
423
+ - ✅ **Fast caching:** Dataset downloaded once, cached locally for reuse
424
+ - ✅ **Graceful fallback:** If dataset unavailable, system continues without ground truth
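
The lookup/compare API of `GAIAGroundTruth` described above can be sketched without the HuggingFace download. Names mirror the changelog, but the stand-in class and its whitespace-trimming normalization are assumptions; the real class lives in src/utils/ground_truth.py.

```python
# Minimal stand-in for GAIAGroundTruth's lookup/compare API. The real class
# downloads the GAIA validation split; here answers are injected directly.
class GroundTruthSketch:
    def __init__(self, answers: dict):
        self._answers = answers  # task_id -> "Final answer" string

    def get_answer(self, task_id: str):
        return self._answers.get(task_id)

    def compare_answer(self, task_id: str, submitted: str):
        truth = self.get_answer(task_id)
        if truth is None:
            return None  # no ground truth -> correctness unknown
        # Exact match after trimming whitespace (assumed normalization)
        return submitted.strip() == truth.strip()

gt = GroundTruthSketch({"task-1": "Paris"})
print(gt.compare_answer("task-1", "Paris "))  # True
```

Returning `None` (rather than `False`) for unknown task_ids is what lets the caller omit the "Correct?" column when no ground truth is available.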
425
+
426
+ **Dataset Structure:**
427
+
428
+ ```python
429
+ # GAIA validation dataset (2023_all split)
430
+ # Fields: task_id, Question, Level, Final answer, file_name, file_path, Annotator Metadata
431
+ # ~165 validation questions with ground truth answers
432
+ ```
433
+
434
+ **Verification:**
435
+
436
+ - ⏳ Testing with validation set questions pending
437
+ - ⏳ Verify exact match comparison works correctly
438
+ - ⏳ Check performance with dataset caching
439
 
440
  ### Created Files
441
 
442
+ - src/utils/ground_truth.py
443
+ - src/utils/__init__.py
444
+
445
  ### Deleted Files
app.py CHANGED
@@ -11,10 +11,12 @@ from concurrent.futures import ThreadPoolExecutor, as_completed
11
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
12
  from src.agent import GAIAAgent
13
 
14
  # Configure logging
15
  logging.basicConfig(
16
- level=logging.INFO,
17
- format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
18
  )
19
  logger = logging.getLogger(__name__)
20
 
@@ -27,17 +29,27 @@ DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
27
  def check_api_keys():
28
  """Check which API keys are configured."""
29
  keys_status = {
30
- "GOOGLE_API_KEY (Gemini)": "✓ SET" if os.getenv("GOOGLE_API_KEY") else "✗ MISSING",
31
  "HF_TOKEN (HuggingFace)": "✓ SET" if os.getenv("HF_TOKEN") else "✗ MISSING",
32
- "ANTHROPIC_API_KEY (Claude)": "✓ SET" if os.getenv("ANTHROPIC_API_KEY") else "✗ MISSING",
33
- "TAVILY_API_KEY (Search)": "✓ SET" if os.getenv("TAVILY_API_KEY") else "✗ MISSING",
34
  "EXA_API_KEY (Search)": "✓ SET" if os.getenv("EXA_API_KEY") else "✗ MISSING",
35
  }
36
  return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
37
 
38
 
39
- def export_results_to_json(results_log: list, submission_status: str, execution_time: float = None,
40
- submission_response: dict = None) -> str:
41
  """Export evaluation results to JSON file for easy processing.
42
 
43
  - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.json
@@ -66,27 +78,21 @@ def export_results_to_json(results_log: list, submission_status: str, execution_
66
  downloads_dir = os.path.expanduser("~/Downloads")
67
  filepath = os.path.join(downloads_dir, filename)
68
 
69
- # Extract correctness info from submission response if available
70
- correct_task_ids = set()
71
- if submission_response and "results" in submission_response:
72
- # If API provides per-question results
73
- for item in submission_response.get("results", []):
74
- if item.get("correct"):
75
- correct_task_ids.add(item.get("task_id"))
76
-
77
  # Build JSON structure
78
  metadata = {
79
  "generated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
80
  "timestamp": timestamp,
81
- "total_questions": len(results_log)
82
  }
83
 
84
  # Add execution time if available
85
  if execution_time is not None:
86
  metadata["execution_time_seconds"] = round(execution_time, 2)
87
- metadata["execution_time_formatted"] = f"{int(execution_time // 60)}m {int(execution_time % 60)}s"
88
 
89
- # Add score info if available
90
  if submission_response:
91
  metadata["score_percent"] = submission_response.get("score")
92
  metadata["correct_count"] = submission_response.get("correct_count")
@@ -100,14 +106,17 @@ def export_results_to_json(results_log: list, submission_status: str, execution_
100
  "task_id": result.get("Task ID", "N/A"),
101
  "question": result.get("Question", "N/A"),
102
  "submitted_answer": result.get("Submitted Answer", "N/A"),
103
- "correct": result.get("Task ID") in correct_task_ids if correct_task_ids else None
104
  }
105
  for result in results_log
106
- ]
107
  }
108
 
109
  # Write JSON file with pretty formatting
110
- with open(filepath, 'w', encoding='utf-8') as f:
111
  json.dump(export_data, f, indent=2, ensure_ascii=False)
112
 
113
  logger.info(f"Results exported to: {filepath}")
@@ -122,39 +131,43 @@ def format_diagnostics(final_state: dict) -> str:
122
  diagnostics.append(f"**Question:** {final_state.get('question', 'N/A')}\n")
123
 
124
  # Plan
125
- plan = final_state.get('plan', 'No plan generated')
126
  diagnostics.append(f"**Plan:**\n{plan}\n")
127
 
128
  # Tool calls
129
- tool_calls = final_state.get('tool_calls', [])
130
  if tool_calls:
131
  diagnostics.append(f"**Tools Selected:** {len(tool_calls)} tool(s)")
132
  for idx, tc in enumerate(tool_calls, 1):
133
- tool_name = tc.get('tool', 'unknown')
134
- params = tc.get('params', {})
135
  diagnostics.append(f" {idx}. {tool_name}({params})")
136
  diagnostics.append("")
137
  else:
138
  diagnostics.append("**Tools Selected:** None\n")
139
 
140
  # Tool results
141
- tool_results = final_state.get('tool_results', [])
142
  if tool_results:
143
  diagnostics.append(f"**Tool Execution Results:** {len(tool_results)} result(s)")
144
  for idx, tr in enumerate(tool_results, 1):
145
- tool_name = tr.get('tool', 'unknown')
146
- status = tr.get('status', 'unknown')
147
- if status == 'success':
148
- result_preview = str(tr.get('result', ''))[:100] + "..." if len(str(tr.get('result', ''))) > 100 else str(tr.get('result', ''))
149
  diagnostics.append(f" {idx}. {tool_name}: ✓ SUCCESS")
150
  diagnostics.append(f" Result: {result_preview}")
151
  else:
152
- error = tr.get('error', 'Unknown error')
153
  diagnostics.append(f" {idx}. {tool_name}: ✗ FAILED - {error}")
154
  diagnostics.append("")
155
 
156
  # Evidence
157
- evidence = final_state.get('evidence', [])
158
  if evidence:
159
  diagnostics.append(f"**Evidence Collected:** {len(evidence)} item(s)")
160
  for idx, ev in enumerate(evidence, 1):
@@ -165,7 +178,7 @@ def format_diagnostics(final_state: dict) -> str:
165
  diagnostics.append("**Evidence Collected:** None\n")
166
 
167
  # Errors
168
- errors = final_state.get('errors', [])
169
  if errors:
170
  diagnostics.append(f"**Errors:** {len(errors)} error(s)")
171
  for idx, err in enumerate(errors, 1):
@@ -173,7 +186,7 @@ def format_diagnostics(final_state: dict) -> str:
173
  diagnostics.append("")
174
 
175
  # Answer
176
- answer = final_state.get('answer', 'No answer generated')
177
  diagnostics.append(f"**Final Answer:** {answer}")
178
 
179
  return "\n".join(diagnostics)
@@ -189,7 +202,9 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
189
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
190
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
191
 
192
- logger.info(f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}")
193
 
194
  # Initialize agent
195
  agent = GAIAAgent()
@@ -242,31 +257,33 @@ def process_single_question(agent, item, index, total):
242
  "task_id": task_id,
243
  "question": question_text,
244
  "answer": "ERROR: Missing task_id or question",
245
- "error": True
246
  }
247
 
248
  try:
249
- logger.info(f"[{index+1}/{total}] Processing {task_id[:8]}...")
250
  submitted_answer = agent(question_text)
251
- logger.info(f"[{index+1}/{total}] Completed {task_id[:8]}")
252
 
253
  return {
254
  "task_id": task_id,
255
  "question": question_text,
256
  "answer": submitted_answer,
257
- "error": False
258
  }
259
  except Exception as e:
260
- logger.error(f"[{index+1}/{total}] Error {task_id[:8]}: {e}")
261
  return {
262
  "task_id": task_id,
263
  "question": question_text,
264
  "answer": f"ERROR: {str(e)}",
265
- "error": True
266
  }
267
 
268
 
269
- def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAuthProfile | None = None):
270
  """
271
  Fetches all questions, runs the BasicAgent on them, submits all answers,
272
  and displays the results.
@@ -291,7 +308,9 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
291
  # Set LLM provider from UI selection (overrides .env)
292
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
293
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
294
- logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}")
295
 
296
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
297
  try:
@@ -315,7 +334,17 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
315
  if not questions_data:
316
  print("Fetched questions list is empty.")
317
  return "Fetched questions list is empty or invalid format.", None, ""
318
- print(f"Fetched {len(questions_data)} questions.")
319
  except requests.exceptions.RequestException as e:
320
  print(f"Error fetching questions: {e}")
321
  return f"Error fetching questions: {e}", None, ""
@@ -327,17 +356,28 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
327
  print(f"An unexpected error occurred fetching questions: {e}")
328
  return f"An unexpected error occurred fetching questions: {e}", None, ""
329
330
  # 3. Run your Agent (Stage 6: Concurrent processing)
331
  max_workers = int(os.getenv("MAX_CONCURRENT_WORKERS", "5"))
332
  results_log = []
333
  answers_payload = []
334
 
335
- logger.info(f"Running agent on {len(questions_data)} questions with {max_workers} workers...")
336
 
337
  with ThreadPoolExecutor(max_workers=max_workers) as executor:
338
  # Submit all questions for concurrent processing
339
  future_to_index = {
340
- executor.submit(process_single_question, agent, item, idx, len(questions_data)): idx
341
  for idx, item in enumerate(questions_data)
342
  }
343
 
@@ -345,29 +385,41 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
345
  for future in as_completed(future_to_index):
346
  result = future.result()
347
 
348
  # Add to results log
349
- results_log.append({
350
  "Task ID": result["task_id"],
351
  "Question": result["question"],
352
  "Submitted Answer": result["answer"],
353
- })
354
 
355
  # Add to submission payload if no error
356
  if not result["error"]:
357
- answers_payload.append({
358
- "task_id": result["task_id"],
359
- "submitted_answer": result["answer"]
360
- })
361
 
362
  # Log progress
363
- logger.info(f"Progress: {len(results_log)}/{len(questions_data)} questions processed")
364
 
365
  if not answers_payload:
366
  print("Agent did not produce any answers to submit.")
367
  status_message = "Agent did not produce any answers to submit."
368
  results_df = pd.DataFrame(results_log)
369
  execution_time = time.time() - start_time
370
- export_path = export_results_to_json(results_log, status_message, execution_time)
371
  return status_message, results_df, export_path
372
 
373
  # 4. Prepare Submission
@@ -385,6 +437,7 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
385
  response = requests.post(submit_url, json=submission_data, timeout=60)
386
  response.raise_for_status()
387
  result_data = response.json()
 
388
  final_status = (
389
  f"Submission Successful!\n"
390
  f"User: {result_data.get('username')}\n"
@@ -394,24 +447,20 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
394
  )
395
  print("Submission successful.")
396
  execution_time = time.time() - start_time
397
- logger.info(f"Total execution time: {execution_time:.2f} seconds ({int(execution_time // 60)}m {int(execution_time % 60)}s)")
398
-
399
- # Extract correct task_ids from result_data if available
400
- correct_task_ids = set()
401
- if "results" in result_data:
402
- for item in result_data.get("results", []):
403
- if item.get("correct"):
404
- correct_task_ids.add(item.get("task_id"))
405
 
406
- # Add "Correct?" column to results (only if we have per-question correctness data)
407
- if correct_task_ids:
408
- for result in results_log:
409
- task_id = result.get("Task ID")
410
- result["Correct?"] = "✅ Yes" if task_id in correct_task_ids else "❌ No"
411
 
412
  results_df = pd.DataFrame(results_log)
413
  # Export to JSON with execution time and submission response
414
- export_path = export_results_to_json(results_log, final_status, execution_time, result_data)
415
  return final_status, results_df, export_path
416
  except requests.exceptions.HTTPError as e:
417
  error_detail = f"Server responded with status {e.response.status_code}."
@@ -424,28 +473,36 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
424
  print(status_message)
425
  execution_time = time.time() - start_time
426
  results_df = pd.DataFrame(results_log)
427
- export_path = export_results_to_json(results_log, status_message, execution_time)
428
  return status_message, results_df, export_path
429
  except requests.exceptions.Timeout:
430
  status_message = "Submission Failed: The request timed out."
431
  print(status_message)
432
  execution_time = time.time() - start_time
433
  results_df = pd.DataFrame(results_log)
434
- export_path = export_results_to_json(results_log, status_message, execution_time)
435
  return status_message, results_df, export_path
436
  except requests.exceptions.RequestException as e:
437
  status_message = f"Submission Failed: Network error - {e}"
438
  print(status_message)
439
  execution_time = time.time() - start_time
440
  results_df = pd.DataFrame(results_log)
441
- export_path = export_results_to_json(results_log, status_message, execution_time)
442
  return status_message, results_df, export_path
443
  except Exception as e:
444
  status_message = f"An unexpected error occurred during submission: {e}"
445
  print(status_message)
446
  execution_time = time.time() - start_time
447
  results_df = pd.DataFrame(results_log)
448
- export_path = export_results_to_json(results_log, status_message, execution_time)
449
  return status_message, results_df, export_path
450
 
451
 
@@ -476,7 +533,7 @@ with gr.Blocks() as demo:
476
  test_question_input = gr.Textbox(
477
  label="Enter Test Question",
478
  placeholder="e.g., What is the capital of France?",
479
- lines=3
480
  )
481
 
482
  with gr.Row():
@@ -484,12 +541,12 @@ with gr.Blocks() as demo:
484
  label="LLM Provider",
485
  choices=["Gemini", "HuggingFace", "Groq", "Claude"],
486
  value="Groq",
487
- info="Select which LLM to use for this test"
488
  )
489
  enable_fallback_checkbox = gr.Checkbox(
490
  label="Enable Fallback",
491
  value=False,
492
- info="If enabled, falls back to other providers on failure"
493
  )
494
 
495
  test_button = gr.Button("Run Test", variant="primary")
@@ -497,26 +554,24 @@ with gr.Blocks() as demo:
497
  with gr.Row():
498
  with gr.Column(scale=1):
499
  test_answer_output = gr.Textbox(
500
- label="Answer",
501
- lines=3,
502
- interactive=False
503
  )
504
  test_api_status = gr.Textbox(
505
- label="API Keys Status",
506
- lines=5,
507
- interactive=False
508
  )
509
  with gr.Column(scale=2):
510
  test_diagnostics_output = gr.Textbox(
511
- label="Execution Diagnostics",
512
- lines=20,
513
- interactive=False
514
  )
515
 
516
  test_button.click(
517
  fn=test_single_question,
518
- inputs=[test_question_input, llm_provider_dropdown, enable_fallback_checkbox],
519
- outputs=[test_answer_output, test_diagnostics_output, test_api_status]
520
  )
521
 
522
  # Tab 2: Full Evaluation (existing functionality)
@@ -543,12 +598,12 @@ with gr.Blocks() as demo:
543
  label="LLM Provider for Evaluation",
544
  choices=["Gemini", "HuggingFace", "Groq", "Claude"],
545
  value="Groq",
546
- info="Select which LLM to use for all questions"
547
  )
548
  eval_enable_fallback_checkbox = gr.Checkbox(
549
  label="Enable Fallback",
550
  value=True,
551
- info="Recommended: Enable fallback for production evaluation"
552
  )
553
 
554
  run_button = gr.Button("Run Evaluation & Submit All Answers")
@@ -559,15 +614,12 @@ with gr.Blocks() as demo:
559
  # Removed max_rows=10 from DataFrame constructor
560
  results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
561
 
562
- export_output = gr.File(
563
- label="Download Results",
564
- type="filepath"
565
- )
566
 
567
  run_button.click(
568
  fn=run_and_submit_all,
569
  inputs=[eval_llm_provider_dropdown, eval_enable_fallback_checkbox],
570
- outputs=[status_output, results_table, export_output]
571
  )
572
 
573
  if __name__ == "__main__":
 
11
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
12
  from src.agent import GAIAAgent
13
 
14
+ # Import ground truth comparison
15
+ from src.utils.ground_truth import get_ground_truth
16
+
17
  # Configure logging
18
  logging.basicConfig(
19
+ level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
 
20
  )
21
  logger = logging.getLogger(__name__)
22
 
 
29
  def check_api_keys():
30
  """Check which API keys are configured."""
31
  keys_status = {
32
+ "GOOGLE_API_KEY (Gemini)": "✓ SET"
33
+ if os.getenv("GOOGLE_API_KEY")
34
+ else "✗ MISSING",
35
  "HF_TOKEN (HuggingFace)": "✓ SET" if os.getenv("HF_TOKEN") else "✗ MISSING",
36
+ "ANTHROPIC_API_KEY (Claude)": "✓ SET"
37
+ if os.getenv("ANTHROPIC_API_KEY")
38
+ else "✗ MISSING",
39
+ "TAVILY_API_KEY (Search)": "✓ SET"
40
+ if os.getenv("TAVILY_API_KEY")
41
+ else "✗ MISSING",
42
  "EXA_API_KEY (Search)": "✓ SET" if os.getenv("EXA_API_KEY") else "✗ MISSING",
43
  }
44
  return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
45
 
46
 
47
+ def export_results_to_json(
48
+ results_log: list,
49
+ submission_status: str,
50
+ execution_time: float = None,
51
+ submission_response: dict = None,
52
+ ) -> str:
53
  """Export evaluation results to JSON file for easy processing.
54
 
55
  - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.json
 
78
  downloads_dir = os.path.expanduser("~/Downloads")
79
  filepath = os.path.join(downloads_dir, filename)
80
 
81
  # Build JSON structure
82
  metadata = {
83
  "generated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
84
  "timestamp": timestamp,
85
+ "total_questions": len(results_log),
86
  }
87
 
88
  # Add execution time if available
89
  if execution_time is not None:
90
  metadata["execution_time_seconds"] = round(execution_time, 2)
91
+ metadata["execution_time_formatted"] = (
92
+ f"{int(execution_time // 60)}m {int(execution_time % 60)}s"
93
+ )
94
 
95
+ # Add score info if available (summary stats only - no per-question correctness)
96
  if submission_response:
97
  metadata["score_percent"] = submission_response.get("score")
98
  metadata["correct_count"] = submission_response.get("correct_count")
 
106
  "task_id": result.get("Task ID", "N/A"),
107
  "question": result.get("Question", "N/A"),
108
  "submitted_answer": result.get("Submitted Answer", "N/A"),
109
+ # Use ground truth comparison if available, otherwise null
110
+ "correct": True if result.get("Correct?") == "✅ Yes"
111
+ else False if result.get("Correct?") == "❌ No"
112
+ else None,
113
  }
114
  for result in results_log
115
+ ],
116
  }
117
 
118
  # Write JSON file with pretty formatting
119
+ with open(filepath, "w", encoding="utf-8") as f:
120
  json.dump(export_data, f, indent=2, ensure_ascii=False)
121
 
122
  logger.info(f"Results exported to: {filepath}")
 
131
  diagnostics.append(f"**Question:** {final_state.get('question', 'N/A')}\n")
132
 
133
  # Plan
134
+ plan = final_state.get("plan", "No plan generated")
135
  diagnostics.append(f"**Plan:**\n{plan}\n")
136
 
137
  # Tool calls
138
+ tool_calls = final_state.get("tool_calls", [])
139
  if tool_calls:
140
  diagnostics.append(f"**Tools Selected:** {len(tool_calls)} tool(s)")
141
  for idx, tc in enumerate(tool_calls, 1):
142
+ tool_name = tc.get("tool", "unknown")
143
+ params = tc.get("params", {})
144
  diagnostics.append(f" {idx}. {tool_name}({params})")
145
  diagnostics.append("")
146
  else:
147
  diagnostics.append("**Tools Selected:** None\n")
148
 
149
  # Tool results
150
+ tool_results = final_state.get("tool_results", [])
151
  if tool_results:
152
  diagnostics.append(f"**Tool Execution Results:** {len(tool_results)} result(s)")
153
  for idx, tr in enumerate(tool_results, 1):
154
+ tool_name = tr.get("tool", "unknown")
155
+ status = tr.get("status", "unknown")
156
+ if status == "success":
157
+ result_preview = (
158
+ str(tr.get("result", ""))[:100] + "..."
159
+ if len(str(tr.get("result", ""))) > 100
160
+ else str(tr.get("result", ""))
161
+ )
162
  diagnostics.append(f" {idx}. {tool_name}: ✓ SUCCESS")
163
  diagnostics.append(f" Result: {result_preview}")
164
  else:
165
+ error = tr.get("error", "Unknown error")
166
  diagnostics.append(f" {idx}. {tool_name}: ✗ FAILED - {error}")
167
  diagnostics.append("")
168
 
169
  # Evidence
170
+ evidence = final_state.get("evidence", [])
171
  if evidence:
172
  diagnostics.append(f"**Evidence Collected:** {len(evidence)} item(s)")
173
  for idx, ev in enumerate(evidence, 1):
 
178
  diagnostics.append("**Evidence Collected:** None\n")
179
 
180
  # Errors
181
+ errors = final_state.get("errors", [])
182
  if errors:
183
  diagnostics.append(f"**Errors:** {len(errors)} error(s)")
184
  for idx, err in enumerate(errors, 1):
 
186
  diagnostics.append("")
187
 
188
  # Answer
189
+ answer = final_state.get("answer", "No answer generated")
190
  diagnostics.append(f"**Final Answer:** {answer}")
191
 
192
  return "\n".join(diagnostics)
 
202
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
203
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
204
 
205
+ logger.info(
206
+ f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
207
+ )
208
 
209
  # Initialize agent
210
  agent = GAIAAgent()
 
257
  "task_id": task_id,
258
  "question": question_text,
259
  "answer": "ERROR: Missing task_id or question",
260
+ "error": True,
261
  }
262
 
263
  try:
264
+ logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
265
  submitted_answer = agent(question_text)
266
+ logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")
267
 
268
  return {
269
  "task_id": task_id,
270
  "question": question_text,
271
  "answer": submitted_answer,
272
+ "error": False,
273
  }
274
  except Exception as e:
275
+ logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
276
  return {
277
  "task_id": task_id,
278
  "question": question_text,
279
  "answer": f"ERROR: {str(e)}",
280
+ "error": True,
281
  }
282
 
283
 
284
+ def run_and_submit_all(
285
+ llm_provider: str, enable_fallback: bool, profile: gr.OAuthProfile | None = None
286
+ ):
287
  """
288
  Fetches all questions, runs the BasicAgent on them, submits all answers,
289
  and displays the results.
 
308
  # Set LLM provider from UI selection (overrides .env)
309
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
310
  os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
311
+ logger.info(
312
+ f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
313
+ )
314
 
315
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
316
  try:
 
334
  if not questions_data:
335
  print("Fetched questions list is empty.")
336
  return "Fetched questions list is empty or invalid format.", None, ""
337
+
338
+ # Apply debug limit if configured
339
+ debug_limit = int(os.getenv("DEBUG_QUESTION_LIMIT", "0"))
340
+ if debug_limit > 0:
341
+ questions_data = questions_data[:debug_limit]
342
+ logger.warning(f"DEBUG MODE: Limited to first {debug_limit} questions")
+ print(
+     f"DEBUG MODE: Processing only {debug_limit} questions (set DEBUG_QUESTION_LIMIT=0 to disable)"
+ )
+
+ print(f"Processing {len(questions_data)} questions.")
  except requests.exceptions.RequestException as e:
      print(f"Error fetching questions: {e}")
      return f"Error fetching questions: {e}", None, ""

      print(f"An unexpected error occurred fetching questions: {e}")
      return f"An unexpected error occurred fetching questions: {e}", None, ""

+ # 2.5. Load ground truth for local comparison (validation set only)
+ ground_truth = get_ground_truth()
+ if ground_truth.load_validation_set():
+     logger.info("Ground truth loaded - per-question correctness will be available")
+ else:
+     logger.warning("Ground truth not loaded - per-question correctness unavailable")
+
  # 3. Run your Agent (Stage 6: Concurrent processing)
  max_workers = int(os.getenv("MAX_CONCURRENT_WORKERS", "5"))
  results_log = []
  answers_payload = []

+ logger.info(
+     f"Running agent on {len(questions_data)} questions with {max_workers} workers..."
+ )

  with ThreadPoolExecutor(max_workers=max_workers) as executor:
      # Submit all questions for concurrent processing
      future_to_index = {
+         executor.submit(
+             process_single_question, agent, item, idx, len(questions_data)
+         ): idx
          for idx, item in enumerate(questions_data)
      }

      for future in as_completed(future_to_index):
          result = future.result()

+         # Compare with ground truth if available
+         is_correct = ground_truth.compare_answer(result["task_id"], result["answer"])
+
          # Add to results log
+         result_entry = {
              "Task ID": result["task_id"],
              "Question": result["question"],
              "Submitted Answer": result["answer"],
+         }
+
+         # Add "Correct?" column if ground truth available
+         if is_correct is not None:
+             result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"
+
+         results_log.append(result_entry)

          # Add to submission payload if no error
          if not result["error"]:
+             answers_payload.append(
+                 {"task_id": result["task_id"], "submitted_answer": result["answer"]}
+             )

          # Log progress
+         logger.info(
+             f"Progress: {len(results_log)}/{len(questions_data)} questions processed"
+         )

  if not answers_payload:
      print("Agent did not produce any answers to submit.")
      status_message = "Agent did not produce any answers to submit."
      results_df = pd.DataFrame(results_log)
      execution_time = time.time() - start_time
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path

  # 4. Prepare Submission

  response = requests.post(submit_url, json=submission_data, timeout=60)
  response.raise_for_status()
  result_data = response.json()
+
  final_status = (
      f"Submission Successful!\n"
      f"User: {result_data.get('username')}\n"

  )
  print("Submission successful.")
  execution_time = time.time() - start_time
+ logger.info(
+     f"Total execution time: {execution_time:.2f} seconds ({int(execution_time // 60)}m {int(execution_time % 60)}s)"
+ )
+ # NOTE: The GAIA API does NOT provide per-question correctness data.
+ # API response structure: {username, score, correct_count, total_attempted, message, timestamp}
+ # No "results" array exists - the API returns only summary stats, not which specific questions are correct.
+ # Per-question "Correct?" values in the UI table and the "correct" field in the JSON export
+ # therefore come from the local ground-truth comparison; when ground truth is unavailable,
+ # the column is omitted and the export shows "correct": null.
  results_df = pd.DataFrame(results_log)
  # Export to JSON with execution time and submission response
+ export_path = export_results_to_json(
+     results_log, final_status, execution_time, result_data
+ )
  return final_status, results_df, export_path
  except requests.exceptions.HTTPError as e:
      error_detail = f"Server responded with status {e.response.status_code}."

      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path
  except requests.exceptions.Timeout:
      status_message = "Submission Failed: The request timed out."
      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path
  except requests.exceptions.RequestException as e:
      status_message = f"Submission Failed: Network error - {e}"
      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path
  except Exception as e:
      status_message = f"An unexpected error occurred during submission: {e}"
      print(status_message)
      execution_time = time.time() - start_time
      results_df = pd.DataFrame(results_log)
+     export_path = export_results_to_json(
+         results_log, status_message, execution_time
+     )
      return status_message, results_df, export_path

  test_question_input = gr.Textbox(
      label="Enter Test Question",
      placeholder="e.g., What is the capital of France?",
+     lines=3,
  )

  with gr.Row():

      label="LLM Provider",
      choices=["Gemini", "HuggingFace", "Groq", "Claude"],
      value="Groq",
+     info="Select which LLM to use for this test",
  )
  enable_fallback_checkbox = gr.Checkbox(
      label="Enable Fallback",
      value=False,
+     info="If enabled, falls back to other providers on failure",
  )

  test_button = gr.Button("Run Test", variant="primary")

  with gr.Row():
      with gr.Column(scale=1):
          test_answer_output = gr.Textbox(
+             label="Answer", lines=3, interactive=False
          )
          test_api_status = gr.Textbox(
+             label="API Keys Status", lines=5, interactive=False
          )
      with gr.Column(scale=2):
          test_diagnostics_output = gr.Textbox(
+             label="Execution Diagnostics", lines=20, interactive=False
          )

  test_button.click(
      fn=test_single_question,
+     inputs=[
+         test_question_input,
+         llm_provider_dropdown,
+         enable_fallback_checkbox,
+     ],
+     outputs=[test_answer_output, test_diagnostics_output, test_api_status],
  )

  # Tab 2: Full Evaluation (existing functionality)

      label="LLM Provider for Evaluation",
      choices=["Gemini", "HuggingFace", "Groq", "Claude"],
      value="Groq",
+     info="Select which LLM to use for all questions",
  )
  eval_enable_fallback_checkbox = gr.Checkbox(
      label="Enable Fallback",
      value=True,
+     info="Recommended: Enable fallback for production evaluation",
  )

  run_button = gr.Button("Run Evaluation & Submit All Answers")

  # Removed max_rows=10 from DataFrame constructor
  results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)

+ export_output = gr.File(label="Download Results", type="filepath")

  run_button.click(
      fn=run_and_submit_all,
      inputs=[eval_llm_provider_dropdown, eval_enable_fallback_checkbox],
+     outputs=[status_output, results_table, export_output],
  )

  if __name__ == "__main__":
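The per-row "Correct?" handling added to `run_and_submit_all` can be sketched in isolation as follows. The task IDs and correctness values here are stubbed stand-ins, not real GAIA data; the real code gets `is_correct` from `ground_truth.compare_answer`:

```python
# Minimal sketch of the conditional "Correct?" column logic.
# Stubbed results and correctness lookup (illustrative only).
results = [
    {"task_id": "t1", "answer": "right"},
    {"task_id": "t2", "answer": "3"},
]
correctness = {"t1": True}  # "t2" has no ground-truth entry

results_log = []
for r in results:
    entry = {"Task ID": r["task_id"], "Submitted Answer": r["answer"]}
    is_correct = correctness.get(r["task_id"])  # None when unavailable
    if is_correct is not None:
        # The column is added only for rows where ground truth exists,
        # so the table never shows a filler value.
        entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"
    results_log.append(entry)

print(results_log)
```

Because the column is simply absent when `is_correct` is `None`, a run without ground truth produces a table with no "Correct?" column at all.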
exports/gaia_results_20260104_214534.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 21:45:34",
+     "timestamp": "20260104_214534",
+     "total_questions": 3,
+     "execution_time_seconds": 14.57,
+     "execution_time_formatted": "0m 14s",
+     "score_percent": 5.0,
+     "correct_count": 1,
+     "total_attempted": 3
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 5.0% (1/3 correct)\nMessage: Score calculated successfully: 1/20 total questions answered correctly (3 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "right",
+       "correct": null
+     }
+   ]
+ }
exports/gaia_results_20260104_220404.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 22:04:04",
+     "timestamp": "20260104_220404",
+     "total_questions": 3,
+     "execution_time_seconds": 21.65,
+     "execution_time_formatted": "0m 21s",
+     "score_percent": 0.0,
+     "correct_count": 0,
+     "total_attempted": 3
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 0.0% (0/3 correct)\nMessage: Score calculated successfully: 0/20 total questions answered correctly (3 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "Unable to answer",
+       "correct": false
+     },
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": false
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "満足感",
+       "correct": false
+     }
+   ]
+ }
exports/gaia_results_20260104_220718.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 22:07:18",
+     "timestamp": "20260104_220718",
+     "total_questions": 3,
+     "execution_time_seconds": 19.42,
+     "execution_time_formatted": "0m 19s",
+     "score_percent": 5.0,
+     "correct_count": 1,
+     "total_attempted": 3
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 5.0% (1/3 correct)\nMessage: Score calculated successfully: 1/20 total questions answered correctly (3 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "3",
+       "correct": true
+     },
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": false
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "Unable to answer",
+       "correct": false
+     }
+   ]
+ }
output/gaia_results_20260104_213555.json ADDED
@@ -0,0 +1,135 @@
+ {
+   "metadata": {
+     "generated": "2026-01-04 21:35:55",
+     "timestamp": "20260104_213555",
+     "total_questions": 20,
+     "execution_time_seconds": 47.08,
+     "execution_time_formatted": "0m 47s",
+     "score_percent": 5.0,
+     "correct_count": 1,
+     "total_attempted": 20
+   },
+   "submission_status": "Submission Successful!\nUser: mangoobee\nOverall Score: 5.0% (1/20 correct)\nMessage: Score calculated successfully: 1/20 total questions answered correctly (20 valid tasks attempted). Score did not improve previous record, leaderboard not updated.",
+   "results": [
+     {
+       "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
+       "question": "In the video https://www.youtube.com/watch?v=L1vXCYZAYYM, what is the highest number of bird species to be on camera simultaneously?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+       "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+       "submitted_answer": "right",
+       "correct": null
+     },
+     {
+       "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
+       "question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009 (included)? You can use the latest 2022 version of english wikipedia.",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "4fc2f1ae-8625-45b5-ab34-ad4433bc21f8",
+       "question": "Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?",
+       "submitted_answer": "Cas Liber",
+       "correct": null
+     },
+     {
+       "task_id": "cca530fc-4052-43b2-b130-b30968d8aa44",
+       "question": "Review the chess position provided in the image. It is black's turn. Provide the correct next move for black which guarantees a win. Please provide your response in algebraic notation.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "9d191bce-651d-4746-be2d-7ef8ecadb9c2",
+       "question": "Examine the video at https://www.youtube.com/watch?v=1htKBjuUWec.\n\nWhat does Teal'c say in response to the question \"Isn't that hot?\"",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool vision failed: Exception: Vision analysis failed - Gemini and Claude both failed",
+       "correct": null
+     },
+     {
+       "task_id": "6f37996b-2ac7-44b0-8e68-6d28256631b4",
+       "question": "Given this table defining * on the set S = {a, b, c, d, e}\n\n|*|a|b|c|d|e|\n|---|---|---|---|---|---|\n|a|a|b|c|b|d|\n|b|b|c|a|e|c|\n|c|c|a|b|b|a|\n|d|b|e|b|e|d|\n|e|d|b|a|d|c|\n\nprovide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: FileNotFoundError: Text file not found: table.csv",
+       "correct": null
+     },
+     {
+       "task_id": "99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3",
+       "question": "Hi, I'm making a pie but I could use some help with my shopping list. I have everything I need for the crust, but I'm not sure about the filling. I got the recipe from my friend Aditi, but she left it as a voice memo and the speaker on my phone is buzzing so I can't quite make out what she's saying. Could you please listen to the recipe and list all of the ingredients that my friend described? I only want the ingredients for the filling, as I have everything I need to make my favorite pie crust. I've attached the recipe as Strawberry pie.mp3.\n\nIn your response, please only list the ingredients, not any measurements. So if the recipe calls for \"a pinch of salt\" or \"two cups of ripe strawberries\" the ingredients on the list would be \"salt\" and \"ripe strawberries\".\n\nPlease format your response as a comma separated list of ingredients. Also, please alphabetize the ingredients.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: .mp3. Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "f918266a-b3e0-4914-865d-4faa564f1aef",
+       "question": "What is the final numeric output from the attached Python code?",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: . Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "3cef3a44-215e-4aed-8e3b-b1e3f08063b7",
+       "question": "I'm making a grocery list for my mom, but she's a professor of botany and she's a real stickler when it comes to categorizing things. I need to add different foods to different categories on the grocery list, but if I make a mistake, she won't buy anything inserted in the wrong category. Here's the list I have so far:\n\nmilk, eggs, flour, whole bean coffee, Oreos, sweet potatoes, fresh basil, plums, green beans, rice, corn, bell pepper, whole allspice, acorns, broccoli, celery, zucchini, lettuce, peanuts\n\nI need to make headings for the fruits and vegetables. Could you please create a list of just the vegetables from my list? If you could do that, then I can figure out how to categorize the rest of the list into the appropriate categories. But remember that my mom is a real stickler, so make sure that no botanical fruits end up on the vegetable list, or she won't get them when she's at the store. Please alphabetize the list of vegetables, and place each item in a comma separated list.",
+       "submitted_answer": "acorns, bell pepper, broccoli, celery, green beans, lettuce, zucchini",
+       "correct": null
+     },
+     {
+       "task_id": "cabe07ed-9eca-40ea-8ead-410ef5e83f91",
+       "question": "What is the surname of the equine veterinarian mentioned in 1.E Exercises from the chemistry materials licensed by Marisa Alviar-Agnew & Henry Agnew under the CK-12 license in LibreText's Introductory Chemistry materials as compiled 08/21/2023?",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "305ac316-eef6-4446-960a-92d80d542f82",
+       "question": "Who did the actor who played Ray in the Polish-language version of Everybody Loves Raymond play in Magda M.? Give only the first name.",
+       "submitted_answer": "Bartłomiej",
+       "correct": null
+     },
+     {
+       "task_id": "1f975693-876d-457b-a649-393859e79bf3",
+       "question": "Hi, I was out sick from my classes on Friday, so I'm trying to figure out what I need to study for my Calculus mid-term next week. My friend from class sent me an audio recording of Professor Willowbrook giving out the recommended reading for the test, but my headphones are broken :(\n\nCould you please listen to the recording for me and tell me the page numbers I'm supposed to go over? I've attached a file called Homework.mp3 that has the recording. Please provide just the page numbers as a comma-delimited list. And please provide the list in ascending order.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: .mp3. Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "3f57289b-8c60-48be-bd80-01f8099ca449",
+       "question": "How many at bats did the Yankee with the most walks in the 1977 regular season have that same season?",
+       "submitted_answer": "589",
+       "correct": null
+     },
+     {
+       "task_id": "cf106601-ab4f-4af9-b045-5295fe67b37d",
+       "question": "What country had the least number of athletes at the 1928 Summer Olympics? If there's a tie for a number of athletes, return the first in alphabetical order. Give the IOC country code as your answer.",
+       "submitted_answer": "CUB, MON",
+       "correct": null
+     },
+     {
+       "task_id": "7bd855d8-463d-4ed5-93ca-5fe35145f733",
+       "question": "The attached Excel file contains the sales of menu items for a local fast-food chain. What were the total sales that the chain made from food (not including drinks)? Express your answer in USD with two decimal places.",
+       "submitted_answer": "ERROR: No evidence collected. Details: Tool parse_file failed: ValueError: Unsupported file type: . Supported: .pdf, .xlsx, .xls, .docx, .txt, .csv",
+       "correct": null
+     },
+     {
+       "task_id": "bda648d7-d618-4883-88f4-3466eabd860e",
+       "question": "Where were the Vietnamese specimens described by Kuznetzov in Nedoshivina's 2010 paper eventually deposited? Just give me the city name without abbreviations.",
+       "submitted_answer": "St. Petersburg",
+       "correct": null
+     },
+     {
+       "task_id": "840bfca7-4f7b-481a-8794-c560c340185d",
+       "question": "On June 6, 2023, an article by Carolyn Collins Petersen was published in Universe Today. This article mentions a team that produced a paper about their observations, linked at the bottom of the article. Find this paper. Under what NASA award number was the work performed by R. G. Arendt supported by?",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     },
+     {
+       "task_id": "5a0c1adf-205e-4841-a666-7c3ef95def9d",
+       "question": "What is the first name of the only Malko Competition recipient from the 20th Century (after 1977) whose nationality on record is a country that no longer exists?",
+       "submitted_answer": "Jan",
+       "correct": null
+     },
+     {
+       "task_id": "a0c07678-e491-4bbc-8f0b-07405144218f",
+       "question": "Who are the pitchers with the number before and after Taishō Tamai's number as of July 2023? Give them to me in the form Pitcher Before, Pitcher After, use their last names only, in Roman characters.",
+       "submitted_answer": "Unable to answer",
+       "correct": null
+     }
+   ]
+ }
requirements.txt CHANGED
@@ -59,3 +59,5 @@ python-dotenv>=1.0.0 # Environment variable management
  pydantic>=2.0.0 # Data validation (for StateGraph)
  typing-extensions>=4.12.0 # Type hints support
  tenacity>=8.2.0 # Retry logic with exponential backoff
+ datasets==4.4.2
+ huggingface-hub==1.2.3
src/utils/__init__.py ADDED
@@ -0,0 +1,8 @@
+ """Utility modules for GAIA agent.
+
+ Author: @mangobee
+ """
+
+ from .ground_truth import get_ground_truth, GAIAGroundTruth
+
+ __all__ = ["get_ground_truth", "GAIAGroundTruth"]
src/utils/ground_truth.py ADDED
@@ -0,0 +1,122 @@
+ """Ground truth comparison using GAIA validation dataset.
+
+ Author: @mangobee
+
+ Since the GAIA API only returns summary stats (X/Y correct) without per-question
+ correctness, we load the public validation dataset to compare our answers locally.
+ This enables per-question debugging and error analysis.
+ """
+
+ import os
+ import logging
+ from typing import Dict, Optional
+
+ logger = logging.getLogger(__name__)
+
+ # ============================================================================
+ # CONFIG
+ # ============================================================================
+ CACHE_DIR = os.path.expanduser("~/.cache/gaia_dataset")
+ # ============================================================================
+
+
+ class GAIAGroundTruth:
+     """Load GAIA validation dataset and provide ground truth answers."""
+
+     def __init__(self):
+         """Initialize ground truth loader."""
+         self.ground_truth: Dict[str, str] = {}
+         self._loaded = False
+
+     def load_validation_set(self) -> bool:
+         """Load GAIA validation dataset from HuggingFace.
+
+         Returns:
+             bool: True if loaded successfully, False otherwise
+         """
+         if self._loaded:
+             return True
+
+         try:
+             from datasets import load_dataset
+
+             logger.info("Loading GAIA validation dataset...")
+
+             # Load validation set (public answers)
+             # Using 2023_all which includes all levels
+             dataset = load_dataset(
+                 "gaia-benchmark/GAIA",
+                 "2023_all",
+                 split="validation",
+                 cache_dir=CACHE_DIR,
+             )
+
+             # Build task_id -> final_answer mapping
+             for item in dataset:
+                 task_id = item.get("task_id")
+                 final_answer = item.get("Final answer")
+
+                 if task_id and final_answer:
+                     self.ground_truth[task_id] = str(final_answer).strip()
+
+             self._loaded = True
+             logger.info(f"Loaded {len(self.ground_truth)} ground truth answers")
+             return True
+
+         except Exception as e:
+             logger.error(f"Failed to load GAIA dataset: {e}")
+             return False
+
+     def get_answer(self, task_id: str) -> Optional[str]:
+         """Get ground truth answer for a task_id.
+
+         Args:
+             task_id: Question task ID
+
+         Returns:
+             Ground truth answer or None if not found
+         """
+         if not self._loaded:
+             self.load_validation_set()
+
+         return self.ground_truth.get(task_id)
+
+     def compare_answer(self, task_id: str, submitted_answer: str) -> Optional[bool]:
+         """Compare submitted answer against ground truth.
+
+         Args:
+             task_id: Question task ID
+             submitted_answer: Answer submitted by agent
+
+         Returns:
+             True if correct, False if incorrect, None if no ground truth available
+         """
+         ground_truth = self.get_answer(task_id)
+
+         if ground_truth is None:
+             return None
+
+         # Normalize both answers for comparison
+         submitted = str(submitted_answer).strip().lower()
+         expected = str(ground_truth).strip().lower()
+
+         # Exact match comparison
+         return submitted == expected
+
+
+ # Singleton instance
+ _ground_truth_instance = None
+
+
+ def get_ground_truth() -> GAIAGroundTruth:
+     """Get or create singleton ground truth instance.
+
+     Returns:
+         GAIAGroundTruth instance
+     """
+     global _ground_truth_instance
+
+     if _ground_truth_instance is None:
+         _ground_truth_instance = GAIAGroundTruth()
+
+     return _ground_truth_instance