agentbee

mangubee committed 22 days ago

Commit

ff5bca5

1 Parent(s): 94965d6

Update JSON export with execution time and correct flags

Files changed (2)

CHANGELOG.md +67 -0
app.py +71 -14

CHANGELOG.md CHANGED

@@ -261,6 +261,73 @@
 - ✅ Concurrent execution maintains error isolation
 - ⏳ Local testing with 3 questions pending
+### [PROBLEM: Evaluation Metadata Tracking - Execution Time and Correct Answers]
+**Problem:** There was no execution time tracking to verify the async performance improvement, and the JSON export did not show which questions were answered correctly, which made error analysis difficult.
+**Modified Files:**
+- **app.py** (~60 lines added/modified)
+  - Added `import time` (line 8) - For execution timing
+  - Updated `export_results_to_json()` function signature (lines 38-113)
+    - Added `execution_time` parameter (optional float)
+    - Added `submission_response` parameter (optional dict with GAIA API response)
+    - Extracts correct task_ids from `submission_response["results"]` if available
+    - Adds execution time to metadata: `execution_time_seconds` and `execution_time_formatted` (Xm Ys)
+    - Adds score info to metadata: `score_percent`, `correct_count`, `total_attempted`
+    - Adds `"correct": true/false/null` flag to each result entry (`null` when no correct task_ids could be extracted from the submission response)
+  - Updated `run_and_submit_all()` timing tracking (lines 274-435)
+    - Added `start_time = time.time()` at function start (line 275)
+    - Added `execution_time = time.time() - start_time` before all returns
+    - Logs execution time: "Total execution time: X.XX seconds (Xm Ys)" (line 397)
+    - Updated all 6 `export_results_to_json()` calls to pass `execution_time`
+    - Successful submission: passes both `execution_time` and `result_data` (line 417)
+  - Added correct answer column to results display (lines 399-413)
+    - Extracts correct task_ids from `result_data["results"]` if available
+    - Adds "Correct?" column to `results_log` with "✅ Yes" or "❌ No"
+    - Falls back to summary message if per-question data unavailable
+**Benefits:**
+- ✅ **Performance verification:** Track actual execution time to confirm the async speedup (expected 60-80s vs. the previous 240s)
+- ✅ **Correct answer identification:** JSON export shows which questions were answered correctly
+- ✅ **Error analysis:** Easy to identify patterns in incorrect answers for debugging
+- ✅ **Progress tracking:** Execution time metadata enables historical performance comparison
+- ✅ **User visibility:** Results table shows "Correct?" column with clear visual indicators (✅/❌)
+**JSON Export Format:**
+```json
+{
+  "metadata": {
+    "generated": "2026-01-04 18:30:00",
+    "timestamp": "20260104_183000",
+    "total_questions": 20,
+    "execution_time_seconds": 78.45,
+    "execution_time_formatted": "1m 18s",
+    "score_percent": 20.0,
+    "correct_count": 4,
+    "total_attempted": 20
+  },
+  "results": [
+    {
+      "task_id": "abc123",
+      "question": "...",
+      "submitted_answer": "...",
+      "correct": true
+    }
+  ]
+}
+```
+**Verification:**
+- ✅ No syntax errors in app.py
+- ✅ Execution time tracking added at function start and all return points
+- ✅ All export_results_to_json calls updated with new parameters
+- ✅ Correct answer parsing from submission response implemented
+- ⏳ Testing with real GAIA submission pending
 ### Created Files
 ### Deleted Files
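
The export format described in the changelog entry lends itself to quick error analysis. The following is a minimal sketch, not part of the commit, of how such a file could be loaded and split by the `correct` flag. The helper name `summarize_export` is hypothetical, and the sketch assumes only the keys shown in the example JSON (`metadata`, `results`, `task_id`, `correct`, `execution_time_seconds`).

```python
import json
from pathlib import Path


def summarize_export(path):
    """Group task_ids from an exported results file by their `correct` flag.

    Hypothetical helper: assumes the layout shown in the changelog example.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    summary = {"correct": [], "incorrect": [], "unknown": []}
    for entry in data.get("results", []):
        flag = entry.get("correct")
        if flag is True:
            summary["correct"].append(entry.get("task_id"))
        elif flag is False:
            summary["incorrect"].append(entry.get("task_id"))
        else:
            # null or missing: no per-question correctness data was exported
            summary["unknown"].append(entry.get("task_id"))
    # Carry the timing metadata along so runs can be compared over time
    metadata = data.get("metadata", {})
    summary["execution_time_seconds"] = metadata.get("execution_time_seconds")
    return summary
```

Keeping `unknown` separate from `incorrect` avoids counting entries as wrong when the export simply had no per-question data.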

app.py CHANGED

@@ -5,6 +5,7 @@ import inspect
 import pandas as pd
 import logging
 import json
 from concurrent.futures import ThreadPoolExecutor, as_completed
 # Stage 1: Import GAIAAgent (LangGraph-based agent)
@@ -35,12 +36,19 @@ def check_api_keys():
     return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
-def export_results_to_json(results_log: list, submission_status: str) -> str:
     """Export evaluation results to JSON file for easy processing.
     - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.json
     - HF Spaces: Saves to ./exports/gaia_results_TIMESTAMP.json
     - Format: Clean JSON with full error messages, no truncation
     """
     from datetime import datetime
@@ -58,19 +66,41 @@ def export_results_to_json(results_log: list, submission_status: str) -> str:
         downloads_dir = os.path.expanduser("~/Downloads")
         filepath = os.path.join(downloads_dir, filename)
     # Build JSON structure
     export_data = {
-        "metadata": {
-            "generated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
-            "timestamp": timestamp,
-            "total_questions": len(results_log)
-        },
         "submission_status": submission_status,
         "results": [
             {
                 "task_id": result.get("Task ID", "N/A"),
                 "question": result.get("Question", "N/A"),
-                "submitted_answer": result.get("Submitted Answer", "N/A")
             }
             for result in results_log
         ]
@@ -241,6 +271,9 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
     Fetches all questions, runs the BasicAgent on them, submits all answers,
     and displays the results.
     """
     # --- Determine HF Space Runtime URL and Repo URL ---
     space_id = os.getenv("SPACE_ID")  # Get the SPACE_ID for sending link to the code
@@ -333,7 +366,8 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
         print("Agent did not produce any answers to submit.")
         status_message = "Agent did not produce any answers to submit."
         results_df = pd.DataFrame(results_log)
-        export_path = export_results_to_json(results_log, status_message)
         return status_message, results_df, export_path
     # 4. Prepare Submission
@@ -359,9 +393,28 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
             f"Message: {result_data.get('message', 'No message received.')}"
         )
         print("Submission successful.")
         results_df = pd.DataFrame(results_log)
-        # Export to JSON
-        export_path = export_results_to_json(results_log, final_status)
         return final_status, results_df, export_path
     except requests.exceptions.HTTPError as e:
         error_detail = f"Server responded with status {e.response.status_code}."
@@ -372,26 +425,30 @@ def run_and_submit_all(llm_provider: str, enable_fallback: bool, profile: gr.OAu
             error_detail += f" Response: {e.response.text[:500]}"
         status_message = f"Submission Failed: {error_detail}"
         print(status_message)
         results_df = pd.DataFrame(results_log)
-        export_path = export_results_to_json(results_log, status_message)
         return status_message, results_df, export_path
     except requests.exceptions.Timeout:
         status_message = "Submission Failed: The request timed out."
         print(status_message)
         results_df = pd.DataFrame(results_log)
-        export_path = export_results_to_json(results_log, status_message)
         return status_message, results_df, export_path
     except requests.exceptions.RequestException as e:
         status_message = f"Submission Failed: Network error - {e}"
         print(status_message)
         results_df = pd.DataFrame(results_log)
-        export_path = export_results_to_json(results_log, status_message)
         return status_message, results_df, export_path
     except Exception as e:
         status_message = f"An unexpected error occurred during submission: {e}"
         print(status_message)
         results_df = pd.DataFrame(results_log)
-        export_path = export_results_to_json(results_log, status_message)
         return status_message, results_df, export_path

 import pandas as pd
 import logging
 import json
+import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
 # Stage 1: Import GAIAAgent (LangGraph-based agent)
     return "\n".join([f"{k}: {v}" for k, v in keys_status.items()])
+def export_results_to_json(results_log: list, submission_status: str, execution_time: float = None,
+                          submission_response: dict = None) -> str:
     """Export evaluation results to JSON file for easy processing.
     - Local: Saves to ~/Downloads/gaia_results_TIMESTAMP.json
     - HF Spaces: Saves to ./exports/gaia_results_TIMESTAMP.json
     - Format: Clean JSON with full error messages, no truncation
+    Args:
+        results_log: List of question results
+        submission_status: Status message from submission
+        execution_time: Total execution time in seconds
+        submission_response: Response from GAIA API with correctness info
     """
     from datetime import datetime
         downloads_dir = os.path.expanduser("~/Downloads")
         filepath = os.path.join(downloads_dir, filename)
+    # Extract correctness info from submission response if available
+    correct_task_ids = set()
+    if submission_response and "results" in submission_response:
+        # If API provides per-question results
+        for item in submission_response.get("results", []):
+            if item.get("correct"):
+                correct_task_ids.add(item.get("task_id"))
     # Build JSON structure
+    metadata = {
+        "generated": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+        "timestamp": timestamp,
+        "total_questions": len(results_log)
+    }
+    # Add execution time if available
+    if execution_time is not None:
+        metadata["execution_time_seconds"] = round(execution_time, 2)
+        metadata["execution_time_formatted"] = f"{int(execution_time // 60)}m {int(execution_time % 60)}s"
+    # Add score info if available
+    if submission_response:
+        metadata["score_percent"] = submission_response.get("score")
+        metadata["correct_count"] = submission_response.get("correct_count")
+        metadata["total_attempted"] = submission_response.get("total_attempted")
     export_data = {
+        "metadata": metadata,
         "submission_status": submission_status,
         "results": [
             {
                 "task_id": result.get("Task ID", "N/A"),
                 "question": result.get("Question", "N/A"),
+                "submitted_answer": result.get("Submitted Answer", "N/A"),
+                "correct": result.get("Task ID") in correct_task_ids if correct_task_ids else None
             }
             for result in results_log
         ]
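
One edge case in the hunk above: because the flag is gated on `correct_task_ids` being non-empty, a response that does carry per-question results but marks none of them correct yields `null` for every entry rather than `false`. A minimal sketch of one way to separate the two cases follows; it is not part of the commit. `correctness_flags` is a hypothetical helper, and the `results`, `task_id`, and `correct` response fields are the ones this commit assumes, not a confirmed description of the scoring API.

```python
def correctness_flags(results_log, submission_response=None):
    """Return {task_id: True / False / None} for each entry in results_log.

    None  -> the response carried no per-question results at all.
    False -> per-question results were present and the task was not marked correct.
    """
    per_question = None
    if isinstance(submission_response, dict):
        per_question = submission_response.get("results")

    task_ids = [row.get("Task ID") for row in results_log]

    # A missing, malformed, or empty "results" field means "unknown" for every task
    if not isinstance(per_question, list) or not per_question:
        return {task_id: None for task_id in task_ids}

    correct_ids = {
        item.get("task_id") for item in per_question if item.get("correct")
    }
    return {task_id: task_id in correct_ids for task_id in task_ids}
```

The decision here is driven by whether per-question data exists, not by whether any answer happened to be correct, so a run that scores zero still reports `False` per entry.
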
     Fetches all questions, runs the BasicAgent on them, submits all answers,
     and displays the results.
     """
+    # Start execution timer
+    start_time = time.time()
     # --- Determine HF Space Runtime URL and Repo URL ---
     space_id = os.getenv("SPACE_ID")  # Get the SPACE_ID for sending link to the code
         print("Agent did not produce any answers to submit.")
         status_message = "Agent did not produce any answers to submit."
         results_df = pd.DataFrame(results_log)
+        execution_time = time.time() - start_time
+        export_path = export_results_to_json(results_log, status_message, execution_time)
         return status_message, results_df, export_path
     # 4. Prepare Submission
             f"Message: {result_data.get('message', 'No message received.')}"
         )
         print("Submission successful.")
+        execution_time = time.time() - start_time
+        logger.info(f"Total execution time: {execution_time:.2f} seconds ({int(execution_time // 60)}m {int(execution_time % 60)}s)")
+        # Extract correct task_ids from result_data if available
+        correct_task_ids = set()
+        if "results" in result_data:
+            for item in result_data.get("results", []):
+                if item.get("correct"):
+                    correct_task_ids.add(item.get("task_id"))
+        # Add "Correct?" column to results
+        for result in results_log:
+            task_id = result.get("Task ID")
+            if correct_task_ids:
+                result["Correct?"] = "✅ Yes" if task_id in correct_task_ids else "❌ No"
+            else:
+                # If no per-question data, show summary info
+                result["Correct?"] = f"See summary: {result_data.get('correct_count', '?')}/{result_data.get('total_attempted', '?')} correct"
         results_df = pd.DataFrame(results_log)
+        # Export to JSON with execution time and submission response
+        export_path = export_results_to_json(results_log, final_status, execution_time, result_data)
         return final_status, results_df, export_path
     except requests.exceptions.HTTPError as e:
         error_detail = f"Server responded with status {e.response.status_code}."
             error_detail += f" Response: {e.response.text[:500]}"
         status_message = f"Submission Failed: {error_detail}"
         print(status_message)
+        execution_time = time.time() - start_time
         results_df = pd.DataFrame(results_log)
+        export_path = export_results_to_json(results_log, status_message, execution_time)
         return status_message, results_df, export_path
     except requests.exceptions.Timeout:
         status_message = "Submission Failed: The request timed out."
         print(status_message)
+        execution_time = time.time() - start_time
         results_df = pd.DataFrame(results_log)
+        export_path = export_results_to_json(results_log, status_message, execution_time)
         return status_message, results_df, export_path
     except requests.exceptions.RequestException as e:
         status_message = f"Submission Failed: Network error - {e}"
         print(status_message)
+        execution_time = time.time() - start_time
         results_df = pd.DataFrame(results_log)
+        export_path = export_results_to_json(results_log, status_message, execution_time)
         return status_message, results_df, export_path
     except Exception as e:
         status_message = f"An unexpected error occurred during submission: {e}"
         print(status_message)
+        execution_time = time.time() - start_time
         results_df = pd.DataFrame(results_log)
+        export_path = export_results_to_json(results_log, status_message, execution_time)
         return status_message, results_df, export_path
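
The diff repeats `execution_time = time.time() - start_time` on each of the six return paths shown. A minimal sketch of an alternative, not taken from the commit, is a small timer object plus a formatter that reproduces the `Xm Ys` string. `Stopwatch` and `format_duration` are hypothetical names, and the sketch uses `time.perf_counter()`, a clock intended for measuring intervals, where the commit uses `time.time()`.

```python
import time


def format_duration(seconds):
    """Format a duration in seconds as 'Xm Ys', as the commit's f-string does."""
    return f"{int(seconds // 60)}m {int(seconds % 60)}s"


class Stopwatch:
    """Measure elapsed time since construction."""

    def __init__(self):
        self._start = time.perf_counter()

    def elapsed(self):
        """Return seconds elapsed since the stopwatch was created."""
        return time.perf_counter() - self._start


# Usage: create one timer at function start, read it wherever a result is exported
timer = Stopwatch()
# ... run the agent and submit answers here ...
elapsed = timer.elapsed()
print(f"Total execution time: {elapsed:.2f} seconds ({format_duration(elapsed)})")
```

With a single timer object, each return path only needs `timer.elapsed()`, which keeps the timing logic in one place if the format or clock changes later.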