feat: system error field, calculator fix, target task IDs, course vs GAIA docs
Changes:
- System error field: Changed to boolean yes/no with error_log
- Calculator threading fix: Handle signal.alarm() failure in threads
- Target task IDs: Debug feature to run specific questions
- Course vs GAIA: Documentation distinguishing course API from official GAIA
- Quick test script: test/test_quick_fixes.py for targeted testing
Modified:
- app.py: System error field, target task IDs UI, submission logic
- src/tools/calculator.py: Thread-safe timeout handling
- src/agent/graph.py: Evidence formatting for dict results
- src/agent/llm_client.py: Fallback mechanism archived
- CHANGELOG.md: All changes documented
- README.md: Added submission guide reference
Added:
- docs/gaia_submission_guide.md: Complete submission guide
- test/test_quick_fixes.py: Targeted question testing
Co-Authored-By: Claude <noreply@anthropic.com>
- .gitignore +4 -0
- CHANGELOG.md +278 -0
- PLAN.md +165 -0
- README.md +2 -0
- app.py +178 -32
- docs/gaia_submission_guide.md +120 -0
- src/agent/graph.py +100 -17
- src/agent/llm_client.py +83 -54
- src/tools/calculator.py +18 -5
- test/test_quick_fixes.py +162 -0
|
@@ -29,6 +29,10 @@ Thumbs.db
|
|
| 29 |
|
| 30 |
# Input documents (PDFs not allowed in HF Spaces)
|
| 31 |
input/*.pdf
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
|
| 33 |
# Runtime cache (not in git, served via app download)
|
| 34 |
_cache/
|
|
|
|
| 29 |
|
| 30 |
# Input documents (PDFs not allowed in HF Spaces)
|
| 31 |
input/*.pdf
|
| 32 |
+
input/
|
| 33 |
+
|
| 34 |
+
# Downloaded GAIA question files
|
| 35 |
+
input/*
|
| 36 |
|
| 37 |
# Runtime cache (not in git, served via app download)
|
| 38 |
_cache/
|
|
**CHANGELOG.md** (`@@ -1,5 +1,283 @@` - the entries below were added at the top, under the existing heading)

# Session Changelog
## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification

**Problem:** Confusion about which leaderboard we're submitting to. We mistakenly thought we needed to submit to the official GAIA leaderboard, but we're actually implementing the course assignment API.

**Root Cause:** Template code includes the course API (`agents-course-unit4-scoring.hf.space`), but the documentation didn't clarify the distinction between the course leaderboard and the official GAIA leaderboard.

**Solution:** Created `docs/gaia_submission_guide.md` documenting:
- **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
- **Official GAIA Leaderboard** (future): 450+ questions, different submission format
- API routes, submission formats, scoring differences
- Development workflow for both

**Key Clarifications:**

| Aspect | Course | Official GAIA |
|--------|--------|--------------|
| API | `agents-course-unit4-scoring.hf.space` | `gaia-benchmark/leaderboard` Space |
| Questions | 20 (level 1) | 450+ (all levels) |
| Target | 30% (6/20) | Competitive placement |
| Debug features | Target Task IDs, Question Limit | Must submit ALL |
| Submission | JSON POST | File upload |

**Created Files:**
- **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards

**Modified Files:**
- **README.md** - Added note linking to the submission guide

---
## [2026-01-12] [Feature] [COMPLETED] Target Specific Task IDs

**Problem:** No way to run specific questions for debugging. Had to run the full evaluation or use the "first N" limit, which is inefficient for targeted fixes.

**Solution:** Added a "Target Task IDs (Debug)" field in the Full Evaluation tab. Enter comma-separated task IDs to run only those questions.

**Implementation:**
- Added `eval_task_ids` textbox in the UI (lines 763-770)
- Updated the `run_and_submit_all()` signature: `task_ids: str = ""` parameter
- Filtering logic: parses comma-separated IDs, filters `questions_data`
- Shows a missing-IDs warning if a task_id is not found in the dataset
- Overrides `question_limit` when provided

**Usage:**
```
Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
```

**Modified Files:**
- **app.py** (~30 lines added)
  - UI: `eval_task_ids` textbox
  - `run_and_submit_all()`: added `task_ids` parameter, filtering logic
  - `run_button.click()`: pass task_ids to the function

---
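The filtering logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual app.py code; the function name and return shape are assumptions.

```python
def filter_by_task_ids(questions_data: list, task_ids: str):
    """Keep only questions whose task_id appears in a comma-separated list.

    Returns (filtered_questions, missing_ids). An empty/blank task_ids string
    means no filtering, mirroring the "run everything" default.
    """
    wanted = [t.strip() for t in task_ids.split(",") if t.strip()]
    if not wanted:
        return questions_data, []
    available = {q.get("task_id") for q in questions_data}
    missing = [t for t in wanted if t not in available]  # warn about these
    wanted_set = set(wanted)
    filtered = [q for q in questions_data if q.get("task_id") in wanted_set]
    return filtered, missing
```

When `filtered` is non-empty, a debug run would process only those entries and ignore the question limit, as the changelog notes.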
## [2026-01-12] [Bug Fix] [COMPLETED] Calculator Threading Issue

**Problem:** The calculator tool fails with `ValueError: signal only works in main thread of the main interpreter` when running in Gradio's ThreadPoolExecutor context.

**Root Cause:** `signal.alarm()` only works in the main thread. Our agent uses `ThreadPoolExecutor` for concurrent processing (max_workers=5).

**Solution:** Made timeout protection optional - catch ValueError/AttributeError and disable the timeout with a warning when not in the main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).

**Modified Files:**
- **src/tools/calculator.py** (~15 lines modified)
  - `timeout()` context manager: try/except for `signal.alarm()` failure
  - Logs a warning when timeout protection is disabled
  - Gracefully handles Windows (AttributeError for SIGALRM)

---
## [2026-01-12] [Feature] [COMPLETED] System Error Field

**Problem:** The "Unable to answer" output was ambiguous - unclear whether it was a technical failure or an AI response. User requested a simpler distinction: system error vs AI answer.

**Solution:** Changed to a boolean `system_error: yes/no` field:
- `system_error: yes` - technical/system error from our code (don't submit)
- `system_error: no` - AI response (submit the answer, even if wrong)
- Added an `error_log` field with full error details for system errors

**Implementation:**
- `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
- Results table: "System Error" column (yes/no), "Error Log" column (when yes)
- JSON export: `system_error` field, `error_log` field (when system error)
- Submission logic: only submit when `system_error == "no"`

**Modified Files:**
- **app.py** (~30 lines modified)
  - `a_determine_status()`: returns a tuple instead of a string
  - `process_single_question()`: uses the new format, adds `error_log`
  - Results table: "System Error" + "Error Log" columns
  - `export_results_to_json()`: include `system_error` and `error_log`

---
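The `(is_error, error_log)` contract can be illustrated with a sketch. The real helper is `a_determine_status()` in app.py; the name, the marker strings, and the heuristics below are illustrative assumptions, only the tuple shape comes from the changelog:

```python
from typing import Optional, Tuple

# Hypothetical markers for answers produced by our own error paths.
SYSTEM_ERROR_MARKERS = ("AGENT ERROR", "Traceback")

def determine_status(answer: str, raised: Optional[Exception]) -> Tuple[bool, Optional[str]]:
    """Return (is_system_error, error_log).

    is_system_error=True: our code failed, don't submit, keep the log.
    is_system_error=False: a genuine AI response, submit it even if wrong.
    """
    if raised is not None:
        return True, f"{type(raised).__name__}: {raised}"
    if any(marker in answer for marker in SYSTEM_ERROR_MARKERS):
        return True, answer
    return False, None
```

The caller would then map `True` to `"yes"` / `False` to `"no"` for the results table and only POST answers with `system_error == "no"`.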
| 97 |
+
|
| 98 |
+
## [2026-01-12] [Refactoring] [COMPLETED] Fallback UI Removal
|
| 99 |
+
|
| 100 |
+
**Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
|
| 101 |
+
|
| 102 |
+
**Solution:** Removed all fallback-related UI elements:
|
| 103 |
+
- Removed `enable_fallback_checkbox` from Test Question tab
|
| 104 |
+
- Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
|
| 105 |
+
- Removed `enable_fallback` parameter from `test_single_question()` function
|
| 106 |
+
- Removed `enable_fallback` parameter from `run_and_submit_all()` function
|
| 107 |
+
- Removed `ENABLE_LLM_FALLBACK` environment variable setting
|
| 108 |
+
- Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
|
| 109 |
+
|
| 110 |
+
**Modified Files:**
|
| 111 |
+
- **app.py** (~20 lines removed)
|
| 112 |
+
- Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
|
| 113 |
+
- Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
|
| 114 |
+
- Updated `test_button.click()` inputs to remove checkbox reference
|
| 115 |
+
- Updated `run_button.click()` inputs to remove checkbox reference
|
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
|
| 120 |
+
|
| 121 |
+
**Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
|
| 122 |
+
- 4 providers to test for each feature
|
| 123 |
+
- Complex debugging with multiple code paths
|
| 124 |
+
- Longer, less clear error messages
|
| 125 |
+
- Adding complexity without clear benefit
|
| 126 |
+
|
| 127 |
+
**Solution:** Archive fallback mechanism, use single provider only
|
| 128 |
+
- Removed fallback provider loop (Gemini → HF → Groq → Claude)
|
| 129 |
+
- Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
|
| 130 |
+
- If provider fails, error is raised immediately
|
| 131 |
+
- Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
|
| 132 |
+
|
| 133 |
+
**Benefits:**
|
| 134 |
+
- ✅ Reduced code complexity
|
| 135 |
+
- ✅ Faster debugging (one code path)
|
| 136 |
+
- ✅ Clearer error messages
|
| 137 |
+
- ✅ No double work on features
|
| 138 |
+
|
| 139 |
+
**Modified Files:**
|
| 140 |
+
- **src/agent/llm_client.py** (~25 lines removed)
|
| 141 |
+
- Simplified `_call_with_fallback()`: Removed fallback logic
|
| 142 |
+
- **dev/dev_260112_02_fallback_archived.md** (NEW)
|
| 143 |
+
- Archived fallback code documentation
|
| 144 |
+
- Migration guide for restoration if needed
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Search Results Not Being Extracted
|
| 149 |
+
|
| 150 |
+
**Problem:** Score dropped from 5% → 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
|
| 151 |
+
|
| 152 |
+
**Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
|
| 153 |
+
```python
|
| 154 |
+
{"results": [...], "source": "tavily", "query": "...", "count": N}
|
| 155 |
+
```
|
| 156 |
+
|
| 157 |
+
**Solution:** Handle both dict formats in evidence extraction:
|
| 158 |
+
```python
|
| 159 |
+
if isinstance(result, dict):
|
| 160 |
+
if "answer" in result:
|
| 161 |
+
evidence.append(result["answer"]) # Vision tools
|
| 162 |
+
elif "results" in result:
|
| 163 |
+
# Format search results as readable text
|
| 164 |
+
results_list = result.get("results", [])
|
| 165 |
+
formatted = []
|
| 166 |
+
for r in results_list[:3]:
|
| 167 |
+
formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
|
| 168 |
+
evidence.append("\n\n".join(formatted)) # Search tools
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
**Modified Files:**
|
| 172 |
+
- **src/agent/graph.py** (~40 lines modified)
|
| 173 |
+
- Updated evidence extraction in primary path
|
| 174 |
+
- Updated evidence extraction in fallback path
|
| 175 |
+
|
| 176 |
+
**Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
|
| 177 |
+
|
| 178 |
+
**Summary of Fixes (Session 2026-01-12):**
|
| 179 |
+
1. ✅ File download from HF dataset (5/5 files)
|
| 180 |
+
2. ✅ Absolute paths from script location
|
| 181 |
+
3. ✅ Evidence formatting for vision tools (dict → answer)
|
| 182 |
+
4. ✅ Evidence formatting for search tools (dict → formatted text)
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
+
## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Dict Results Not Being Extracted
|
| 187 |
+
|
| 188 |
+
**Problem:** Chess vision question returned "Unable to answer" even though vision tool correctly extracted the chess position.
|
| 189 |
+
|
| 190 |
+
**Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
|
| 191 |
+
|
| 192 |
+
**Solution:** Extract 'answer' field from dict results before adding to evidence:
|
| 193 |
+
```python
|
| 194 |
+
# Before
|
| 195 |
+
evidence.append(f"[{tool_name}] {result}") # Dict → string representation
|
| 196 |
+
|
| 197 |
+
# After
|
| 198 |
+
if isinstance(result, dict) and "answer" in result:
|
| 199 |
+
evidence.append(result["answer"]) # Extract answer field
|
| 200 |
+
elif isinstance(result, str):
|
| 201 |
+
evidence.append(result)
|
| 202 |
+
```
|
| 203 |
+
|
| 204 |
+
**Modified Files:**
|
| 205 |
+
- **src/agent/graph.py** (~15 lines modified)
|
| 206 |
+
- Updated `execute_node()`: Extract 'answer' from dict results
|
| 207 |
+
- Fixed both primary and fallback execution paths
|
| 208 |
+
|
| 209 |
+
**Test Result:** Simple search questions now work. Chess question still fails due to vision tool extracting wrong turn indicator (w instead of b).
|
| 210 |
+
|
| 211 |
+
**Known Issue:** Vision tool extracts "w - - 0 1" (White's turn) but question asks for Black's move. Ground truth is "Rd5" (Black move), so FEN extraction may have error.
|
| 212 |
+
|
| 213 |
+
---
|
| 214 |
+
|
| 215 |
+
## [2026-01-12] [File Download Fix] [COMPLETED] Absolute Path Fix - Vision Tool Now Works
|
| 216 |
+
|
| 217 |
+
**Problem:** Chess vision question returned "Unable to answer" even though file was downloaded successfully.
|
| 218 |
+
|
| 219 |
+
**Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
|
| 220 |
+
|
| 221 |
+
**Solution:** Return absolute paths from `download_task_file()`
|
| 222 |
+
- Changed: `target_path = os.path.join(save_dir, file_name)`
|
| 223 |
+
- To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
|
| 224 |
+
- Now tools can find files regardless of working directory
|
| 225 |
+
|
| 226 |
+
**Modified Files:**
|
| 227 |
+
- **app.py** (~3 lines modified)
|
| 228 |
+
- Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
|
| 229 |
+
|
| 230 |
+
**Test Result:** Vision tool now works with absolute path - correctly analyzes chess position
|
| 231 |
+
|
| 232 |
+
---
|
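The one-line change can be illustrated in isolation (helper name and paths are examples, not the app.py code):

```python
import os

def to_absolute(save_dir: str, file_name: str) -> str:
    """Resolve a download target against the cwd at download time.

    A relative path like "_cache/gaia_files/x.png" breaks once a worker
    thread runs with a different working directory; resolving eagerly
    pins the path so later Path(...).exists() checks succeed.
    """
    return os.path.abspath(os.path.join(save_dir, file_name))
```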
| 233 |
+
|
| 234 |
+
## [2026-01-12] [File Download Fix] [COMPLETED] GAIA File API Dead End - Switch to HF Dataset
|
| 235 |
+
|
| 236 |
+
**Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
|
| 237 |
+
|
| 238 |
+
**Investigation:**
|
| 239 |
+
- Checked API spec: Endpoint exists with proper documentation
|
| 240 |
+
- Tested download: HTTP 404 "No file path associated with task_id"
|
| 241 |
+
- Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
|
| 242 |
+
- Confirmed via Swagger UI: Same 404 error
|
| 243 |
+
|
| 244 |
+
**Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
|
| 245 |
+
|
| 246 |
+
**Solution:** Switch from evaluation API to GAIA dataset download
|
| 247 |
+
- Use `huggingface_hub.hf_hub_download()` to fetch files
|
| 248 |
+
- Download to `_cache/gaia_files/` (runtime cache)
|
| 249 |
+
- File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
|
| 250 |
+
- Added cache checking (reuse downloaded files)
|
| 251 |
+
|
| 252 |
+
**Files with attachments (5/20 questions):**
|
| 253 |
+
- `cca530fc`: Chess position image (.png)
|
| 254 |
+
- `99c9cc74`: Pie recipe audio (.mp3)
|
| 255 |
+
- `f918266a`: Python code (.py)
|
| 256 |
+
- `1f975693`: Calculus audio (.mp3)
|
| 257 |
+
- `7bd855d8`: Menu sales Excel (.xlsx)
|
| 258 |
+
|
| 259 |
+
**Modified Files:**
|
| 260 |
+
- **app.py** (~70 lines modified)
|
| 261 |
+
- Updated `download_task_file()`: Changed from evaluation API to HF dataset download
|
| 262 |
+
- Changed signature: `download_task_file(task_id, file_name, save_dir)`
|
| 263 |
+
- Added `huggingface_hub` import with cache checking
|
| 264 |
+
- Default directory: `_cache/gaia_files/` (runtime cache, not git)
|
| 265 |
+
- Flat file structure: `_cache/gaia_files/{file_name}`
|
| 266 |
+
- **app.py** (~5 lines modified)
|
| 267 |
+
- Updated `process_single_question()`: Pass `file_name` to download function
|
| 268 |
+
|
| 269 |
+
**Known Limitations:**
|
| 270 |
+
- Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
|
| 271 |
+
- `.mp3` audio files still unsupported
|
| 272 |
+
- `.py` code execution still unsupported
|
| 273 |
+
|
| 274 |
+
**Next Steps:**
|
| 275 |
+
1. Test new download implementation
|
| 276 |
+
2. Expand tool support for .mp3 (audio transcription)
|
| 277 |
+
3. Expand tool support for .py (code execution)
|
| 278 |
+
|
| 279 |
+
---
|
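The download-with-cache approach described above can be sketched as follows. This is a hedged sketch, not the actual app.py code: the repo id `gaia-benchmark/GAIA` and the `2023/{split}/` layout are assumptions based on the public GAIA dataset (which is gated, so a valid HF token must be present in the environment for the network path to work):

```python
import shutil
from pathlib import Path
from typing import Optional

CACHE_DIR = Path("_cache/gaia_files")

def download_task_file(task_id: str, file_name: str, split: str = "validation") -> Optional[str]:
    """Fetch a GAIA attachment into a flat local cache, reusing prior downloads."""
    if not file_name:
        return None  # question has no attachment
    target = CACHE_DIR / file_name
    if target.exists():  # cache hit: reuse the previously downloaded file
        return str(target.resolve())
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Lazy import so the cache-hit path works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    src = hf_hub_download(
        repo_id="gaia-benchmark/GAIA",          # assumed dataset repo (gated)
        filename=f"2023/{split}/{file_name}",   # assumed path layout
        repo_type="dataset",
    )
    shutil.copy(src, target)
    return str(target.resolve())  # absolute path, per the earlier path fix
```

Returning `target.resolve()` ties in with the absolute-path fix above: downstream tools can open the file regardless of working directory.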
## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA

**Problem:** Need to validate HF vision works before complex GAIA evaluation.
**PLAN.md** (`@@ -531,3 +531,168 @@ If Phase 0 reveals HF Inference API doesn't support vision:` - Phase 7 appended after the existing closing steps)

2. Test simple vision API call with Phi-3.5-vision-instruct
3. Document working pattern or confirm API doesn't support vision
4. Decision gate: GO to Phase 1 or pivot to backup options
---

## Phase 7: GAIA File Attachment Support

**Goal:** Enable the agent to download and process file attachments from GAIA questions.

**Problem:**
- Current code ignores the `file_name` field in GAIA questions
- Files are not downloaded from the `GET /files/{task_id}` endpoint
- Vision/file-parsing tools fail with the placeholder `<provided_image_path>`
- ~40% of questions (8/20) fail due to missing file handling

### Root Cause

**GAIA Question Structure:**
```json
{
  "task_id": "abc123",
  "question": "What's in this image?",
  "file_name": "chess.png",     // NULL if no file
  "file_path": "/files/abc123"  // NULL if no file
}
```

**Current Code (app.py:249-290):**
```python
def process_single_question(agent, item, index, total):
    task_id = item.get("task_id")
    question_text = item.get("question")
    # ❌ MISSING: Check file_name
    # ❌ MISSING: Download file
    # ❌ MISSING: Pass file_path to agent

    submitted_answer = agent(question_text)  # No file handling
```

**Result:** The LLM generates `vision(image_path="<provided_image_path>")` → File not found error

### Solution Architecture

**Step 1: Add File Download Function**

```python
import requests
from pathlib import Path
from typing import Optional

def download_task_file(task_id: str, save_dir: str = "input/") -> Optional[str]:
    """Download file attached to a GAIA question.

    Args:
        task_id: Question's task_id
        save_dir: Directory to save file

    Returns:
        File path if downloaded, None if no file
    """
    api_url = "https://agents-course-unit4-scoring.hf.space"
    file_url = f"{api_url}/files/{task_id}"

    response = requests.get(file_url, timeout=30)
    response.raise_for_status()

    # Get extension from Content-Type header
    content_type = response.headers.get('Content-Type', '')
    extension_map = {
        'image/png': '.png',
        'image/jpeg': '.jpg',
        'application/pdf': '.pdf',
        'text/csv': '.csv',
        'application/json': '.json',
        'application/vnd.ms-excel': '.xls',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': '.xlsx',
    }
    extension = extension_map.get(content_type, '.file')

    # Save file
    Path(save_dir).mkdir(exist_ok=True)
    file_path = f"{save_dir}{task_id}{extension}"
    with open(file_path, 'wb') as f:
        f.write(response.content)

    return file_path
```

**Step 2: Modify Question Processing**

```python
def process_single_question(agent, item, index, total):
    task_id = item.get("task_id")
    question_text = item.get("question")
    file_name = item.get("file_name")  # ✅ NEW

    # Download file if exists
    file_path = None
    if file_name:
        file_path = download_task_file(task_id)

    # Pass file info to agent
    submitted_answer = agent(question_text, file_path=file_path)  # ✅ NEW
```

**Step 3: Update LLM Context**

When file_path is provided, include it in the planning prompt:
```python
if file_path:
    question_context = f"Question: {question_text}\nAttached file: {file_path}"
else:
    question_context = question_text
```

### Implementation Steps

#### Step 7.1: Add File Download Function

- [ ] Create `download_task_file()` in app.py
- [ ] Handle Content-Type to extension mapping
- [ ] Handle 404 gracefully (no file for this task)
- [ ] Create input/ directory if it does not exist

#### Step 7.2: Modify Question Processing Loop

- [ ] Check `item.get("file_name")` in process_single_question
- [ ] Call download_task_file() if file_name exists
- [ ] Pass file_path to the agent invocation

#### Step 7.3: Update Agent to Handle file_path

- [ ] Modify the agent to accept an optional file_path parameter
- [ ] Include file info in the planning prompt
- [ ] Update tool selection to use real file paths

#### Step 7.4: Test File Handling

- [ ] Test with an image question (chess position)
- [ ] Test with a document question (Excel file)
- [ ] Verify no more `<provided_image_path>` errors

### Files to Modify

1. **app.py** (~80 lines added/modified)
   - Add download_task_file() function
   - Modify process_single_question() to handle files
   - Add input/ directory creation

2. **src/agent/graph.py** (~20 lines)
   - Update agent state to include file_path
   - Pass file info to the planning prompt

3. **.gitignore** (~2 lines)
   - Add input/ to ignore downloaded files

### Success Criteria

- [ ] Image questions: vision tool receives a real file path
- [ ] Document questions: parse_file tool receives a real file path
- [ ] No more `<provided_image_path>` errors
- [ ] Accuracy improves from 10% toward 30%+

### Expected Impact

| Before | After |
|--------|-------|
| 40% (8/20) fail with file errors | 0% file errors |
| Vision questions: all fail | Vision questions: can work |
| Document questions: all fail | Document questions: can work |
| Max accuracy: ~60% | Max accuracy: ~100% potential |
**README.md**

```diff
@@ -348,6 +348,8 @@ ENABLE_LLM_FALLBACK=false # Disable fallback for debugging single provider
 
 **Test Coverage:** 99 passing tests (~2min 40sec runtime)
 
+> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](docs/gaia_submission_guide.md) for distinction between Course and Official GAIA leaderboards.
+
 ## Workflow
 
 ### Dev Record Workflow
```
**app.py** (added lines in some hunks were lost in extraction and are marked `...`)

```diff
@@ -1,17 +1,18 @@
 import os
 import gradio as gr
 import requests
-import inspect
 import pandas as pd
 import logging
 import json
 import time
...
 from concurrent.futures import ThreadPoolExecutor, as_completed
 
 # Stage 1: Import GAIAAgent (LangGraph-based agent)
 from src.agent import GAIAAgent
 
 # Import ground truth comparison
...
 from src.utils.ground_truth import get_ground_truth
 
 # Configure logging
```
@@ -99,9 +100,14 @@ def export_results_to_json(
|
|
| 99 |
result_dict = {
|
| 100 |
"task_id": result.get("Task ID", "N/A"),
|
| 101 |
"question": result.get("Question", "N/A"),
|
|
|
|
| 102 |
"submitted_answer": result.get("Submitted Answer", "N/A"),
|
| 103 |
}
|
| 104 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
# Add correctness if available
|
| 106 |
if result.get("Correct?"):
|
| 107 |
result_dict["correct"] = (
|
|
@@ -201,7 +207,81 @@ def format_diagnostics(final_state: dict) -> str:
|
|
| 201 |
return "\n".join(diagnostics)
|
| 202 |
|
| 203 |
|
| 204 |
-
def
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 205 |
"""Test agent with a single question and return diagnostics."""
|
| 206 |
if not question or not question.strip():
|
| 207 |
return "Please enter a question.", "", check_api_keys()
|
|
@@ -209,11 +289,8 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
|
|
| 209 |
try:
|
| 210 |
# Set LLM provider from UI selection (overrides .env)
|
| 211 |
os.environ["LLM_PROVIDER"] = llm_provider.lower()
|
| 212 |
-
os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
|
| 213 |
|
| 214 |
-
logger.info(
|
| 215 |
-
f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
|
| 216 |
-
)
|
| 217 |
|
| 218 |
# Initialize agent
|
| 219 |
agent = GAIAAgent()
|
|
@@ -225,7 +302,7 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
|
|
| 225 |
final_state = agent.last_state or {}
|
| 226 |
|
| 227 |
# Format diagnostics with LLM provider info
|
| 228 |
-
provider_info = f"**LLM Provider:** {llm_provider}
|
| 229 |
diagnostics = provider_info + format_diagnostics(final_state)
|
| 230 |
api_status = check_api_keys()
|
| 231 |
|
|
@@ -246,12 +323,34 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
|
|
| 246 |
# Stage 6: Async processing with ThreadPoolExecutor
|
| 247 |
|
| 248 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 249 |
def process_single_question(agent, item, index, total):
|
| 250 |
"""Process single question with agent, return result with error handling.
|
|
|
|
| 251 |
|
| 252 |
Args:
|
| 253 |
agent: GAIAAgent instance
|
| 254 |
-
item: Question item dict with task_id and
|
| 255 |
index: Question index (0-based)
|
| 256 |
total: Total number of questions
|
| 257 |
|
|
@@ -260,40 +359,64 @@ def process_single_question(agent, item, index, total):

    """
    task_id = item.get("task_id")
    question_text = item.get("question")

    if not task_id or question_text is None:
        return {
            "task_id": task_id,
            "question": question_text,
-            "answer":
            "error": True,
        }

    try:
        logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
-
        logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")

        return {
            "task_id": task_id,
            "question": question_text,
            "answer": submitted_answer,
            "error": False,
        }
    except Exception as e:
        logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
        return {
            "task_id": task_id,
            "question": question_text,
-            "answer":
            "error": True,
        }


def run_and_submit_all(
    llm_provider: str,
-    enable_fallback: bool,
    question_limit: int = 0,
    profile: gr.OAuthProfile | None = None,
):
    """

@@ -302,8 +425,8 @@ def run_and_submit_all(

    Args:
        llm_provider: LLM provider to use
-        enable_fallback: Whether to enable fallback to other providers
        question_limit: Limit number of questions (0 = process all)
        profile: OAuth profile for HF login
    """
    # Start execution timer

@@ -325,10 +448,7 @@ def run_and_submit_all(

    # Set LLM provider from UI selection (overrides .env)
    os.environ["LLM_PROVIDER"] = llm_provider.lower()
-
-    logger.info(
-        f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
-    )

    # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
    try:

@@ -366,6 +486,27 @@ def run_and_submit_all(

            f"DEBUG MODE: Processing only {limit} questions (set to 0 to process all)"
        )

        print(f"Processing {len(questions_data)} questions.")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching questions: {e}")

@@ -421,9 +562,16 @@ def run_and_submit_all(

        result_entry = {
            "Task ID": result["task_id"],
            "Question": result["question"],
-            "
        }

        # Add ground truth data if available
        if is_correct is not None:
            result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"

@@ -433,8 +581,8 @@ def run_and_submit_all(

        results_log.append(result_entry)

-        # Add to submission payload if no error
-        if
            answers_payload.append(
                {"task_id": result["task_id"], "submitted_answer": result["answer"]}
            )

@@ -575,11 +723,6 @@ with gr.Blocks() as demo:

            value="HuggingFace",
            info="Select which LLM to use for this test",
        )
-        enable_fallback_checkbox = gr.Checkbox(
-            label="Enable Fallback",
-            value=False,
-            info="If enabled, falls back to other providers on failure",
-        )

        test_button = gr.Button("Run Test", variant="primary")

@@ -601,7 +744,6 @@ with gr.Blocks() as demo:

        inputs=[
            test_question_input,
            llm_provider_dropdown,
-            enable_fallback_checkbox,
        ],
        outputs=[test_answer_output, test_diagnostics_output, test_api_status],
    )

@@ -632,11 +774,6 @@ with gr.Blocks() as demo:

            value="HuggingFace",
            info="Select which LLM to use for all questions",
        )
-        eval_enable_fallback_checkbox = gr.Checkbox(
-            label="Enable Fallback",
-            value=True,
-            info="Recommended: Enable fallback for production evaluation",
-        )
        eval_question_limit = gr.Number(
            label="Question Limit (Debug)",
            value=0,

@@ -646,6 +783,15 @@ with gr.Blocks() as demo:

            info="Limit questions for testing (0 = process all)",
        )

        run_button = gr.Button("Run Evaluation & Submit All Answers")

        status_output = gr.Textbox(

@@ -660,8 +806,8 @@ with gr.Blocks() as demo:

        fn=run_and_submit_all,
        inputs=[
            eval_llm_provider_dropdown,
-            eval_enable_fallback_checkbox,
            eval_question_limit,
        ],
        outputs=[status_output, results_table, export_output],
    )

import os
import gradio as gr
import requests
import pandas as pd
import logging
import json
import time
+from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stage 1: Import GAIAAgent (LangGraph-based agent)
from src.agent import GAIAAgent

# Import ground truth comparison
+
from src.utils.ground_truth import get_ground_truth

# Configure logging

        result_dict = {
            "task_id": result.get("Task ID", "N/A"),
            "question": result.get("Question", "N/A"),
+            "system_error": result.get("System Error", "no"),
            "submitted_answer": result.get("Submitted Answer", "N/A"),
        }

+        # Add error log if system error
+        if result.get("System Error") == "yes" and result.get("Error Log"):
+            result_dict["error_log"] = result.get("Error Log")
+
        # Add correctness if available
        if result.get("Correct?"):
            result_dict["correct"] = (

    return "\n".join(diagnostics)


+def download_task_file(
+    task_id: str, file_name: str, save_dir: str = "_cache/gaia_files/"
+):
+    """Download file attached to a GAIA question from the GAIA dataset on HuggingFace.
+
+    The evaluation API's /files/{task_id} endpoint returns 404 because files are not
+    hosted there. Files must be downloaded from the official GAIA dataset instead.
+
+    Files are cached in _cache/ directory (runtime cache, not in git).
+
+    Args:
+        task_id: Question's task_id (used for logging)
+        file_name: Original file name from API (e.g., "task_id.png")
+        save_dir: Directory to save file (created if not exists)
+
+    Returns:
+        File path if downloaded successfully, None if download failed
+    """
+    import shutil
+    from huggingface_hub import hf_hub_download
+    import tempfile
+
+    # GAIA dataset file structure: 2023/validation/{task_id}.{ext}
+    # Extract file extension from file_name
+    _, ext = os.path.splitext(file_name)
+    ext = ext.lower()
+
+    # Try validation set first (most questions are from validation)
+    repo_id = "gaia-benchmark/GAIA"
+    possible_paths = [
+        f"2023/validation/{task_id}{ext}",
+        f"2023/test/{task_id}{ext}",
+    ]
+
+    # Create save directory if not exists (relative to script location)
+    # Use script's directory as base to ensure paths work in all environments (local, HF Space)
+    script_dir = Path(__file__).parent.absolute()
+    cache_dir = script_dir / save_dir
+    cache_dir.mkdir(exist_ok=True, parents=True)
+    target_path = str(cache_dir / file_name)
+
+    # Check if file already exists in cache (use absolute path for check)
+    if os.path.exists(target_path):
+        logger.info(f"Using cached file for {task_id}: {target_path}")
+        return target_path
+
+    # Try each possible path
+    for dataset_path in possible_paths:
+        try:
+            logger.info(f"Attempting to download {dataset_path} from GAIA dataset...")
+
+            # Download to temp dir first to get the file
+            with tempfile.TemporaryDirectory() as temp_dir:
+                downloaded_path = hf_hub_download(
+                    repo_id=repo_id,
+                    filename=dataset_path,
+                    repo_type="dataset",
+                    local_dir=temp_dir,
+                )
+
+                # Copy file to target location (flat structure in cache)
+                shutil.copy(downloaded_path, target_path)
+
+            logger.info(f"Downloaded file for {task_id}: {target_path}")
+            return target_path
+
+        except Exception as e:
+            logger.debug(f"Path {dataset_path} not found: {e}")
+            continue
+
+    logger.warning(f"File not found in GAIA dataset for task {task_id}")
+    return None
+
+
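The path-candidate logic above is easy to check in isolation. A minimal sketch under the same assumptions (the helper name `candidate_paths` is ours, not part of the codebase):

```python
import os

def candidate_paths(task_id: str, file_name: str) -> list[str]:
    """Derive the GAIA dataset paths tried by download_task_file:
    2023/{split}/{task_id}{ext}, validation split before test split."""
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()  # API file names may carry upper-case extensions
    return [
        f"2023/validation/{task_id}{ext}",
        f"2023/test/{task_id}{ext}",
    ]
```

Note the extension is taken from the API's `file_name`, not from the `task_id`, so mixed-case suffixes like `.PNG` still resolve.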
+
def test_single_question(question: str, llm_provider: str):
|
| 285 |
"""Test agent with a single question and return diagnostics."""
|
| 286 |
if not question or not question.strip():
|
| 287 |
return "Please enter a question.", "", check_api_keys()
|
|
|
|
| 289 |
try:
|
| 290 |
# Set LLM provider from UI selection (overrides .env)
|
| 291 |
os.environ["LLM_PROVIDER"] = llm_provider.lower()
|
|
|
|
| 292 |
|
| 293 |
+
logger.info(f"UI Config: LLM_PROVIDER={llm_provider}")
|
|
|
|
|
|
|
| 294 |
|
| 295 |
# Initialize agent
|
| 296 |
agent = GAIAAgent()
|
|
|
|
| 302 |
final_state = agent.last_state or {}
|
| 303 |
|
| 304 |
# Format diagnostics with LLM provider info
|
| 305 |
+
provider_info = f"**LLM Provider:** {llm_provider}\n\n"
|
| 306 |
diagnostics = provider_info + format_diagnostics(final_state)
|
| 307 |
api_status = check_api_keys()
|
| 308 |
|
|
|
|
# Stage 6: Async processing with ThreadPoolExecutor


+def a_determine_status(answer: str) -> tuple[bool, str | None]:
+    """Determine if response is system error or AI answer.
+
+    Returns:
+        (is_system_error, error_log)
+        - is_system_error: True if system error, False if AI answer
+        - error_log: Full error message if system error, None otherwise
+    """
+    if not answer:
+        return True, "Empty answer"
+
+    answer_lower = answer.lower().strip()
+
+    # System/technical errors from our code
+    if answer_lower.startswith("error:") or "no evidence collected" in answer_lower:
+        return True, answer  # Full error message as log
+
+    # Everything else is an AI response (including "Unable to answer")
+    return False, None
+
+
def process_single_question(agent, item, index, total):
    """Process single question with agent, return result with error handling.
+    Supports file attachments - downloads files before processing.

    Args:
        agent: GAIAAgent instance
+        item: Question item dict with task_id, question, and optional file_name
        index: Question index (0-based)
        total: Total number of questions

    """
    task_id = item.get("task_id")
    question_text = item.get("question")
+    file_name = item.get("file_name")

    if not task_id or question_text is None:
+        answer = "ERROR: Missing task_id or question"
+        is_error, error_log = a_determine_status(answer)
        return {
            "task_id": task_id,
            "question": question_text,
+            "answer": answer,
+            "system_error": "yes" if is_error else "no",
+            "error_log": error_log,
            "error": True,
        }

+    # Download file if question has attachment
+    file_path = None
+    if file_name:
+        file_path = download_task_file(task_id, file_name)
+        if file_path:
+            logger.info(f"[{index + 1}/{total}] File downloaded: {file_path}")
+        else:
+            logger.warning(f"[{index + 1}/{total}] File expected but not downloaded")
+
    try:
        logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
+
+        # Pass file_path to agent if available
+        submitted_answer = agent(question_text, file_path=file_path)
+
        logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")

+        is_error, error_log = a_determine_status(submitted_answer)
        return {
            "task_id": task_id,
            "question": question_text,
            "answer": submitted_answer,
+            "system_error": "yes" if is_error else "no",
+            "error_log": error_log,
            "error": False,
        }
    except Exception as e:
        logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
+        answer = f"ERROR: {str(e)}"
+        is_error, error_log = a_determine_status(answer)
        return {
            "task_id": task_id,
            "question": question_text,
+            "answer": answer,
+            "system_error": "yes" if is_error else "no",
+            "error_log": error_log,
            "error": True,
        }

def run_and_submit_all(
    llm_provider: str,
    question_limit: int = 0,
+    task_ids: str = "",
    profile: gr.OAuthProfile | None = None,
):
    """

    Args:
        llm_provider: LLM provider to use
        question_limit: Limit number of questions (0 = process all)
+        task_ids: Comma-separated task IDs to target (overrides question_limit)
        profile: OAuth profile for HF login
    """
    # Start execution timer

    # Set LLM provider from UI selection (overrides .env)
    os.environ["LLM_PROVIDER"] = llm_provider.lower()
+    logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")

    # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
    try:

            f"DEBUG MODE: Processing only {limit} questions (set to 0 to process all)"
        )

+        # Filter by specific task IDs if provided (overrides question limit)
+        if task_ids and task_ids.strip():
+            target_ids = [tid.strip() for tid in task_ids.split(",")]
+            original_count = len(questions_data)
+            questions_data = [
+                q for q in questions_data if q.get("task_id") in target_ids
+            ]
+            found_ids = [q.get("task_id") for q in questions_data]
+            missing_ids = set(target_ids) - set(found_ids)
+
+            if missing_ids:
+                logger.warning(f"Task IDs not found: {missing_ids}")
+
+            logger.warning(
+                f"DEBUG MODE: Targeted {len(questions_data)}/{original_count} questions by task_id"
+            )
+            print(
+                f"DEBUG MODE: Processing {len(questions_data)} targeted questions "
+                f"({len(missing_ids)} IDs not found: {missing_ids})"
+            )
+
        print(f"Processing {len(questions_data)} questions.")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching questions: {e}")

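The targeting logic above is pure list manipulation, so it can be factored out and unit-tested. A sketch (the helper name `filter_by_task_ids` is illustrative, not from the codebase):

```python
def filter_by_task_ids(questions: list[dict], task_ids_csv: str) -> tuple[list[dict], set[str]]:
    """Keep only questions whose task_id appears in the comma-separated
    input; also report requested IDs that were not found."""
    target_ids = [tid.strip() for tid in task_ids_csv.split(",") if tid.strip()]
    filtered = [q for q in questions if q.get("task_id") in target_ids]
    missing = set(target_ids) - {q.get("task_id") for q in filtered}
    return filtered, missing
```

Reporting the missing IDs matters in practice: a silent empty result after a typo in one task ID would look like an API failure.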
        result_entry = {
            "Task ID": result["task_id"],
            "Question": result["question"],
+            "System Error": result.get("system_error", "no"),
+            "Submitted Answer": ""
+            if result.get("system_error") == "yes"
+            else result["answer"],
        }

+        # Add error log if system error
+        if result.get("system_error") == "yes" and result.get("error_log"):
+            result_entry["Error Log"] = result["error_log"]
+
        # Add ground truth data if available
        if is_correct is not None:
            result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"

        results_log.append(result_entry)

+        # Add to submission payload if no system error
+        if result.get("system_error") == "no":
            answers_payload.append(
                {"task_id": result["task_id"], "submitted_answer": result["answer"]}
            )

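The submission gate above can be sketched as a small pure function over the per-question results (the name `build_answers_payload` is ours; the real code builds the list inline):

```python
def build_answers_payload(results: list[dict]) -> list[dict]:
    """Only answers without a system error are submitted; error rows stay
    in the results log but are withheld from the scoring payload."""
    return [
        {"task_id": r["task_id"], "submitted_answer": r["answer"]}
        for r in results
        if r.get("system_error") == "no"
    ]
```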
            value="HuggingFace",
            info="Select which LLM to use for this test",
        )

        test_button = gr.Button("Run Test", variant="primary")

        inputs=[
            test_question_input,
            llm_provider_dropdown,
        ],
        outputs=[test_answer_output, test_diagnostics_output, test_api_status],
    )

            value="HuggingFace",
            info="Select which LLM to use for all questions",
        )
        eval_question_limit = gr.Number(
            label="Question Limit (Debug)",
            value=0,

            info="Limit questions for testing (0 = process all)",
        )

+        with gr.Row():
+            eval_task_ids = gr.Textbox(
+                label="Target Task IDs (Debug)",
+                value="",
+                placeholder="task_id1, task_id2, ...",
+                info="Comma-separated task IDs to run (overrides question limit)",
+                lines=1,
+            )
+
        run_button = gr.Button("Run Evaluation & Submit All Answers")

        status_output = gr.Textbox(

        fn=run_and_submit_all,
        inputs=[
            eval_llm_provider_dropdown,
            eval_question_limit,
+            eval_task_ids,
        ],
        outputs=[status_output, results_table, export_output],
    )

@@ -0,0 +1,120 @@
# GAIA Submission Guide

## Two Different Leaderboards

### 1. Course Leaderboard (CURRENT - Course Assignment)

**API Endpoint:** `https://agents-course-unit4-scoring.hf.space`

**Purpose:** Hugging Face Agents Course Unit 4 assignment

**Dataset:** 20 questions from the GAIA validation set (level 1), filtered by tools/steps complexity

**Target Score:** 30% = **6/20 correct**

**API Routes:**
- `GET /questions` - Retrieve full list of evaluation questions
- `GET /random-question` - Fetch single random question
- `GET /files/{task_id}` - Download file associated with task
- `POST /submit` - Submit answers for scoring

**Submission Format:**
```json
{
  "username": "your-hf-username",
  "agent_code": "https://huggingface.co/spaces/your-username/your-space/tree/main",
  "answers": [
    {"task_id": "...", "submitted_answer": "..."}
  ]
}
```

**Scoring:** EXACT MATCH with ground truth
- Answer should be plain text, NO "FINAL ANSWER:" prefix
- Answer should be precise and well-formatted

**Debugging Features (Course-Specific):**
- ✅ "Target Task IDs" - Run specific questions for debugging
- ✅ "Question Limit" - Run first N questions for testing
- ✅ Course API is forgiving for development iteration

**Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/gaia-leaderboard

---

### 2. Official GAIA Leaderboard (FUTURE - Not Yet Implemented)

**Space:** https://huggingface.co/spaces/gaia-benchmark/leaderboard

**Purpose:** Official GAIA benchmark for the AI research community

**Dataset:** Full GAIA benchmark (450+ questions across 3 levels)

**Submission Format:** File upload (JSON) with model metadata
- Model name, family, parameters
- Complete answers for ALL questions
- Different evaluation process

**Status:** ⚠️ **FUTURE DEVELOPMENT** - Not implemented in this template

**Differences from Course:**

| Aspect | Course | Official GAIA |
|--------|--------|--------------|
| Dataset Size | 20 questions | 450+ questions |
| Submission Method | API POST | File upload |
| Question Filtering | Allowed for debugging | Must submit ALL |
| Scoring | Exact match | TBC (likely more flexible) |

**Documentation:** https://huggingface.co/datasets/gaia-benchmark/GAIA

---

## Implementation Notes

### Current Implementation Status

**✅ Implemented:**
- Course API integration (`/questions`, `/submit`, `/files/{task_id}`)
- Agent execution with LangGraph StateGraph
- OAuth login integration
- Debug features (Target Task IDs, Question Limit)
- Results export (JSON format)

**⚠️ Course Constraints:**
- Only 20 level 1 questions
- Exact match scoring (strict)
- Agent code must be public

**🔮 Future Work (Official GAIA):**
- File-based submission format
- Full 450+ question support
- Leaderboard-specific metadata
- Official evaluation pipeline

---

## Development Workflow

### For Course Assignment:

1. **Develop:** Use "Target Task IDs" to test specific questions
2. **Debug:** Use "Question Limit" for quick iteration
3. **Test:** Run full evaluation on all 20 questions
4. **Submit:** Course API evaluates exact match score
5. **Iterate:** Improve prompts, tools, reasoning

### For Official GAIA (Future):

1. **Generate:** Create submission JSON with all 450+ answers
2. **Format:** Follow official GAIA format requirements
3. **Upload:** Submit via gaia-benchmark/leaderboard Space
4. **Evaluate:** Official benchmark evaluation

---

## References

- **Course Documentation:** https://huggingface.co/learn/agents-course/en/unit4/hands-on
- **Course Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/gaia-leaderboard
- **Official GAIA Dataset:** https://huggingface.co/datasets/gaia-benchmark/GAIA
- **Official GAIA Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/leaderboard
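The course submission body from the guide can be assembled with a few lines of Python. A sketch, assuming only the payload shape documented above (the helper name `build_submission` is ours):

```python
import json

API_URL = "https://agents-course-unit4-scoring.hf.space"

def build_submission(username: str, space_url: str, answers: list[dict]) -> str:
    """Serialize the JSON body expected by POST /submit on the course API."""
    body = {
        "username": username,
        "agent_code": space_url,
        "answers": answers,
    }
    # To submit for real: requests.post(f"{API_URL}/submit", json=body)
    return json.dumps(body)
```

Because scoring is exact match, whatever is placed in `submitted_answer` should already be the bare final answer, with no "FINAL ANSWER:" prefix or surrounding prose.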
@@ -15,6 +15,7 @@ Based on:

import logging
import os
from typing import TypedDict, List, Optional
from langgraph.graph import StateGraph, END
from src.config import Settings

@@ -100,9 +101,12 @@ def validate_environment() -> List[str]:

# ============================================================================


-def fallback_tool_selection(
    """
    MVP Fallback: Simple keyword-based tool selection when LLM fails.

    This is a temporary hack to get basic functionality working.
    Uses simple keyword matching to select tools.

@@ -110,6 +114,7 @@ def fallback_tool_selection(question: str, plan: str) -> List[dict]:

    Args:
        question: The user question
        plan: The execution plan

    Returns:
        List of tool calls with basic parameters

@@ -147,17 +152,37 @@ def fallback_tool_selection(question: str, plan: str) -> List[dict]:

        })
        logger.info(f"[fallback_tool_selection] Added calculator tool with expression: {expression}")

-    # File tool:
-
-
-

    # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
    image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
    if any(keyword in combined for keyword in image_keywords):
-
-

    if not tool_calls:
        logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")

@@ -256,7 +281,10 @@ def execute_node(state: AgentState) -> AgentState:

    # Stage 3: Use LLM function calling to select tools and extract parameters
    logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
    tool_calls = select_tools_with_function_calling(
-        question=state["question"],
    )

    # Validate tool_calls result

@@ -264,13 +292,17 @@ def execute_node(state: AgentState) -> AgentState:

        logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
        state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
        # MVP HACK: Use fallback keyword-based tool selection
-        tool_calls = fallback_tool_selection(
        logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
    elif not isinstance(tool_calls, list):
        logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
        state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
        # MVP HACK: Use fallback
-        tool_calls = fallback_tool_selection(
    else:
        logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
        logger.debug(f"[execute_node] Tool calls: {tool_calls}")

@@ -305,8 +337,32 @@ def execute_node(state: AgentState) -> AgentState:

        }
    )

-    # Extract evidence
-

    except Exception as tool_error:
        logger.error(f"[execute_node] ✗ Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)

@@ -342,7 +398,9 @@ def execute_node(state: AgentState) -> AgentState:

    if not tool_calls:
        logger.info(f"[execute_node] Attempting fallback after exception...")
        try:
-            tool_calls = fallback_tool_selection(
            logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")

            # Try to execute fallback tools

@@ -367,7 +425,28 @@ def execute_node(state: AgentState) -> AgentState:

                "result": result,
                "status": "success"
            })
-            evidence
            logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
        except Exception as tool_error:
            logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")

@@ -504,22 +583,26 @@ class GAIAAgent:

        self.last_state = None  # Store last execution state for diagnostics
        print("GAIAAgent initialized successfully")

-    def __call__(self, question: str) -> str:
        """
        Process question and return answer.

        Args:
            question: GAIA question text

        Returns:
            Factoid answer string
        """
        print(f"GAIAAgent processing question (first 50 chars): {question[:50]}...")

        # Initialize state
        initial_state: AgentState = {
            "question": question,
-            "file_paths": None,
            "plan": None,
            "tool_calls": [],
            "tool_results": [],

import logging
import os
+from pathlib import Path
from typing import TypedDict, List, Optional
from langgraph.graph import StateGraph, END
from src.config import Settings

# ============================================================================


+def fallback_tool_selection(
+    question: str, plan: str, file_paths: Optional[List[str]] = None
+) -> List[dict]:
    """
    MVP Fallback: Simple keyword-based tool selection when LLM fails.
+    Enhanced to use actual file paths when available.

    This is a temporary hack to get basic functionality working.
    Uses simple keyword matching to select tools.

    Args:
        question: The user question
        plan: The execution plan
+        file_paths: Optional list of downloaded file paths

    Returns:
        List of tool calls with basic parameters

        })
        logger.info(f"[fallback_tool_selection] Added calculator tool with expression: {expression}")

+    # File tool: if file_paths available, use them
+    if file_paths:
+        for file_path in file_paths:
+            # Determine file type and appropriate tool
+            file_ext = Path(file_path).suffix.lower()
+            if file_ext in ['.png', '.jpg', '.jpeg']:
+                tool_calls.append({
+                    "tool": "vision",
+                    "params": {"image_path": file_path}
+                })
+                logger.info(f"[fallback_tool_selection] Added vision tool for image: {file_path}")
+            elif file_ext in ['.pdf', '.xlsx', '.xls', '.csv', '.json', '.txt', '.docx', '.doc']:
+                tool_calls.append({
+                    "tool": "parse_file",
+                    "params": {"file_path": file_path}
+                })
+                logger.info(f"[fallback_tool_selection] Added parse_file tool for: {file_path}")
+    else:
+        # Keyword-based file detection (legacy)
+        file_keywords = ["file", "parse", "read", "csv", "json", "txt", "document"]
+        if any(keyword in combined for keyword in file_keywords):
+            logger.warning("[fallback_tool_selection] File operation detected but no file_paths available")

    # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
    image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
    if any(keyword in combined for keyword in image_keywords):
+        if file_paths:
+            # Already handled above in file_paths check
+            pass
+        else:
+            logger.warning("[fallback_tool_selection] Image operation detected but no file_paths available")

    if not tool_calls:
        logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")

|
     # Stage 3: Use LLM function calling to select tools and extract parameters
     logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
     tool_calls = select_tools_with_function_calling(
+        question=state["question"],
+        plan=state["plan"],
+        available_tools=TOOLS,
+        file_paths=state.get("file_paths"),
     )
 
     # Validate tool_calls result
 
         logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
         state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
         # MVP HACK: Use fallback keyword-based tool selection
+        tool_calls = fallback_tool_selection(
+            state["question"], state["plan"], state.get("file_paths")
+        )
         logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
     elif not isinstance(tool_calls, list):
         logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
         state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
         # MVP HACK: Use fallback
+        tool_calls = fallback_tool_selection(
+            state["question"], state["plan"], state.get("file_paths")
+        )
     else:
         logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
         logger.debug(f"[execute_node] Tool calls: {tool_calls}")
                 }
             )
 
+            # Extract evidence - handle different result formats
+            if isinstance(result, dict):
+                # Vision tool returns {"answer": "..."}
+                if "answer" in result:
+                    evidence.append(result["answer"])
+                # Search tools return {"results": [...], "source": "...", "query": "..."}
+                elif "results" in result:
+                    # Format search results as readable text
+                    results_list = result.get("results", [])
+                    if results_list:
+                        # Take first 3 results and format them
+                        formatted = []
+                        for r in results_list[:3]:
+                            title = r.get("title", "")[:100]
+                            url = r.get("url", "")[:100]
+                            snippet = r.get("snippet", "")[:200]
+                            formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+                        evidence.append("\n\n".join(formatted))
+                    else:
+                        evidence.append(str(result))
+                else:
+                    evidence.append(str(result))
+            elif isinstance(result, str):
+                evidence.append(result)
+            else:
+                evidence.append(str(result))
 
         except Exception as tool_error:
             logger.error(f"[execute_node] ✗ Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)
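The evidence-formatting branch added above can be read as a pure function over the two tool result contracts it names ({"answer": ...} from vision, {"results": [...]} from search). A minimal sketch, with `extract_evidence` as a hypothetical name for illustration only:

```python
def extract_evidence(result) -> str:
    """Flatten a tool result into a readable evidence string."""
    if isinstance(result, dict):
        if "answer" in result:          # vision-style result
            return result["answer"]
        if "results" in result:         # search-style result
            results_list = result.get("results", [])
            if results_list:
                formatted = []
                for r in results_list[:3]:  # cap at the first 3 hits
                    formatted.append(
                        f"Title: {r.get('title', '')[:100]}\n"
                        f"URL: {r.get('url', '')[:100]}\n"
                        f"Snippet: {r.get('snippet', '')[:200]}"
                    )
                return "\n\n".join(formatted)
        return str(result)              # unknown dict shape
    return result if isinstance(result, str) else str(result)
```

Truncating title/url/snippet keeps the evidence list bounded regardless of how verbose a search backend is.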
         if not tool_calls:
             logger.info(f"[execute_node] Attempting fallback after exception...")
             try:
+                tool_calls = fallback_tool_selection(
+                    state["question"], state.get("plan", ""), state.get("file_paths")
+                )
                 logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")
 
                 # Try to execute fallback tools
 
                         "result": result,
                         "status": "success"
                     })
+                    # Extract evidence - handle different result formats
+                    if isinstance(result, dict):
+                        if "answer" in result:
+                            evidence.append(result["answer"])
+                        elif "results" in result:
+                            results_list = result.get("results", [])
+                            if results_list:
+                                formatted = []
+                                for r in results_list[:3]:
+                                    title = r.get("title", "")[:100]
+                                    url = r.get("url", "")[:100]
+                                    snippet = r.get("snippet", "")[:200]
+                                    formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+                                evidence.append("\n\n".join(formatted))
+                            else:
+                                evidence.append(str(result))
+                        else:
+                            evidence.append(str(result))
+                    elif isinstance(result, str):
+                        evidence.append(result)
+                    else:
+                        evidence.append(str(result))
                     logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
                 except Exception as tool_error:
                     logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")
         self.last_state = None  # Store last execution state for diagnostics
         print("GAIAAgent initialized successfully")
 
+    def __call__(self, question: str, file_path: Optional[str] = None) -> str:
         """
         Process question and return answer.
+        Supports optional file attachment for file-based questions.
 
         Args:
             question: GAIA question text
+            file_path: Optional path to downloaded file attachment
 
         Returns:
             Factoid answer string
         """
         print(f"GAIAAgent processing question (first 50 chars): {question[:50]}...")
+        if file_path:
+            print(f"GAIAAgent processing file: {file_path}")
 
         # Initialize state
         initial_state: AgentState = {
             "question": question,
+            "file_paths": [file_path] if file_path else None,
             "plan": None,
             "tool_calls": [],
             "tool_results": [],
@@ -158,7 +158,10 @@ def _get_provider_function(function_name: str, provider: str) -> Callable:
 
 def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
     """
-    Call LLM function with configured provider
 
     Args:
         function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
 
@@ -168,55 +171,28 @@ def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
         Result from LLM call
 
     Raises:
-        Exception: If
     """
     # Read config at runtime for UI flexibility
     primary_provider = os.getenv("LLM_PROVIDER", "gemini").lower()
-    enable_fallback = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
 
-    #
-
-    # Try primary provider
     try:
         primary_func = _get_provider_function(function_name, primary_provider)
-        logger.info(f"[{function_name}] Using
         return retry_with_backoff(lambda: primary_func(*args, **kwargs))
     except Exception as primary_error:
-        logger.
         )
 
-        # If fallback disabled, raise immediately
-        if not enable_fallback:
-            logger.error(f"[{function_name}] Fallback disabled. Failing fast.")
-            raise Exception(
-                f"{function_name} failed with {primary_provider}: {primary_error}. "
-                f"Fallback disabled (ENABLE_LLM_FALLBACK=false)"
-            )
-
-        # Try fallback providers in order
-        errors = {primary_provider: primary_error}
-        for fallback_provider in fallback_providers:
-            try:
-                fallback_func = _get_provider_function(function_name, fallback_provider)
-                logger.info(
-                    f"[{function_name}] Trying fallback provider: {fallback_provider}"
-                )
-                return retry_with_backoff(lambda: fallback_func(*args, **kwargs))
-            except Exception as fallback_error:
-                errors[fallback_provider] = fallback_error
-                logger.warning(
-                    f"[{function_name}] Fallback provider {fallback_provider} failed: {fallback_error}"
-                )
-                continue
-
-        # All providers failed
-        error_summary = ", ".join([f"{k}: {v}" for k, v in errors.items()])
-        logger.error(f"[{function_name}] All providers failed. {error_summary}")
-        raise Exception(f"{function_name} failed with all providers. {error_summary}")
-
 
 # ============================================================================
 # Client Initialization
@@ -560,7 +536,7 @@ def plan_question(
 
 
 def select_tools_claude(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Claude function calling to select tools and extract parameters."""
     client = create_claude_client()
 
@@ -580,15 +556,28 @@ def select_tools_claude(
         }
     )
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
 
@@ -633,7 +622,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_gemini(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Gemini function calling to select tools and extract parameters."""
     model = create_gemini_client()
 
@@ -665,15 +654,28 @@ def select_tools_gemini(
         )
     )
 
     prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}
 
@@ -718,7 +720,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_hf(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
     client = create_hf_client()
 
@@ -748,15 +750,28 @@ def select_tools_hf(
 
         tools.append(tool_schema)
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
 
@@ -766,7 +781,7 @@ Plan:
 Select and call the tools needed according to the plan. Use exact parameter names from tool schemas."""
 
     logger.info(
-        f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools"
     )
 
     messages = [
 
@@ -807,7 +822,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_groq(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Groq with function calling to select tools and extract parameters."""
     client = create_groq_client()
 
@@ -837,15 +852,28 @@ def select_tools_groq(
 
         tools.append(tool_schema)
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
 
@@ -900,7 +928,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_with_function_calling(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """
     Use LLM function calling to dynamically select tools and extract parameters.
 
@@ -913,11 +941,12 @@ def select_tools_with_function_calling(
         question: GAIA question text
         plan: Execution plan from planning phase
         available_tools: Tool registry
 
     Returns:
         List of tool calls with extracted parameters
     """
-    return _call_with_fallback("select_tools", question, plan, available_tools)
 
 
 # ============================================================================
 
 def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
     """
+    Call LLM function with configured provider.
+
+    NOTE: Fallback mechanism has been archived to reduce complexity.
+    Only the primary provider is used. If it fails, the error is raised directly.
 
     Args:
         function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
 
         Result from LLM call
 
     Raises:
+        Exception: If primary provider fails
     """
     # Read config at runtime for UI flexibility
     primary_provider = os.getenv("LLM_PROVIDER", "gemini").lower()
 
+    # ============================================================================
+    # ARCHIVED: Fallback mechanism removed to reduce complexity
+    # Original fallback code was at: dev/dev_260112_02_fallback_archived.md
+    # To restore: Check git history or archived dev file
+    # ============================================================================
 
+    # Try primary provider only (no fallback)
     try:
         primary_func = _get_provider_function(function_name, primary_provider)
+        logger.info(f"[{function_name}] Using provider: {primary_provider}")
         return retry_with_backoff(lambda: primary_func(*args, **kwargs))
     except Exception as primary_error:
+        logger.error(f"[{function_name}] Provider {primary_provider} failed: {primary_error}")
+        raise Exception(
+            f"{function_name} failed with {primary_provider}: {primary_error}"
+        )
 
 
 # ============================================================================
 # Client Initialization
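The primary-only path above still wraps each provider call in `retry_with_backoff`. A minimal sketch of that pattern, under stated assumptions (attempt count, exponential delays, and the function name itself are illustrative; the repo's actual `retry_with_backoff` lives elsewhere and may differ):

```python
import time

def retry_with_backoff(fn, attempts: int = 3, base_delay: float = 0.0):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            time.sleep(base_delay * (2 ** i))  # 0, base, 2*base, ...
    raise Exception(f"all {attempts} attempts failed: {last_err}")
```

With fallback archived, this retry loop is the only resilience layer left around the provider call, which is exactly why the commit raises the primary error directly once retries are exhausted.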
 
 
 def select_tools_claude(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Claude function calling to select tools and extract parameters."""
     client = create_claude_client()
 
         }
     )
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 
 def select_tools_gemini(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Gemini function calling to select tools and extract parameters."""
     model = create_gemini_client()
 
         )
     )
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}
 
 
 def select_tools_hf(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
     client = create_hf_client()
 
         tools.append(tool_schema)
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 Select and call the tools needed according to the plan. Use exact parameter names from tool schemas."""
 
     logger.info(
+        f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools, file_paths={file_paths}"
     )
 
     messages = [
 
 
 def select_tools_groq(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Groq with function calling to select tools and extract parameters."""
     client = create_groq_client()
 
         tools.append(tool_schema)
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 
 def select_tools_with_function_calling(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """
     Use LLM function calling to dynamically select tools and extract parameters.
 
         question: GAIA question text
         plan: Execution plan from planning phase
         available_tools: Tool registry
+        file_paths: Optional list of downloaded file paths for file-based questions
 
     Returns:
         List of tool calls with extracted parameters
     """
+    return _call_with_fallback("select_tools", question, plan, available_tools, file_paths)
 
 
 # ============================================================================
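The same file-context block is duplicated verbatim in all four provider functions above. A sketch of how it could be factored into one helper (the helper name `build_file_context` is hypothetical; the wording mirrors the prompt text in the diff):

```python
from typing import List, Optional

def build_file_context(file_paths: Optional[List[str]]) -> str:
    """Render the shared file-availability block appended to tool-selection prompts."""
    if not file_paths:
        return ""
    listing = "\n".join(f"- {fp}" for fp in file_paths)
    return (
        "\n\nIMPORTANT: These files are available for this question:\n"
        f"{listing}\n\n"
        "When selecting tools, use the ACTUAL file paths listed above. "
        "Do NOT use placeholder paths."
    )
```

Deduplicating this way would keep the anti-placeholder instruction identical across Claude, Gemini, HF, and Groq prompts as the wording evolves.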
@@ -93,20 +93,33 @@ def timeout(seconds: int):
 
     Raises:
         TimeoutError: If execution exceeds timeout
     """
     def timeout_handler(signum, frame):
         raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
 
-    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
-    signal.alarm(seconds)
-
 
     try:
         yield
     finally:
         # Restore old handler and cancel alarm
-        signal.alarm(0)
-        signal.signal(signal.SIGALRM, old_handler)
 
 
 # ============================================================================
 
     Raises:
         TimeoutError: If execution exceeds timeout
+
+    Note:
+        signal.alarm() only works in main thread. In threaded contexts
+        (Gradio, ThreadPoolExecutor), timeout protection is disabled.
     """
     def timeout_handler(signum, frame):
         raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
 
+    try:
+        # Set signal handler (only works in main thread)
+        old_handler = signal.signal(signal.SIGALRM, timeout_handler)
+        signal.alarm(seconds)
+        _alarm_set = True
+    except (ValueError, AttributeError):
+        # ValueError: signal.alarm() in non-main thread
+        # AttributeError: signal.SIGALRM not available (Windows)
+        logger.warning(f"Timeout protection disabled (threading/Windows limitation)")
+        _alarm_set = False
+        old_handler = None
 
     try:
         yield
     finally:
         # Restore old handler and cancel alarm
+        if _alarm_set and old_handler is not None:
+            signal.alarm(0)
+            signal.signal(signal.SIGALRM, old_handler)
 
 
 # ============================================================================
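The calculator fix above catches the `ValueError` that `signal.signal()` raises outside the main thread. An equivalent stand-alone sketch that checks the thread explicitly instead of catching (POSIX-only for the alarm path; `safe_timeout` is an illustrative name, and the diff's logger call is omitted here):

```python
import signal
import threading
from contextlib import contextmanager

@contextmanager
def safe_timeout(seconds: int):
    """Yield True if alarm-based timeout protection is active, False otherwise."""
    def handler(signum, frame):
        raise TimeoutError(f"exceeded {seconds}s timeout")

    alarm_set = False
    old_handler = None
    # signal.signal/alarm are only legal in the main thread, and SIGALRM
    # does not exist on Windows, so degrade gracefully in both cases.
    if threading.current_thread() is threading.main_thread() and hasattr(signal, "SIGALRM"):
        old_handler = signal.signal(signal.SIGALRM, handler)
        signal.alarm(seconds)
        alarm_set = True
    try:
        yield alarm_set
    finally:
        if alarm_set:
            signal.alarm(0)                           # cancel pending alarm
            signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

Either style (check-first or catch) leaves worker threads — the case Gradio hits — running without a timeout, which is the documented trade-off of this fix.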
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""
+Quick test script for specific GAIA questions.
+Use this to verify fixes without running full evaluation.
+
+Usage:
+    uv run python test/test_quick_fixes.py
+"""
+
+import os
+import sys
+
+# Add project root to path
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from src.agent.graph import GAIAAgent
+from dotenv import load_dotenv
+
+# Load environment variables
+load_dotenv()
+
+# ============================================================================
+# CONFIG - Questions to test
+# ============================================================================
+
+TEST_QUESTIONS = [
+    {
+        "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+        "name": "Reverse sentence (calculator threading fix)",
+        "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+        "expected": "Right",
+    },
+    {
+        "task_id": "6f37996b-2ac7-44b0-8e68-6d28256631b4",
+        "name": "Table commutativity (LLM issue - table in question)",
+        "question": '''Given this table defining * on the set S = {a, b, c, d, e}
+
+|*|a|b|c|d|e|
+|---|---|---|---|
+|a|a|b|c|b|d|
+|b|b|c|a|e|c|
+|c|c|a|b|b|a|
+|d|b|e|b|e|d|
+|e|d|b|a|d|c|
+
+provide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.''',
+        "expected": "b, e",
+    },
+]
+
+# ============================================================================
+
+
+def test_question(agent: GAIAAgent, test_case: dict) -> dict:
+    """Test a single question and return result."""
+    task_id = test_case["task_id"]
+    question = test_case["question"]
+    expected = test_case.get("expected", "N/A")
+
+    print(f"\n{'='*60}")
+    print(f"Testing: {test_case['name']}")
+    print(f"Task ID: {task_id}")
+    print(f"Expected: {expected}")
+    print(f"{'='*60}")
+
+    try:
+        answer = agent(question, file_path=None)
+
+        # Check if answer matches expected
+        is_correct = answer.strip().lower() == expected.lower().strip()
+
+        result = {
+            "task_id": task_id,
+            "name": test_case["name"],
+            "question": question[:100] + "..." if len(question) > 100 else question,
+            "expected": expected,
+            "answer": answer,
+            "correct": is_correct,
+            "status": "success",
+        }
+
+        # Determine system error
+        if not answer:
+            result["system_error"] = "yes"
+        elif answer.lower().startswith("error:") or "no evidence collected" in answer.lower():
+            result["system_error"] = "yes"
+            result["error_log"] = answer
+        else:
+            result["system_error"] = "no"
+
+    except Exception as e:
+        result = {
+            "task_id": task_id,
+            "name": test_case["name"],
+            "question": question[:100] + "..." if len(question) > 100 else question,
+            "expected": expected,
+            "answer": f"ERROR: {str(e)}",
+            "correct": False,
+            "status": "error",
+            "system_error": "yes",
+            "error_log": str(e),
+        }
+
+    # Print result
+    status_icon = "✅" if result["correct"] else "❌" if result["system_error"] == "no" else "⚠️"
+    print(f"\n{status_icon} Result: {result['answer'][:100]}")
+    if result["system_error"] == "yes":
+        print(f"  System Error: Yes")
+        if result.get("error_log"):
+            print(f"  Error: {result['error_log'][:100]}")
+
+    return result
+
+
+def main():
+    """Run quick tests on specific questions."""
+    print("\n" + "="*60)
+    print("GAIA Quick Test - Verify Fixes")
+    print("="*60)
+
+    # Check LLM provider
+    llm_provider = os.getenv("LLM_PROVIDER", "gemini")
+    print(f"\nLLM Provider: {llm_provider}")
+
+    # Initialize agent
+    print("\nInitializing agent...")
+    try:
+        agent = GAIAAgent()
+        print("✅ Agent initialized")
+    except Exception as e:
+        print(f"❌ Agent initialization failed: {e}")
+        return
+
+    # Run tests
+    results = []
+    for test_case in TEST_QUESTIONS:
+        result = test_question(agent, test_case)
+        results.append(result)
+
+    # Summary
+    print(f"\n{'='*60}")
+    print("SUMMARY")
+    print(f"{'='*60}")
+
+    success_count = sum(1 for r in results if r["correct"])
+    error_count = sum(1 for r in results if r["system_error"] == "yes")
+    ai_fail_count = sum(1 for r in results if r["system_error"] == "no" and not r["correct"])
+
+    print(f"\nTotal: {len(results)}")
+    print(f"✅ Correct: {success_count}")
+    print(f"⚠️  System Errors: {error_count}")
+    print(f"❌ AI Wrong: {ai_fail_count}")
+
+    # Detailed results
+    print(f"\nDetailed Results:")
+    for r in results:
+        status = "✅" if r["correct"] else "⚠️" if r["system_error"] == "yes" else "❌"
+        print(f"  {status} {r['name']}: {r['answer'][:50]}{'...' if len(r['answer']) > 50 else ''}")
+
+
+if __name__ == "__main__":
+    main()
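The script's three-way triage (correct / system error / AI wrong) can be isolated as a pure function, which makes the boolean yes/no system-error field from this commit easy to unit-test without an agent. A sketch under stated assumptions (`triage` is a hypothetical name; the heuristics mirror `test_question` above):

```python
def triage(answer: str, expected: str) -> dict:
    """Classify one run: exact-match grading plus the system-error heuristic."""
    correct = answer.strip().lower() == expected.strip().lower()
    # System error: empty answer, explicit error prefix, or no evidence collected.
    if not answer or answer.lower().startswith("error:") or "no evidence collected" in answer.lower():
        system_error = "yes"
    else:
        system_error = "no"
    return {"correct": correct, "system_error": system_error}
```

An answer can be both non-erroneous and wrong ("AI Wrong" in the summary), which is exactly the bucket the summary counts with `system_error == "no" and not correct`.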