mangubee and Claude committed
Commit 41ac444 · 1 Parent(s): 7d0cc73

feat: system error field, calculator fix, target task IDs, course vs GAIA docs

Changes:
- System error field: Changed to boolean yes/no with error_log
- Calculator threading fix: Handle signal.alarm() failure in threads
- Target task IDs: Debug feature to run specific questions
- Course vs GAIA: Documentation distinguishing course API from official GAIA
- Quick test script: test/test_quick_fixes.py for targeted testing

Modified:
- app.py: System error field, target task IDs UI, submission logic
- src/tools/calculator.py: Thread-safe timeout handling
- src/agent/graph.py: Evidence formatting for dict results
- src/agent/llm_client.py: Fallback mechanism archived
- CHANGELOG.md: All changes documented
- README.md: Added submission guide reference

Added:
- docs/gaia_submission_guide.md: Complete submission guide
- test/test_quick_fixes.py: Targeted question testing

Co-Authored-By: Claude <noreply@anthropic.com>

.gitignore CHANGED
@@ -29,6 +29,10 @@ Thumbs.db
 
 # Input documents (PDFs not allowed in HF Spaces)
 input/*.pdf
+ input/
+
+ # Downloaded GAIA question files
+ input/*
 
 # Runtime cache (not in git, served via app download)
 _cache/
CHANGELOG.md CHANGED
@@ -1,5 +1,283 @@
 # Session Changelog
 
+ ## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification
+
+ **Problem:** Confusion about which leaderboard we're submitting to. We mistakenly thought we needed to submit to official GAIA, but we're actually implementing the course assignment API.
+
+ **Root Cause:** The template code includes the course API (`agents-course-unit4-scoring.hf.space`), but the documentation didn't clarify the distinction between the course leaderboard and the official GAIA leaderboard.
+
+ **Solution:** Created `docs/gaia_submission_guide.md` documenting:
+ - **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
+ - **Official GAIA Leaderboard** (future): 450+ questions, different submission format
+ - API routes, submission formats, scoring differences
+ - Development workflow for both
+
+ **Key Clarifications:**
+ | Aspect | Course | Official GAIA |
+ |--------|--------|---------------|
+ | API | `agents-course-unit4-scoring.hf.space` | `gaia-benchmark/leaderboard` Space |
+ | Questions | 20 (level 1) | 450+ (all levels) |
+ | Target | 30% (6/20) | Competitive placement |
+ | Debug features | Target Task IDs, Question Limit | Must submit ALL |
+ | Submission | JSON POST | File upload |
+
+ **Created Files:**
+ - **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards
+
+ **Modified Files:**
+ - **README.md** - Added a note linking to the submission guide
+
+ ---
+
+ ## [2026-01-12] [Feature] [COMPLETED] Target Specific Task IDs
+
+ **Problem:** No way to run specific questions for debugging. We had to run the full evaluation or use the "first N" limit, which is inefficient for targeted fixes.
+
+ **Solution:** Added a "Target Task IDs (Debug)" field in the Full Evaluation tab. Enter comma-separated task IDs to run only those questions.
+
+ **Implementation:**
+ - Added `eval_task_ids` textbox in UI (lines 763-770)
+ - Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
+ - Filtering logic: Parses comma-separated IDs, filters `questions_data`
+ - Shows a warning listing missing IDs if a task_id is not found in the dataset
+ - Overrides `question_limit` when provided
+
+ **Usage:**
+ ```
+ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
+ ```
+
+ **Modified Files:**
+ - **app.py** (~30 lines added)
+   - UI: `eval_task_ids` textbox
+   - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic
+   - `run_button.click()`: Pass `task_ids` to the function
+
+ ---
+
+ ## [2026-01-12] [Bug Fix] [COMPLETED] Calculator Threading Issue
+
+ **Problem:** The calculator tool fails with `ValueError: signal only works in main thread of the main interpreter` when running in Gradio's ThreadPoolExecutor context.
+
+ **Root Cause:** `signal.alarm()` only works in the main thread. Our agent uses `ThreadPoolExecutor` for concurrent processing (max_workers=5).
+
+ **Solution:** Made timeout protection optional - the `timeout()` context manager catches ValueError/AttributeError and disables the timeout with a warning when not in the main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).
+
+ **Modified Files:**
+ - **src/tools/calculator.py** (~15 lines modified)
+   - `timeout()` context manager: Try/except for `signal.alarm()` failure
+   - Logs a warning when timeout protection is disabled
+   - Gracefully handles Windows (AttributeError for missing SIGALRM)
+
+ ---
+
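The try/except pattern described above can be sketched like this. It is a simplified version, not the exact calculator.py code: `signal.signal()` raises `ValueError` outside the main thread, and `signal.SIGALRM` is missing on Windows (`AttributeError`); in both cases the block runs without a timeout and logs a warning.

```python
import logging
import signal
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def timeout(seconds):
    """Abort after `seconds` via SIGALRM when possible; otherwise run unguarded."""

    def _handler(signum, frame):
        raise TimeoutError(f"evaluation exceeded {seconds}s")

    armed = False
    try:
        # ValueError off the main thread; AttributeError if SIGALRM is missing
        signal.signal(signal.SIGALRM, _handler)
        signal.alarm(seconds)
        armed = True
    except (ValueError, AttributeError):
        logger.warning("Timeout protection disabled (not main thread or no SIGALRM)")
    try:
        yield
    finally:
        if armed:
            signal.alarm(0)  # disarm the pending alarm
```

Inside a ThreadPoolExecutor worker the `except` branch triggers and evaluation proceeds, relying on SafeEvaluator's other limits.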
+ ## [2026-01-12] [Feature] [COMPLETED] System Error Field
+
+ **Problem:** The "Unable to answer" output was ambiguous - it was unclear whether it indicated a technical failure or an AI response. The user requested a simpler distinction: system error vs AI answer.
+
+ **Solution:** Changed to a boolean `system_error: yes/no` field:
+ - `system_error: yes` - Technical/system error from our code (don't submit)
+ - `system_error: no` - AI response (submit the answer, even if wrong)
+ - Added an `error_log` field with full error details for system errors
+
+ **Implementation:**
+ - `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
+ - Results table: "System Error" column (yes/no), "Error Log" column (when yes)
+ - JSON export: `system_error` field, `error_log` field (when system error)
+ - Submission logic: Only submit when `system_error == "no"`
+
+ **Modified Files:**
+ - **app.py** (~30 lines modified)
+   - `a_determine_status()`: Returns a tuple instead of a string
+   - `process_single_question()`: Uses the new format, adds `error_log`
+   - Results table: "System Error" + "Error Log" columns
+   - `export_results_to_json()`: Includes `system_error` and `error_log`
+
+ ---
+
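The `(is_error, error_log)` tuple described above can be sketched as follows. The error markers here are illustrative, not the actual classification rules in app.py.

```python
from typing import Optional, Tuple


def a_determine_status(answer: str) -> Tuple[bool, Optional[str]]:
    """Classify a result as system error (True) vs AI answer (False).

    A system error keeps the full text as error_log so the UI can show it;
    an AI answer (even a wrong one) gets (False, None) and is submitted.
    """
    markers = ("ERROR:", "Traceback (most recent call last)")  # illustrative
    for marker in markers:
        if marker in answer:
            return True, answer  # system error: preserve details as error_log
    return False, None
```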
+ ## [2026-01-12] [Refactoring] [COMPLETED] Fallback UI Removal
+
+ **Problem:** The fallback mechanism was archived in `src/agent/llm_client.py`, but its UI checkboxes remained in app.py.
+
+ **Solution:** Removed all fallback-related UI elements:
+ - Removed `enable_fallback_checkbox` from the Test Question tab
+ - Removed `eval_enable_fallback_checkbox` from the Full Evaluation tab
+ - Removed the `enable_fallback` parameter from `test_single_question()`
+ - Removed the `enable_fallback` parameter from `run_and_submit_all()`
+ - Removed the `ENABLE_LLM_FALLBACK` environment variable setting
+ - Simplified the provider info display (no longer shows "Fallback: Enabled/Disabled")
+
+ **Modified Files:**
+ - **app.py** (~20 lines removed)
+   - Test Question tab: Removed `enable_fallback_checkbox` (lines 664-668)
+   - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (lines 710-714)
+   - Updated `test_button.click()` inputs to remove the checkbox reference
+   - Updated `run_button.click()` inputs to remove the checkbox reference
+
+ ---
+
+ ## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
+
+ **Problem:** The fallback mechanism (`ENABLE_LLM_FALLBACK`) was creating double work:
+ - 4 providers to test for each feature
+ - Complex debugging with multiple code paths
+ - Longer, less clear error messages
+ - Added complexity without clear benefit
+
+ **Solution:** Archive the fallback mechanism and use a single provider only:
+ - Removed the fallback provider loop (Gemini → HF → Groq → Claude)
+ - Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
+ - If the provider fails, the error is raised immediately
+ - Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
+
+ **Benefits:**
+ - ✅ Reduced code complexity
+ - ✅ Faster debugging (one code path)
+ - ✅ Clearer error messages
+ - ✅ No double work on features
+
+ **Modified Files:**
+ - **src/agent/llm_client.py** (~25 lines removed)
+   - Simplified `_call_with_fallback()`: Removed fallback logic
+ - **dev/dev_260112_02_fallback_archived.md** (NEW)
+   - Archived fallback code documentation
+   - Migration guide for restoration if needed
+
+ ---
+
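The "error raised immediately" behavior can be illustrated with a stand-in for the real `llm_client.py` code; `provider_client` here is a hypothetical callable, not the actual client API.

```python
def call_llm(provider_client, prompt):
    """Single-provider call: no fallback chain, failures surface immediately.

    Contrast with the archived version, which looped over
    Gemini → HF → Groq → Claude before giving up.
    """
    try:
        return provider_client(prompt)
    except Exception as e:
        # Re-raise with context; no second provider is attempted.
        raise RuntimeError(
            f"LLM call failed ({getattr(provider_client, '__name__', 'provider')}): {e}"
        ) from e
```

One code path means one place to debug and one clear error message per failure, which is exactly the benefit listed above.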
+ ## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Search Results Not Being Extracted
+
+ **Problem:** The score dropped from 5% → 0% after the first evidence fix. Evidence was showing the dict's string representation: `{'results': [{'title': '...', ...}]}`
+
+ **Root Cause:** The first fix only handled dicts with an `"answer"` key (vision tools). Search tools return a different dict structure with a `"results"` key:
+ ```python
+ {"results": [...], "source": "tavily", "query": "...", "count": N}
+ ```
+
+ **Solution:** Handle both dict formats in evidence extraction:
+ ```python
+ if isinstance(result, dict):
+     if "answer" in result:
+         evidence.append(result["answer"])  # Vision tools
+     elif "results" in result:
+         # Format search results as readable text
+         results_list = result.get("results", [])
+         formatted = []
+         for r in results_list[:3]:
+             title = r.get("title", "")
+             url = r.get("url", "")
+             snippet = r.get("content") or r.get("snippet", "")  # key varies by provider
+             formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+         evidence.append("\n\n".join(formatted))  # Search tools
+ ```
+
+ **Modified Files:**
+ - **src/agent/graph.py** (~40 lines modified)
+   - Updated evidence extraction in the primary path
+   - Updated evidence extraction in the fallback path
+
+ **Test Result:** Evidence is now formatted correctly. Search quality is still variable (the LLM sometimes picks the wrong info).
+
+ **Summary of Fixes (Session 2026-01-12):**
+ 1. ✅ File download from HF dataset (5/5 files)
+ 2. ✅ Absolute paths from script location
+ 3. ✅ Evidence formatting for vision tools (dict → answer)
+ 4. ✅ Evidence formatting for search tools (dict → formatted text)
+
+ ---
+
+ ## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Dict Results Not Being Extracted
+
+ **Problem:** The chess vision question returned "Unable to answer" even though the vision tool correctly extracted the chess position.
+
+ **Root Cause:** The vision tool returns a dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to a string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
+
+ **Solution:** Extract the 'answer' field from dict results before adding to evidence:
+ ```python
+ # Before
+ evidence.append(f"[{tool_name}] {result}")  # Dict → string representation
+
+ # After
+ if isinstance(result, dict) and "answer" in result:
+     evidence.append(result["answer"])  # Extract answer field
+ elif isinstance(result, str):
+     evidence.append(result)
+ ```
+
+ **Modified Files:**
+ - **src/agent/graph.py** (~15 lines modified)
+   - Updated `execute_node()`: Extract 'answer' from dict results
+   - Fixed both primary and fallback execution paths
+
+ **Test Result:** Simple search questions now work. The chess question still fails because the vision tool extracts the wrong turn indicator (w instead of b).
+
+ **Known Issue:** The vision tool extracts "w - - 0 1" (White's turn) but the question asks for Black's move. The ground truth is "Rd5" (a Black move), so the FEN extraction may have an error.
+
+ ---
+
+ ## [2026-01-12] [File Download Fix] [COMPLETED] Absolute Path Fix - Vision Tool Now Works
+
+ **Problem:** The chess vision question returned "Unable to answer" even though the file was downloaded successfully.
+
+ **Root Cause:** `download_task_file()` returned a relative path (`_cache/gaia_files/xxx.png`). During Gradio execution the working directory may differ, causing the `Path(image_path).exists()` check in the vision tool to fail.
+
+ **Solution:** Return absolute paths from `download_task_file()`
+ - Changed: `target_path = os.path.join(save_dir, file_name)`
+ - To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
+ - Tools can now find files regardless of the working directory
+
+ **Modified Files:**
+ - **app.py** (~3 lines modified)
+   - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
+
+ **Test Result:** The vision tool now works with the absolute path - it correctly analyzes the chess position
+
+ ---
+
+ ## [2026-01-12] [File Download Fix] [COMPLETED] GAIA File API Dead End - Switch to HF Dataset
+
+ **Problem:** Attempted to use the evaluation API's `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because the files are not hosted on the evaluation server.
+
+ **Investigation:**
+ - Checked the API spec: Endpoint exists with proper documentation
+ - Tested download: HTTP 404 "No file path associated with task_id"
+ - Verified the HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
+ - Confirmed via Swagger UI: Same 404 error
+
+ **Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host the actual files. Files are hosted separately in the GAIA dataset.
+
+ **Solution:** Switch from the evaluation API to GAIA dataset download
+ - Use `huggingface_hub.hf_hub_download()` to fetch files
+ - Download to `_cache/gaia_files/` (runtime cache)
+ - File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
+ - Added cache checking (reuse downloaded files)
+
+ **Files with attachments (5/20 questions):**
+ - `cca530fc`: Chess position image (.png)
+ - `99c9cc74`: Pie recipe audio (.mp3)
+ - `f918266a`: Python code (.py)
+ - `1f975693`: Calculus audio (.mp3)
+ - `7bd855d8`: Menu sales Excel (.xlsx)
+
+ **Modified Files:**
+ - **app.py** (~70 lines modified)
+   - Updated `download_task_file()`: Changed from the evaluation API to HF dataset download
+   - Changed signature: `download_task_file(task_id, file_name, save_dir)`
+   - Added `huggingface_hub` import with cache checking
+   - Default directory: `_cache/gaia_files/` (runtime cache, not in git)
+   - Flat file structure: `_cache/gaia_files/{file_name}`
+ - **app.py** (~5 lines modified)
+   - Updated `process_single_question()`: Pass `file_name` to the download function
+
+ **Known Limitations:**
+ - The current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
+ - `.mp3` audio files are still unsupported
+ - `.py` code execution is still unsupported
+
+ **Next Steps:**
+ 1. Test the new download implementation
+ 2. Expand tool support for .mp3 (audio transcription)
+ 3. Expand tool support for .py (code execution)
+
+ ---
+
 ## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA
 
 **Problem:** Need to validate HF vision works before complex GAIA evaluation.
PLAN.md CHANGED
@@ -531,3 +531,168 @@ If Phase 0 reveals HF Inference API doesn't support vision:
 2. Test simple vision API call with Phi-3.5-vision-instruct
 3. Document working pattern or confirm API doesn't support vision
 4. Decision gate: GO to Phase 1 or pivot to backup options
+
+ ---
+
+ ## Phase 7: GAIA File Attachment Support
+
+ **Goal:** Enable the agent to download and process file attachments from GAIA questions
+
+ **Problem:**
+ - Current code ignores the `file_name` field in GAIA questions
+ - Files are not downloaded from the `GET /files/{task_id}` endpoint
+ - Vision/file parsing tools fail with the placeholder `<provided_image_path>`
+ - ~40% of questions (8/20) fail due to missing file handling
+
+ ### Root Cause
+
+ **GAIA Question Structure:**
+ ```json
+ {
+     "task_id": "abc123",
+     "question": "What's in this image?",
+     "file_name": "chess.png",    // NULL if no file
+     "file_path": "/files/abc123" // NULL if no file
+ }
+ ```
+
+ **Current Code (app.py:249-290):**
+ ```python
+ def process_single_question(agent, item, index, total):
+     task_id = item.get("task_id")
+     question_text = item.get("question")
+     # ❌ MISSING: Check file_name
+     # ❌ MISSING: Download file
+     # ❌ MISSING: Pass file_path to agent
+
+     submitted_answer = agent(question_text)  # No file handling
+ ```
+
+ **Result:** The LLM generates `vision(image_path="<provided_image_path>")` → File not found error
+
+ ### Solution Architecture
+
+ **Step 1: Add File Download Function**
+
+ ```python
+ def download_task_file(task_id: str, save_dir: str = "input/") -> Optional[str]:
+     """Download file attached to a GAIA question.
+
+     Args:
+         task_id: Question's task_id
+         save_dir: Directory to save file
+
+     Returns:
+         File path if downloaded, None if no file
+     """
+     api_url = "https://agents-course-unit4-scoring.hf.space"
+     file_url = f"{api_url}/files/{task_id}"
+
+     response = requests.get(file_url, timeout=30)
+     response.raise_for_status()
+
+     # Get extension from Content-Type header
+     content_type = response.headers.get('Content-Type', '')
+     extension_map = {
+         'image/png': '.png',
+         'image/jpeg': '.jpg',
+         'application/pdf': '.pdf',
+         'text/csv': '.csv',
+         'application/json': '.json',
+         'application/vnd.ms-excel': '.xls',
+         'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': '.xlsx',
+     }
+     extension = extension_map.get(content_type, '.file')
+
+     # Save file
+     Path(save_dir).mkdir(exist_ok=True)
+     file_path = f"{save_dir}{task_id}{extension}"
+     with open(file_path, 'wb') as f:
+         f.write(response.content)
+
+     return file_path
+ ```
+
+ **Step 2: Modify Question Processing**
+
+ ```python
+ def process_single_question(agent, item, index, total):
+     task_id = item.get("task_id")
+     question_text = item.get("question")
+     file_name = item.get("file_name")  # ✅ NEW
+
+     # Download file if it exists
+     file_path = None
+     if file_name:
+         file_path = download_task_file(task_id)
+
+     # Pass file info to agent
+     submitted_answer = agent(question_text, file_path=file_path)  # ✅ NEW
+ ```
+
+ **Step 3: Update LLM Context**
+
+ When file_path is provided, include it in the planning prompt:
+ ```python
+ if file_path:
+     question_context = f"Question: {question_text}\nAttached file: {file_path}"
+ else:
+     question_context = question_text
+ ```
+
+ ### Implementation Steps
+
+ #### Step 7.1: Add File Download Function
+
+ - [ ] Create `download_task_file()` in app.py
+ - [ ] Handle Content-Type to extension mapping
+ - [ ] Handle 404 gracefully (no file for this task)
+ - [ ] Create the input/ directory if it does not exist
+
+ #### Step 7.2: Modify Question Processing Loop
+
+ - [ ] Check `item.get("file_name")` in process_single_question
+ - [ ] Call download_task_file() if file_name exists
+ - [ ] Pass file_path to the agent invocation
+
+ #### Step 7.3: Update Agent to Handle file_path
+
+ - [ ] Modify the agent to accept an optional file_path parameter
+ - [ ] Include file info in the planning prompt
+ - [ ] Update tool selection to use real file paths
+
+ #### Step 7.4: Test File Handling
+
+ - [ ] Test with an image question (chess position)
+ - [ ] Test with a document question (Excel file)
+ - [ ] Verify no more `<provided_image_path>` errors
+
+ ### Files to Modify
+
+ 1. **app.py** (~80 lines added/modified)
+    - Add the download_task_file() function
+    - Modify process_single_question() to handle files
+    - Add input/ directory creation
+
+ 2. **src/agent/graph.py** (~20 lines)
+    - Update agent state to include file_path
+    - Pass file info to the planning prompt
+
+ 3. **.gitignore** (~2 lines)
+    - Add input/ to ignore downloaded files
+
+ ### Success Criteria
+
+ - [ ] Image questions: The vision tool receives a real file path
+ - [ ] Document questions: The parse_file tool receives a real file path
+ - [ ] No more `<provided_image_path>` errors
+ - [ ] Accuracy improves from 10% toward 30%+
+
+ ### Expected Impact
+
+ | Before | After |
+ |--------|-------|
+ | 40% (8/20) fail with file errors | 0% file errors |
+ | Vision questions: All fail | Vision questions: Can work |
+ | Document questions: All fail | Document questions: Can work |
+ | Max accuracy: ~60% | Max accuracy: ~100% potential |
README.md CHANGED
@@ -348,6 +348,8 @@ ENABLE_LLM_FALLBACK=false # Disable fallback for debugging single provider
 
 **Test Coverage:** 99 passing tests (~2min 40sec runtime)
 
+ > **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See the [GAIA Submission Guide](docs/gaia_submission_guide.md) for the distinction between the Course and Official GAIA leaderboards.
+
 ## Workflow
 
 ### Dev Record Workflow
app.py CHANGED
@@ -1,17 +1,18 @@
 import os
 import gradio as gr
 import requests
- import inspect
 import pandas as pd
 import logging
 import json
 import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
 
 # Stage 1: Import GAIAAgent (LangGraph-based agent)
 from src.agent import GAIAAgent
 
 # Import ground truth comparison
 from src.utils.ground_truth import get_ground_truth
 
 # Configure logging
@@ -99,9 +100,14 @@ def export_results_to_json(
     result_dict = {
         "task_id": result.get("Task ID", "N/A"),
         "question": result.get("Question", "N/A"),
         "submitted_answer": result.get("Submitted Answer", "N/A"),
     }
 
     # Add correctness if available
     if result.get("Correct?"):
         result_dict["correct"] = (
@@ -201,7 +207,81 @@ def format_diagnostics(final_state: dict) -> str:
     return "\n".join(diagnostics)
 
 
- def test_single_question(question: str, llm_provider: str, enable_fallback: bool):
     """Test agent with a single question and return diagnostics."""
     if not question or not question.strip():
         return "Please enter a question.", "", check_api_keys()
@@ -209,11 +289,8 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
     try:
         # Set LLM provider from UI selection (overrides .env)
         os.environ["LLM_PROVIDER"] = llm_provider.lower()
-         os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
 
-         logger.info(
-             f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
-         )
 
         # Initialize agent
         agent = GAIAAgent()
@@ -225,7 +302,7 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
         final_state = agent.last_state or {}
 
         # Format diagnostics with LLM provider info
-         provider_info = f"**LLM Provider:** {llm_provider} (Fallback: {'Enabled' if enable_fallback else 'Disabled'})\n\n"
         diagnostics = provider_info + format_diagnostics(final_state)
         api_status = check_api_keys()
 
@@ -246,12 +323,34 @@
 # Stage 6: Async processing with ThreadPoolExecutor
 
 
 def process_single_question(agent, item, index, total):
     """Process single question with agent, return result with error handling.
 
     Args:
         agent: GAIAAgent instance
-         item: Question item dict with task_id and question
         index: Question index (0-based)
         total: Total number of questions
 
@@ -260,40 +359,64 @@ def process_single_question(agent, item, index, total):
     """
     task_id = item.get("task_id")
     question_text = item.get("question")
 
     if not task_id or question_text is None:
         return {
             "task_id": task_id,
             "question": question_text,
-             "answer": "ERROR: Missing task_id or question",
             "error": True,
         }
 
     try:
         logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
-         submitted_answer = agent(question_text)
         logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")
 
         return {
             "task_id": task_id,
             "question": question_text,
             "answer": submitted_answer,
             "error": False,
         }
     except Exception as e:
         logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
         return {
             "task_id": task_id,
             "question": question_text,
-             "answer": f"ERROR: {str(e)}",
             "error": True,
         }
 
 
 def run_and_submit_all(
     llm_provider: str,
-     enable_fallback: bool,
     question_limit: int = 0,
     profile: gr.OAuthProfile | None = None,
 ):
     """
@@ -302,8 +425,8 @@ def run_and_submit_all(
 
     Args:
         llm_provider: LLM provider to use
-         enable_fallback: Whether to enable fallback to other providers
        question_limit: Limit number of questions (0 = process all)
        profile: OAuth profile for HF login
    """
    # Start execution timer
@@ -325,10 +448,7 @@
 
    # Set LLM provider from UI selection (overrides .env)
    os.environ["LLM_PROVIDER"] = llm_provider.lower()
-     os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
-     logger.info(
-         f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
-     )
 
    # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
    try:
@@ -366,6 +486,27 @@
            f"DEBUG MODE: Processing only {limit} questions (set to 0 to process all)"
        )
 
        print(f"Processing {len(questions_data)} questions.")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching questions: {e}")
@@ -421,9 +562,16 @@
        result_entry = {
            "Task ID": result["task_id"],
            "Question": result["question"],
-             "Submitted Answer": result["answer"],
        }
 
        # Add ground truth data if available
        if is_correct is not None:
            result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"
@@ -433,8 +581,8 @@
 
        results_log.append(result_entry)
 
-         # Add to submission payload if no error
-         if not result["error"]:
        answers_payload.append(
            {"task_id": result["task_id"], "submitted_answer": result["answer"]}
        )
@@ -575,11 +723,6 @@ with gr.Blocks() as demo:
            value="HuggingFace",
            info="Select which LLM to use for this test",
        )
-         enable_fallback_checkbox = gr.Checkbox(
-             label="Enable Fallback",
-             value=False,
-             info="If enabled, falls back to other providers on failure",
-         )
 
        test_button = gr.Button("Run Test", variant="primary")
 
@@ -601,7 +744,6 @@
        inputs=[
            test_question_input,
            llm_provider_dropdown,
-             enable_fallback_checkbox,
        ],
        outputs=[test_answer_output, test_diagnostics_output, test_api_status],
    )
@@ -632,11 +774,6 @@ with gr.Blocks() as demo:
            value="HuggingFace",
            info="Select which LLM to use for all questions",
        )
-         eval_enable_fallback_checkbox = gr.Checkbox(
-             label="Enable Fallback",
-             value=True,
-             info="Recommended: Enable fallback for production evaluation",
-         )
        eval_question_limit = gr.Number(
            label="Question Limit (Debug)",
            value=0,
@@ -646,6 +783,15 @@
            info="Limit questions for testing (0 = process all)",
        )
 
        run_button = gr.Button("Run Evaluation & Submit All Answers")
 
        status_output = gr.Textbox(
@@ -660,8 +806,8 @@
        fn=run_and_submit_all,
        inputs=[
            eval_llm_provider_dropdown,
-             eval_enable_fallback_checkbox,
            eval_question_limit,
        ],
        outputs=[status_output, results_table, export_output],
    )
 
1
  import os
2
  import gradio as gr
3
  import requests
 
4
  import pandas as pd
5
  import logging
6
  import json
7
  import time
8
+ from pathlib import Path
9
  from concurrent.futures import ThreadPoolExecutor, as_completed
10
 
11
  # Stage 1: Import GAIAAgent (LangGraph-based agent)
12
  from src.agent import GAIAAgent
13
 
14
  # Import ground truth comparison
15
+
16
  from src.utils.ground_truth import get_ground_truth
17
 
18
  # Configure logging
 
100
  result_dict = {
101
  "task_id": result.get("Task ID", "N/A"),
102
  "question": result.get("Question", "N/A"),
103
+ "system_error": result.get("System Error", "no"),
104
  "submitted_answer": result.get("Submitted Answer", "N/A"),
105
  }
106
 
107
+ # Add error log if system error
108
+ if result.get("System Error") == "yes" and result.get("Error Log"):
109
+ result_dict["error_log"] = result.get("Error Log")
110
+
111
  # Add correctness if available
112
  if result.get("Correct?"):
113
  result_dict["correct"] = (
 
  return "\n".join(diagnostics)


+ def download_task_file(
+     task_id: str, file_name: str, save_dir: str = "_cache/gaia_files/"
+ ):
+     """Download file attached to a GAIA question from the GAIA dataset on HuggingFace.
+
+     The evaluation API's /files/{task_id} endpoint returns 404 because files are not
+     hosted there. Files must be downloaded from the official GAIA dataset instead.
+
+     Files are cached in the _cache/ directory (runtime cache, not in git).
+
+     Args:
+         task_id: Question's task_id (used for logging)
+         file_name: Original file name from API (e.g., "task_id.png")
+         save_dir: Directory to save file (created if it does not exist)
+
+     Returns:
+         File path if downloaded successfully, None if download failed
+     """
+     import shutil
+     from huggingface_hub import hf_hub_download
+     import tempfile
+
+     # GAIA dataset file structure: 2023/validation/{task_id}.{ext}
+     # Extract file extension from file_name
+     _, ext = os.path.splitext(file_name)
+     ext = ext.lower()
+
+     # Try validation set first (most questions are from validation)
+     repo_id = "gaia-benchmark/GAIA"
+     possible_paths = [
+         f"2023/validation/{task_id}{ext}",
+         f"2023/test/{task_id}{ext}",
+     ]
+
+     # Create save directory if it does not exist (relative to script location)
+     # Use the script's directory as base so paths work in all environments (local, HF Space)
+     script_dir = Path(__file__).parent.absolute()
+     cache_dir = script_dir / save_dir
+     cache_dir.mkdir(exist_ok=True, parents=True)
+     target_path = str(cache_dir / file_name)
+
+     # Check if file already exists in cache (use absolute path for check)
+     if os.path.exists(target_path):
+         logger.info(f"Using cached file for {task_id}: {target_path}")
+         return target_path
+
+     # Try each possible path
+     for dataset_path in possible_paths:
+         try:
+             logger.info(f"Attempting to download {dataset_path} from GAIA dataset...")
+
+             # Download to temp dir first to get the file
+             with tempfile.TemporaryDirectory() as temp_dir:
+                 downloaded_path = hf_hub_download(
+                     repo_id=repo_id,
+                     filename=dataset_path,
+                     repo_type="dataset",
+                     local_dir=temp_dir,
+                 )
+
+                 # Copy file to target location (flat structure in cache)
+                 shutil.copy(downloaded_path, target_path)
+
+             logger.info(f"Downloaded file for {task_id}: {target_path}")
+             return target_path
+
+         except Exception as e:
+             logger.debug(f"Path {dataset_path} not found: {e}")
+             continue
+
+     logger.warning(f"File not found in GAIA dataset for task {task_id}")
+     return None
+
+ def test_single_question(question: str, llm_provider: str):
      """Test agent with a single question and return diagnostics."""
      if not question or not question.strip():
          return "Please enter a question.", "", check_api_keys()

      try:
          # Set LLM provider from UI selection (overrides .env)
          os.environ["LLM_PROVIDER"] = llm_provider.lower()

+         logger.info(f"UI Config: LLM_PROVIDER={llm_provider}")

          # Initialize agent
          agent = GAIAAgent()

      final_state = agent.last_state or {}

      # Format diagnostics with LLM provider info
+     provider_info = f"**LLM Provider:** {llm_provider}\n\n"
      diagnostics = provider_info + format_diagnostics(final_state)
      api_status = check_api_keys()
 
 
  # Stage 6: Async processing with ThreadPoolExecutor


+ def a_determine_status(answer: str) -> tuple[bool, str | None]:
+     """Determine if response is system error or AI answer.
+
+     Returns:
+         (is_system_error, error_log)
+         - is_system_error: True if system error, False if AI answer
+         - error_log: Full error message if system error, None otherwise
+     """
+     if not answer:
+         return True, "Empty answer"
+
+     answer_lower = answer.lower().strip()
+
+     # System/technical errors from our code
+     if answer_lower.startswith("error:") or "no evidence collected" in answer_lower:
+         return True, answer  # Full error message as log
+
+     # Everything else is an AI response (including "Unable to answer")
+     return False, None
+
  def process_single_question(agent, item, index, total):
      """Process single question with agent, return result with error handling.
+     Supports file attachments - downloads files before processing.

      Args:
          agent: GAIAAgent instance
+         item: Question item dict with task_id, question, and optional file_name
          index: Question index (0-based)
          total: Total number of questions

      """
      task_id = item.get("task_id")
      question_text = item.get("question")
+     file_name = item.get("file_name")

      if not task_id or question_text is None:
+         answer = "ERROR: Missing task_id or question"
+         is_error, error_log = a_determine_status(answer)
          return {
              "task_id": task_id,
              "question": question_text,
+             "answer": answer,
+             "system_error": "yes" if is_error else "no",
+             "error_log": error_log,
              "error": True,
          }

+     # Download file if question has attachment
+     file_path = None
+     if file_name:
+         file_path = download_task_file(task_id, file_name)
+         if file_path:
+             logger.info(f"[{index + 1}/{total}] File downloaded: {file_path}")
+         else:
+             logger.warning(f"[{index + 1}/{total}] File expected but not downloaded")
+
      try:
          logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
+
+         # Pass file_path to agent if available
+         submitted_answer = agent(question_text, file_path=file_path)
+
          logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")

+         is_error, error_log = a_determine_status(submitted_answer)
          return {
              "task_id": task_id,
              "question": question_text,
              "answer": submitted_answer,
+             "system_error": "yes" if is_error else "no",
+             "error_log": error_log,
              "error": False,
          }
      except Exception as e:
          logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
+         answer = f"ERROR: {str(e)}"
+         is_error, error_log = a_determine_status(answer)
          return {
              "task_id": task_id,
              "question": question_text,
+             "answer": answer,
+             "system_error": "yes" if is_error else "no",
+             "error_log": error_log,
              "error": True,
          }

 
416
  def run_and_submit_all(
417
  llm_provider: str,
 
418
  question_limit: int = 0,
419
+ task_ids: str = "",
420
  profile: gr.OAuthProfile | None = None,
421
  ):
422
  """
 
425
 
426
  Args:
427
  llm_provider: LLM provider to use
 
428
  question_limit: Limit number of questions (0 = process all)
429
+ task_ids: Comma-separated task IDs to target (overrides question_limit)
430
  profile: OAuth profile for HF login
431
  """
432
  # Start execution timer
 
448
 
449
  # Set LLM provider from UI selection (overrides .env)
450
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
451
+ logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")
 
 
 
452
 
453
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
454
  try:
 
          f"DEBUG MODE: Processing only {limit} questions (set to 0 to process all)"
      )

+     # Filter by specific task IDs if provided (overrides question limit)
+     if task_ids and task_ids.strip():
+         target_ids = [tid.strip() for tid in task_ids.split(",")]
+         original_count = len(questions_data)
+         questions_data = [
+             q for q in questions_data if q.get("task_id") in target_ids
+         ]
+         found_ids = [q.get("task_id") for q in questions_data]
+         missing_ids = set(target_ids) - set(found_ids)
+
+         if missing_ids:
+             logger.warning(f"Task IDs not found: {missing_ids}")
+
+         logger.warning(
+             f"DEBUG MODE: Targeted {len(questions_data)}/{original_count} questions by task_id"
+         )
+         print(
+             f"DEBUG MODE: Processing {len(questions_data)} targeted questions "
+             f"({len(missing_ids)} IDs not found: {missing_ids})"
+         )
+
      print(f"Processing {len(questions_data)} questions.")
  except requests.exceptions.RequestException as e:
      print(f"Error fetching questions: {e}")
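The targeting logic in that hunk can be factored into a pure function for quick testing; a sketch (the helper name `filter_by_task_ids` is illustrative, not part of app.py):

```python
def filter_by_task_ids(questions: list[dict], task_ids: str) -> tuple[list[dict], set[str]]:
    """Keep only questions whose task_id is in the comma-separated list; report missing IDs."""
    target = [t.strip() for t in task_ids.split(",") if t.strip()]
    kept = [q for q in questions if q.get("task_id") in target]
    missing = set(target) - {q.get("task_id") for q in kept}
    return kept, missing

questions = [{"task_id": "a1"}, {"task_id": "b2"}, {"task_id": "c3"}]
kept, missing = filter_by_task_ids(questions, "a1, c3, zz9")
assert [q["task_id"] for q in kept] == ["a1", "c3"]
assert missing == {"zz9"}  # typos in the textbox surface as warnings, not silent no-ops
```

Reporting the missing IDs matters in the UI: a mistyped task ID would otherwise look like an empty evaluation run.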
 
      result_entry = {
          "Task ID": result["task_id"],
          "Question": result["question"],
+         "System Error": result.get("system_error", "no"),
+         "Submitted Answer": ""
+         if result.get("system_error") == "yes"
+         else result["answer"],
      }

+     # Add error log if system error
+     if result.get("system_error") == "yes" and result.get("error_log"):
+         result_entry["Error Log"] = result["error_log"]
+
      # Add ground truth data if available
      if is_correct is not None:
          result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"

      results_log.append(result_entry)

+     # Add to submission payload if no system error
+     if result.get("system_error") == "no":
          answers_payload.append(
              {"task_id": result["task_id"], "submitted_answer": result["answer"]}
          )
 
          value="HuggingFace",
          info="Select which LLM to use for this test",
      )

      test_button = gr.Button("Run Test", variant="primary")

      inputs=[
          test_question_input,
          llm_provider_dropdown,
      ],
      outputs=[test_answer_output, test_diagnostics_output, test_api_status],
  )
 
          value="HuggingFace",
          info="Select which LLM to use for all questions",
      )

      eval_question_limit = gr.Number(
          label="Question Limit (Debug)",
          value=0,

          info="Limit questions for testing (0 = process all)",
      )

+     with gr.Row():
+         eval_task_ids = gr.Textbox(
+             label="Target Task IDs (Debug)",
+             value="",
+             placeholder="task_id1, task_id2, ...",
+             info="Comma-separated task IDs to run (overrides question limit)",
+             lines=1,
+         )
+
      run_button = gr.Button("Run Evaluation & Submit All Answers")

      status_output = gr.Textbox(

      fn=run_and_submit_all,
      inputs=[
          eval_llm_provider_dropdown,
          eval_question_limit,
+         eval_task_ids,
      ],
      outputs=[status_output, results_table, export_output],
  )
docs/gaia_submission_guide.md ADDED
@@ -0,0 +1,120 @@
+ # GAIA Submission Guide
+
+ ## Two Different Leaderboards
+
+ ### 1. Course Leaderboard (CURRENT - Course Assignment)
+
+ **API Endpoint:** `https://agents-course-unit4-scoring.hf.space`
+
+ **Purpose:** Hugging Face Agents Course Unit 4 assignment
+
+ **Dataset:** 20 questions from GAIA validation set (level 1), filtered by tools/steps complexity
+
+ **Target Score:** 30% = **6/20 correct**
+
+ **API Routes:**
+ - `GET /questions` - Retrieve full list of evaluation questions
+ - `GET /random-question` - Fetch single random question
+ - `GET /files/{task_id}` - Download file associated with task
+ - `POST /submit` - Submit answers for scoring
+
+ **Submission Format:**
+ ```json
+ {
+   "username": "your-hf-username",
+   "agent_code": "https://huggingface.co/spaces/your-username/your-space/tree/main",
+   "answers": [
+     {"task_id": "...", "submitted_answer": "..."}
+   ]
+ }
+ ```
+
+ **Scoring:** EXACT MATCH with ground truth
33
+ - Answer should be plain text, NO "FINAL ANSWER:" prefix
34
+ - Answer should be precise and well-formatted
35
+
36
+ **Debugging Features (Course-Specific):**
37
+ - ✅ "Target Task IDs" - Run specific questions for debugging
38
+ - ✅ "Question Limit" - Run first N questions for testing
39
+ - ✅ Course API is forgiving for development iteration
40
+
41
+ **Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/gaia-leaderboard
42
+
43
+ ---
44
+
45
+ ### 2. Official GAIA Leaderboard (FUTURE - Not Yet Implemented)
46
+
47
+ **Space:** https://huggingface.co/spaces/gaia-benchmark/leaderboard
48
+
49
+ **Purpose:** Official GAIA benchmark for AI research community
50
+
51
+ **Dataset:** Full GAIA benchmark (450+ questions across 3 levels)
52
+
53
+ **Submission Format:** File upload (JSON) with model metadata
54
+ - Model name, family, parameters
55
+ - Complete answers for ALL questions
56
+ - Different evaluation process
57
+
58
+ **Status:** ⚠️ **FUTURE DEVELOPMENT** - Not implemented in this template
59
+
60
+ **Differences from Course:**
61
+ | Aspect | Course | Official GAIA |
62
+ |--------|--------|--------------|
63
+ | Dataset Size | 20 questions | 450+ questions |
64
+ | Submission Method | API POST | File upload |
65
+ | Question Filtering | Allowed for debugging | Must submit ALL |
66
+ | Scoring | Exact match | TBC (likely more flexible) |
67
+
68
+ **Documentation:** https://huggingface.co/datasets/gaia-benchmark/GAIA
69
+
70
+ ---
71
+
72
+ ## Implementation Notes
73
+
74
+ ### Current Implementation Status
75
+
76
+ **✅ Implemented:**
77
+ - Course API integration (`/questions`, `/submit`, `/files/{task_id}`)
78
+ - Agent execution with LangGraph StateGraph
79
+ - OAuth login integration
80
+ - Debug features (Target Task IDs, Question Limit)
81
+ - Results export (JSON format)
82
+
83
+ **⚠️ Course Constraints:**
84
+ - Only 20 level 1 questions
85
+ - Exact match scoring (strict)
86
+ - Agent code must be public
87
+
88
+ **🔮 Future Work (Official GAIA):**
89
+ - File-based submission format
90
+ - Full 450+ question support
91
+ - Leaderboard-specific metadata
92
+ - Official evaluation pipeline
93
+
94
+ ---
95
+
96
+ ## Development Workflow
97
+
98
+ ### For Course Assignment:
99
+
100
+ 1. **Develop:** Use "Target Task IDs" to test specific questions
101
+ 2. **Debug:** Use "Question Limit" for quick iteration
102
+ 3. **Test:** Run full evaluation on all 20 questions
103
+ 4. **Submit:** Course API evaluates exact match score
104
+ 5. **Iterate:** Improve prompts, tools, reasoning
105
+
106
+ ### For Official GAIA (Future):
107
+
108
+ 1. **Generate:** Create submission JSON with all 450+ answers
109
+ 2. **Format:** Follow official GAIA format requirements
110
+ 3. **Upload:** Submit via gaia-benchmark/leaderboard Space
111
+ 4. **Evaluate:** Official benchmark evaluation
112
+
113
+ ---
114
+
115
+ ## References
116
+
117
+ - **Course Documentation:** https://huggingface.co/learn/agents-course/en/unit4/hands-on
118
+ - **Course Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/gaia-leaderboard
119
+ - **Official GAIA Dataset:** https://huggingface.co/datasets/gaia-benchmark/GAIA
120
+ - **Official GAIA Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/leaderboard
src/agent/graph.py CHANGED
@@ -15,6 +15,7 @@ Based on:
  import logging
  import os
  from typing import TypedDict, List, Optional
  from langgraph.graph import StateGraph, END
  from src.config import Settings
@@ -100,9 +101,12 @@ def validate_environment() -> List[str]:
  # ============================================================================


- def fallback_tool_selection(question: str, plan: str) -> List[dict]:
      """
      MVP Fallback: Simple keyword-based tool selection when LLM fails.

      This is a temporary hack to get basic functionality working.
      Uses simple keyword matching to select tools.
@@ -110,6 +114,7 @@ def fallback_tool_selection(question: str, plan: str) -> List[dict]:
      Args:
          question: The user question
          plan: The execution plan

      Returns:
          List of tool calls with basic parameters
@@ -147,17 +152,37 @@ def fallback_tool_selection(question: str, plan: str) -> List[dict]:
          })
          logger.info(f"[fallback_tool_selection] Added calculator tool with expression: {expression}")

-     # File tool: keywords like "file", "parse", "read", "csv", "json", "txt"
-     file_keywords = ["file", "parse", "read", "csv", "json", "txt", "document"]
-     if any(keyword in combined for keyword in file_keywords):
-         # Cannot extract filename without more info, skip for now
-         logger.warning("[fallback_tool_selection] File operation detected but cannot extract filename")

      # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
      image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
      if any(keyword in combined for keyword in image_keywords):
-         # Cannot extract image path without more info, skip for now
-         logger.warning("[fallback_tool_selection] Image operation detected but cannot extract image path")

      if not tool_calls:
          logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")
@@ -256,7 +281,10 @@ def execute_node(state: AgentState) -> AgentState:
      # Stage 3: Use LLM function calling to select tools and extract parameters
      logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
      tool_calls = select_tools_with_function_calling(
-         question=state["question"], plan=state["plan"], available_tools=TOOLS
      )

      # Validate tool_calls result
@@ -264,13 +292,17 @@ def execute_node(state: AgentState) -> AgentState:
          logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
          state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
          # MVP HACK: Use fallback keyword-based tool selection
-         tool_calls = fallback_tool_selection(state["question"], state["plan"])
          logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
      elif not isinstance(tool_calls, list):
          logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
          state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
          # MVP HACK: Use fallback
-         tool_calls = fallback_tool_selection(state["question"], state["plan"])
      else:
          logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
          logger.debug(f"[execute_node] Tool calls: {tool_calls}")
@@ -305,8 +337,32 @@ def execute_node(state: AgentState) -> AgentState:
          }
      )

-     # Extract evidence
-     evidence.append(f"[{tool_name}] {result}")

      except Exception as tool_error:
          logger.error(f"[execute_node] ✗ Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)
@@ -342,7 +398,9 @@ def execute_node(state: AgentState) -> AgentState:
      if not tool_calls:
          logger.info(f"[execute_node] Attempting fallback after exception...")
          try:
-             tool_calls = fallback_tool_selection(state["question"], state.get("plan", ""))
              logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")

              # Try to execute fallback tools
@@ -367,7 +425,28 @@ def execute_node(state: AgentState) -> AgentState:
                  "result": result,
                  "status": "success"
              })
-             evidence.append(f"[{tool_name}] {result}")
              logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
          except Exception as tool_error:
              logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")
@@ -504,22 +583,26 @@ class GAIAAgent:
      self.last_state = None  # Store last execution state for diagnostics
      print("GAIAAgent initialized successfully")

- def __call__(self, question: str) -> str:
      """
      Process question and return answer.

      Args:
          question: GAIA question text

      Returns:
          Factoid answer string
      """
      print(f"GAIAAgent processing question (first 50 chars): {question[:50]}...")

      # Initialize state
      initial_state: AgentState = {
          "question": question,
-         "file_paths": None,
          "plan": None,
          "tool_calls": [],
          "tool_results": [],
 
  import logging
  import os
+ from pathlib import Path
  from typing import TypedDict, List, Optional
  from langgraph.graph import StateGraph, END
  from src.config import Settings

  # ============================================================================


+ def fallback_tool_selection(
+     question: str, plan: str, file_paths: Optional[List[str]] = None
+ ) -> List[dict]:
      """
      MVP Fallback: Simple keyword-based tool selection when LLM fails.
+     Enhanced to use actual file paths when available.

      This is a temporary hack to get basic functionality working.
      Uses simple keyword matching to select tools.

      Args:
          question: The user question
          plan: The execution plan
+         file_paths: Optional list of downloaded file paths

      Returns:
          List of tool calls with basic parameters

          })
          logger.info(f"[fallback_tool_selection] Added calculator tool with expression: {expression}")

+     # File tool: if file_paths available, use them
+     if file_paths:
+         for file_path in file_paths:
+             # Determine file type and appropriate tool
+             file_ext = Path(file_path).suffix.lower()
+             if file_ext in ['.png', '.jpg', '.jpeg']:
+                 tool_calls.append({
+                     "tool": "vision",
+                     "params": {"image_path": file_path}
+                 })
+                 logger.info(f"[fallback_tool_selection] Added vision tool for image: {file_path}")
+             elif file_ext in ['.pdf', '.xlsx', '.xls', '.csv', '.json', '.txt', '.docx', '.doc']:
+                 tool_calls.append({
+                     "tool": "parse_file",
+                     "params": {"file_path": file_path}
+                 })
+                 logger.info(f"[fallback_tool_selection] Added parse_file tool for: {file_path}")
+     else:
+         # Keyword-based file detection (legacy)
+         file_keywords = ["file", "parse", "read", "csv", "json", "txt", "document"]
+         if any(keyword in combined for keyword in file_keywords):
+             logger.warning("[fallback_tool_selection] File operation detected but no file_paths available")

      # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
      image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
      if any(keyword in combined for keyword in image_keywords):
+         if file_paths:
+             # Already handled above in file_paths check
+             pass
+         else:
+             logger.warning("[fallback_tool_selection] Image operation detected but no file_paths available")

      if not tool_calls:
          logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")

      # Stage 3: Use LLM function calling to select tools and extract parameters
      logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
      tool_calls = select_tools_with_function_calling(
+         question=state["question"],
+         plan=state["plan"],
+         available_tools=TOOLS,
+         file_paths=state.get("file_paths"),
      )

      # Validate tool_calls result

          logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
          state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
          # MVP HACK: Use fallback keyword-based tool selection
+         tool_calls = fallback_tool_selection(
+             state["question"], state["plan"], state.get("file_paths")
+         )
          logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
      elif not isinstance(tool_calls, list):
          logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
          state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
          # MVP HACK: Use fallback
+         tool_calls = fallback_tool_selection(
+             state["question"], state["plan"], state.get("file_paths")
+         )
      else:
          logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
          logger.debug(f"[execute_node] Tool calls: {tool_calls}")

          }
      )

+     # Extract evidence - handle different result formats
+     if isinstance(result, dict):
+         # Vision tool returns {"answer": "..."}
+         if "answer" in result:
+             evidence.append(result["answer"])
+         # Search tools return {"results": [...], "source": "...", "query": "..."}
+         elif "results" in result:
+             # Format search results as readable text
+             results_list = result.get("results", [])
+             if results_list:
+                 # Take first 3 results and format them
+                 formatted = []
+                 for r in results_list[:3]:
+                     title = r.get("title", "")[:100]
+                     url = r.get("url", "")[:100]
+                     snippet = r.get("snippet", "")[:200]
+                     formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+                 evidence.append("\n\n".join(formatted))
+             else:
+                 evidence.append(str(result))
+         else:
+             evidence.append(str(result))
+     elif isinstance(result, str):
+         evidence.append(result)
+     else:
+         evidence.append(str(result))

      except Exception as tool_error:
          logger.error(f"[execute_node] ✗ Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)
 
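The dict-handling rules added in that hunk can be collapsed into a pure helper for quick verification; this standalone sketch (the name `format_evidence` is illustrative) mirrors the branching above:

```python
def format_evidence(result) -> str:
    """Flatten a tool result into evidence text, mirroring the dict handling above."""
    if isinstance(result, dict):
        if "answer" in result:          # vision tool: {"answer": "..."}
            return result["answer"]
        if result.get("results"):       # search tools: {"results": [...], ...}
            parts = []
            for r in result["results"][:3]:  # keep only the first 3 hits
                parts.append(
                    f"Title: {r.get('title', '')[:100]}\n"
                    f"URL: {r.get('url', '')[:100]}\n"
                    f"Snippet: {r.get('snippet', '')[:200]}"
                )
            return "\n\n".join(parts)
        return str(result)              # unknown dict shape: stringify
    return result if isinstance(result, str) else str(result)

assert format_evidence({"answer": "42"}) == "42"
assert "Title: A" in format_evidence({"results": [{"title": "A", "url": "u", "snippet": "s"}]})
assert format_evidence(7) == "7"
```

Truncating titles, URLs, and snippets keeps the evidence string small enough to fit in the synthesis prompt even when all three search hits are verbose.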
      if not tool_calls:
          logger.info(f"[execute_node] Attempting fallback after exception...")
          try:
+             tool_calls = fallback_tool_selection(
+                 state["question"], state.get("plan", ""), state.get("file_paths")
+             )
              logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")

              # Try to execute fallback tools

                  "result": result,
                  "status": "success"
              })
+             # Extract evidence - handle different result formats
+             if isinstance(result, dict):
+                 if "answer" in result:
+                     evidence.append(result["answer"])
+                 elif "results" in result:
+                     results_list = result.get("results", [])
+                     if results_list:
+                         formatted = []
+                         for r in results_list[:3]:
+                             title = r.get("title", "")[:100]
+                             url = r.get("url", "")[:100]
+                             snippet = r.get("snippet", "")[:200]
+                             formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+                         evidence.append("\n\n".join(formatted))
+                     else:
+                         evidence.append(str(result))
+                 else:
+                     evidence.append(str(result))
+             elif isinstance(result, str):
+                 evidence.append(result)
+             else:
+                 evidence.append(str(result))
              logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
          except Exception as tool_error:
              logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")

      self.last_state = None  # Store last execution state for diagnostics
      print("GAIAAgent initialized successfully")

+ def __call__(self, question: str, file_path: Optional[str] = None) -> str:
      """
      Process question and return answer.
+     Supports optional file attachment for file-based questions.

      Args:
          question: GAIA question text
+         file_path: Optional path to downloaded file attachment

      Returns:
          Factoid answer string
      """
      print(f"GAIAAgent processing question (first 50 chars): {question[:50]}...")
+     if file_path:
+         print(f"GAIAAgent processing file: {file_path}")

      # Initialize state
      initial_state: AgentState = {
          "question": question,
+         "file_paths": [file_path] if file_path else None,
          "plan": None,
          "tool_calls": [],
          "tool_results": [],
src/agent/llm_client.py CHANGED
@@ -158,7 +158,10 @@ def _get_provider_function(function_name: str, provider: str) -> Callable:
158
 
159
  def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
160
  """
161
- Call LLM function with configured provider and optional fallback.
 
 
 
162
 
163
  Args:
164
  function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
@@ -168,55 +171,28 @@ def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
168
  Result from LLM call
169
 
170
  Raises:
171
- Exception: If selected provider fails and fallback disabled, or all providers fail
172
  """
173
  # Read config at runtime for UI flexibility
174
  primary_provider = os.getenv("LLM_PROVIDER", "gemini").lower()
175
- enable_fallback = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
176
 
177
- # Define fallback order (excluding primary provider)
178
- all_providers = ["gemini", "huggingface", "groq", "claude"]
179
- fallback_providers = [p for p in all_providers if p != primary_provider]
 
 
180
 
181
- # Try primary provider first
182
  try:
183
  primary_func = _get_provider_function(function_name, primary_provider)
184
- logger.info(f"[{function_name}] Using primary provider: {primary_provider}")
185
  return retry_with_backoff(lambda: primary_func(*args, **kwargs))
186
  except Exception as primary_error:
187
- logger.warning(
188
- f"[{function_name}] Primary provider {primary_provider} failed: {primary_error}"
 
189
  )
190
 
191
- # If fallback disabled, raise immediately
192
- if not enable_fallback:
193
- logger.error(f"[{function_name}] Fallback disabled. Failing fast.")
194
- raise Exception(
195
- f"{function_name} failed with {primary_provider}: {primary_error}. "
196
- f"Fallback disabled (ENABLE_LLM_FALLBACK=false)"
197
- )
198
-
199
- # Try fallback providers in order
200
- errors = {primary_provider: primary_error}
201
- for fallback_provider in fallback_providers:
202
- try:
203
- fallback_func = _get_provider_function(function_name, fallback_provider)
204
- logger.info(
205
- f"[{function_name}] Trying fallback provider: {fallback_provider}"
206
- )
207
- return retry_with_backoff(lambda: fallback_func(*args, **kwargs))
208
- except Exception as fallback_error:
209
- errors[fallback_provider] = fallback_error
210
- logger.warning(
211
- f"[{function_name}] Fallback provider {fallback_provider} failed: {fallback_error}"
212
- )
213
- continue
214
-
215
- # All providers failed
216
- error_summary = ", ".join([f"{k}: {v}" for k, v in errors.items()])
217
- logger.error(f"[{function_name}] All providers failed. {error_summary}")
218
- raise Exception(f"{function_name} failed with all providers. {error_summary}")
219
-
220
 
221
  # ============================================================================
222
  # Client Initialization
@@ -560,7 +536,7 @@ def plan_question(
560
 
561
 
562
  def select_tools_claude(
563
- question: str, plan: str, available_tools: Dict[str, Dict]
564
  ) -> List[Dict[str, Any]]:
565
  """Use Claude function calling to select tools and extract parameters."""
566
  client = create_claude_client()
@@ -580,15 +556,28 @@ def select_tools_claude(
580
  }
581
  )
582
 
 
 
 
 
 
 
 
 
 
 
 
 
583
  system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
584
 
585
  Few-shot examples:
586
  - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
587
  - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
588
- - "Analyze the image at example.com/pic.jpg" → vision(image_url="example.com/pic.jpg")
589
- - "What's in the uploaded Excel file?" → parse_file(file_path="<provided_path>")
590
 
591
  Execute the plan step by step. Extract correct parameters from the question.
 
592
 
593
  Plan:
594
  {plan}"""
@@ -633,7 +622,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
633
 
634
 
635
  def select_tools_gemini(
636
- question: str, plan: str, available_tools: Dict[str, Dict]
637
  ) -> List[Dict[str, Any]]:
638
  """Use Gemini function calling to select tools and extract parameters."""
639
  model = create_gemini_client()
@@ -665,15 +654,28 @@ def select_tools_gemini(
         )
     )
 
     prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(image_url="example.com/pic.jpg")
-- "What's in the uploaded Excel file?" → parse_file(file_path="<provided_path>")
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}
@@ -718,7 +720,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_hf(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
     client = create_hf_client()
@@ -748,15 +750,28 @@ def select_tools_hf(
 
         tools.append(tool_schema)
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(image_url="example.com/pic.jpg")
-- "What's in the uploaded Excel file?" → parse_file(file_path="<provided_path>")
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
@@ -766,7 +781,7 @@ Plan:
 Select and call the tools needed according to the plan. Use exact parameter names from tool schemas."""
 
     logger.info(
-        f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools"
     )
 
     messages = [
@@ -807,7 +822,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_groq(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Groq with function calling to select tools and extract parameters."""
     client = create_groq_client()
@@ -837,15 +852,28 @@ def select_tools_groq(
 
         tools.append(tool_schema)
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(image_url="example.com/pic.jpg")
-- "What's in the uploaded Excel file?" → parse_file(file_path="<provided_path>")
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
@@ -900,7 +928,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_with_function_calling(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """
     Use LLM function calling to dynamically select tools and extract parameters.
@@ -913,11 +941,12 @@ def select_tools_with_function_calling(
         question: GAIA question text
         plan: Execution plan from planning phase
         available_tools: Tool registry
 
         Result from LLM call
 
     Raises:
-    return _call_with_fallback("select_tools", question, plan, available_tools)
 
 
 # ============================================================================
 
 
 def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
     """
+    Call LLM function with the configured provider.
+
+    NOTE: The fallback mechanism has been archived to reduce complexity.
+    Only the primary provider is used; if it fails, the error is raised directly.
 
     Args:
         function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
 
         Result from LLM call
 
     Raises:
+        Exception: If the primary provider fails
     """
     # Read config at runtime for UI flexibility
     primary_provider = os.getenv("LLM_PROVIDER", "gemini").lower()
 
+    # ========================================================================
+    # ARCHIVED: Fallback mechanism removed to reduce complexity
+    # Original fallback code was at: dev/dev_260112_02_fallback_archived.md
+    # To restore: check git history or the archived dev file
+    # ========================================================================
 
+    # Try the primary provider only (no fallback)
     try:
         primary_func = _get_provider_function(function_name, primary_provider)
+        logger.info(f"[{function_name}] Using provider: {primary_provider}")
         return retry_with_backoff(lambda: primary_func(*args, **kwargs))
     except Exception as primary_error:
+        logger.error(f"[{function_name}] Provider {primary_provider} failed: {primary_error}")
+        raise Exception(
+            f"{function_name} failed with {primary_provider}: {primary_error}"
         )
 
 
 # ============================================================================
 # Client Initialization
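The single-provider path above still wraps the call in `retry_with_backoff`. For reference, a minimal sketch of such a helper (hypothetical signature and defaults — the project's actual implementation may differ):

```python
import random
import time


def retry_with_backoff(fn, retries=3, base_delay=1.0, max_delay=30.0):
    """Call fn(); on exception, sleep base_delay * 2**attempt (plus jitter)
    and retry. The last exception is re-raised once attempts are exhausted."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
```

Jitter avoids synchronized retries when several workers hit the same rate limit at once.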
 
 
 
 def select_tools_claude(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Claude function calling to select tools and extract parameters."""
     client = create_claude_client()
         }
     )
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 
 
 def select_tools_gemini(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Gemini function calling to select tools and extract parameters."""
     model = create_gemini_client()
 
         )
     )
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}
 
 
 
 def select_tools_hf(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
     client = create_hf_client()
 
 
         tools.append(tool_schema)
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 Select and call the tools needed according to the plan. Use exact parameter names from tool schemas."""
 
     logger.info(
+        f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools, file_paths={file_paths}"
     )
 
     messages = [
 
 
 
 def select_tools_groq(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Groq with function calling to select tools and extract parameters."""
     client = create_groq_client()
 
 
         tools.append(tool_schema)
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 
 
 def select_tools_with_function_calling(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """
     Use LLM function calling to dynamically select tools and extract parameters.
 
         question: GAIA question text
         plan: Execution plan from planning phase
         available_tools: Tool registry
+        file_paths: Optional list of downloaded file paths for file-based questions
 
     Returns:
         List of tool calls with extracted parameters
     """
+    return _call_with_fallback("select_tools", question, plan, available_tools, file_paths)
 
 
 # ============================================================================
src/tools/calculator.py CHANGED
@@ -93,20 +93,33 @@ def timeout(seconds: int):
 
     Raises:
         TimeoutError: If execution exceeds timeout
     """
     def timeout_handler(signum, frame):
         raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
 
-    # Set signal handler
-    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
-    signal.alarm(seconds)
 
     try:
         yield
     finally:
         # Restore old handler and cancel alarm
-        signal.alarm(0)
-        signal.signal(signal.SIGALRM, old_handler)
 
 
 # ============================================================================
 
 
     Raises:
         TimeoutError: If execution exceeds timeout
+
+    Note:
+        signal.alarm() only works in the main thread. In threaded contexts
+        (Gradio, ThreadPoolExecutor), timeout protection is disabled.
     """
     def timeout_handler(signum, frame):
         raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
 
+    try:
+        # Set signal handler (only works in the main thread)
+        old_handler = signal.signal(signal.SIGALRM, timeout_handler)
+        signal.alarm(seconds)
+        _alarm_set = True
+    except (ValueError, AttributeError):
+        # ValueError: signal.signal() called outside the main thread
+        # AttributeError: signal.SIGALRM not available (Windows)
+        logger.warning("Timeout protection disabled (threading/Windows limitation)")
+        _alarm_set = False
+        old_handler = None
 
     try:
         yield
     finally:
         # Restore old handler and cancel alarm
+        if _alarm_set and old_handler is not None:
+            signal.alarm(0)
+            signal.signal(signal.SIGALRM, old_handler)
 
 
 # ============================================================================
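With `signal.alarm()` unavailable off the main thread, the fix above simply disables the timeout there. A thread-safe alternative worth considering (a sketch, not the committed code) runs the evaluation in a worker and bounds the wait instead:

```python
import concurrent.futures


def eval_with_timeout(expression: str, seconds: float = 5.0):
    """Evaluate an arithmetic expression with a timeout that works from any
    thread and on Windows. Caveat: the worker thread is abandoned on timeout,
    not killed, so a runaway expression still burns CPU until it finishes."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    # Restrict builtins as a minimal safety measure, mirroring a sandboxed eval
    future = pool.submit(eval, expression, {"__builtins__": {}}, {})
    try:
        return future.result(timeout=seconds)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
    finally:
        pool.shutdown(wait=False, cancel_futures=True)
```

`shutdown(wait=False, cancel_futures=True)` (Python 3.9+) prevents the caller from blocking on the abandoned worker; for hard kills a process pool would be needed.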
test/test_quick_fixes.py ADDED
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""
+Quick test script for specific GAIA questions.
+Use this to verify fixes without running the full evaluation.
+
+Usage:
+    uv run python test/test_quick_fixes.py
+"""
+
+import os
+import sys
+
+# Add project root to path
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from src.agent.graph import GAIAAgent
+from dotenv import load_dotenv
+
+# Load environment variables
+load_dotenv()
+
+# ============================================================================
+# CONFIG - Questions to test
+# ============================================================================
+
+TEST_QUESTIONS = [
+    {
+        "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+        "name": "Reverse sentence (calculator threading fix)",
+        "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+        "expected": "Right",
+    },
+    {
+        "task_id": "6f37996b-2ac7-44b0-8e68-6d28256631b4",
+        "name": "Table commutativity (LLM issue - table in question)",
+        "question": '''Given this table defining * on the set S = {a, b, c, d, e}
+
+|*|a|b|c|d|e|
+|---|---|---|---|---|---|
+|a|a|b|c|b|d|
+|b|b|c|a|e|c|
+|c|c|a|b|b|a|
+|d|b|e|b|e|d|
+|e|d|b|a|d|c|
+
+provide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.''',
+        "expected": "b, e",
+    },
+]
+
+# ============================================================================
+
+
+def test_question(agent: GAIAAgent, test_case: dict) -> dict:
+    """Test a single question and return the result."""
+    task_id = test_case["task_id"]
+    question = test_case["question"]
+    expected = test_case.get("expected", "N/A")
+
+    print(f"\n{'='*60}")
+    print(f"Testing: {test_case['name']}")
+    print(f"Task ID: {task_id}")
+    print(f"Expected: {expected}")
+    print(f"{'='*60}")
+
+    try:
+        answer = agent(question, file_path=None)
+
+        # Check if the answer matches the expected value
+        is_correct = answer.strip().lower() == expected.strip().lower()
+
+        result = {
+            "task_id": task_id,
+            "name": test_case["name"],
+            "question": question[:100] + "..." if len(question) > 100 else question,
+            "expected": expected,
+            "answer": answer,
+            "correct": is_correct,
+            "status": "success",
+        }
+
+        # Determine system error
+        if not answer:
+            result["system_error"] = "yes"
+        elif answer.lower().startswith("error:") or "no evidence collected" in answer.lower():
+            result["system_error"] = "yes"
+            result["error_log"] = answer
+        else:
+            result["system_error"] = "no"
+
+    except Exception as e:
+        result = {
+            "task_id": task_id,
+            "name": test_case["name"],
+            "question": question[:100] + "..." if len(question) > 100 else question,
+            "expected": expected,
+            "answer": f"ERROR: {str(e)}",
+            "correct": False,
+            "status": "error",
+            "system_error": "yes",
+            "error_log": str(e),
+        }
+
+    # Print result
+    status_icon = "✅" if result["correct"] else "❌" if result["system_error"] == "no" else "⚠️"
+    print(f"\n{status_icon} Result: {result['answer'][:100]}")
+    if result["system_error"] == "yes":
+        print("  System Error: Yes")
+        if result.get("error_log"):
+            print(f"  Error: {result['error_log'][:100]}")
+
+    return result
+
+
+def main():
+    """Run quick tests on specific questions."""
+    print("\n" + "="*60)
+    print("GAIA Quick Test - Verify Fixes")
+    print("="*60)
+
+    # Check LLM provider
+    llm_provider = os.getenv("LLM_PROVIDER", "gemini")
+    print(f"\nLLM Provider: {llm_provider}")
+
+    # Initialize agent
+    print("\nInitializing agent...")
+    try:
+        agent = GAIAAgent()
+        print("✅ Agent initialized")
+    except Exception as e:
+        print(f"❌ Agent initialization failed: {e}")
+        return
+
+    # Run tests
+    results = []
+    for test_case in TEST_QUESTIONS:
+        result = test_question(agent, test_case)
+        results.append(result)
+
+    # Summary
+    print(f"\n{'='*60}")
+    print("SUMMARY")
+    print(f"{'='*60}")
+
+    success_count = sum(1 for r in results if r["correct"])
+    error_count = sum(1 for r in results if r["system_error"] == "yes")
+    ai_fail_count = sum(1 for r in results if r["system_error"] == "no" and not r["correct"])
+
+    print(f"\nTotal: {len(results)}")
+    print(f"✅ Correct: {success_count}")
+    print(f"⚠️ System Errors: {error_count}")
+    print(f"❌ AI Wrong: {ai_fail_count}")
+
+    # Detailed results
+    print("\nDetailed Results:")
+    for r in results:
+        status = "✅" if r["correct"] else "⚠️" if r["system_error"] == "yes" else "❌"
+        print(f"  {status} {r['name']}: {r['answer'][:50]}{'...' if len(r['answer']) > 50 else ''}")
+
+
+if __name__ == "__main__":
+    main()
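The script compares answers by exact case-insensitive string match. Official GAIA scoring is somewhat more forgiving about whitespace and trailing punctuation; a loose normalization along these lines (an assumption about the scorer, not its actual implementation) would reduce false "AI Wrong" counts:

```python
import re


def normalize_answer(s: str) -> str:
    """Lowercase, collapse internal whitespace, and strip surrounding
    punctuation so that 'Right.' and ' right ' compare equal."""
    s = re.sub(r"\s+", " ", s.strip().lower())
    return s.strip(" .\"'")
```

`is_correct` could then be `normalize_answer(answer) == normalize_answer(expected)`.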