feat: system error field, calculator fix, target task IDs, course vs GAIA docs
Changes:
- System error field: Changed to boolean yes/no with error_log
- Calculator threading fix: Handle signal.alarm() failure in threads
- Target task IDs: Debug feature to run specific questions
- Course vs GAIA: Documentation distinguishing course API from official GAIA
- Quick test script: test/test_quick_fixes.py for targeted testing
Modified:
- app.py: System error field, target task IDs UI, submission logic
- src/tools/calculator.py: Thread-safe timeout handling
- src/agent/graph.py: Evidence formatting for dict results
- src/agent/llm_client.py: Fallback mechanism archived
- CHANGELOG.md: All changes documented
- README.md: Added submission guide reference
Added:
- docs/gaia_submission_guide.md: Complete submission guide
- test/test_quick_fixes.py: Targeted question testing
Co-Authored-By: Claude <noreply@anthropic.com>
- .gitignore +4 -0
- CHANGELOG.md +278 -0
- PLAN.md +165 -0
- README.md +2 -0
- app.py +178 -32
- docs/gaia_submission_guide.md +120 -0
- src/agent/graph.py +100 -17
- src/agent/llm_client.py +83 -54
- src/tools/calculator.py +18 -5
- test/test_quick_fixes.py +162 -0
|
@@ -29,6 +29,10 @@ Thumbs.db
|
|
| 29 |
|
| 30 |
# Input documents (PDFs not allowed in HF Spaces)
|
| 31 |
input/*.pdf
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
|
| 33 |
# Runtime cache (not in git, served via app download)
|
| 34 |
_cache/
|
|
|
|
| 29 |
|
| 30 |
# Input documents (PDFs not allowed in HF Spaces)
|
| 31 |
input/*.pdf
|
| 32 |
+
input/
|
| 33 |
+
|
| 34 |
+
# Downloaded GAIA question files
|
| 35 |
+
input/*
|
| 36 |
|
| 37 |
# Runtime cache (not in git, served via app download)
|
| 38 |
_cache/
|
|
**CHANGELOG.md** (`@@ -1,5 +1,283 @@` - the entries below were added at the top, under the existing heading)

# Session Changelog
## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification

**Problem:** Confusion about which leaderboard we're submitting to. We mistakenly thought we needed to submit to the official GAIA leaderboard, but we're actually implementing the course assignment API.

**Root Cause:** Template code includes the course API (`agents-course-unit4-scoring.hf.space`), but the documentation didn't clarify the distinction between the course leaderboard and the official GAIA leaderboard.

**Solution:** Created `docs/gaia_submission_guide.md` documenting:
- **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
- **Official GAIA Leaderboard** (future): 450+ questions, different submission format
- API routes, submission formats, scoring differences
- Development workflow for both

**Key Clarifications:**

| Aspect | Course | Official GAIA |
|--------|--------|--------------|
| API | `agents-course-unit4-scoring.hf.space` | `gaia-benchmark/leaderboard` Space |
| Questions | 20 (level 1) | 450+ (all levels) |
| Target | 30% (6/20) | Competitive placement |
| Debug features | Target Task IDs, Question Limit | Must submit ALL |
| Submission | JSON POST | File upload |

**Created Files:**
- **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards

**Modified Files:**
- **README.md** - Added note linking to the submission guide

---
## [2026-01-12] [Feature] [COMPLETED] Target Specific Task IDs

**Problem:** No way to run specific questions for debugging. Had to run the full evaluation or use the "first N" limit, which is inefficient for targeted fixes.

**Solution:** Added a "Target Task IDs (Debug)" field in the Full Evaluation tab. Enter comma-separated task IDs to run only those questions.

**Implementation:**
- Added `eval_task_ids` textbox in the UI (lines 763-770)
- Updated the `run_and_submit_all()` signature: `task_ids: str = ""` parameter
- Filtering logic: parses comma-separated IDs, filters `questions_data`
- Shows a missing-IDs warning if a task_id is not found in the dataset
- Overrides `question_limit` when provided

**Usage:**
```
Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
```

**Modified Files:**
- **app.py** (~30 lines added)
  - UI: `eval_task_ids` textbox
  - `run_and_submit_all()`: added `task_ids` parameter, filtering logic
  - `run_button.click()`: pass task_ids to the function

---
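The filtering logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual app.py code; the function name and return shape are assumptions.

```python
def filter_by_task_ids(questions_data: list, task_ids: str):
    """Keep only questions whose task_id appears in a comma-separated list.

    Returns (filtered_questions, missing_ids). An empty/blank task_ids string
    means no filtering, mirroring the "run everything" default.
    """
    wanted = [t.strip() for t in task_ids.split(",") if t.strip()]
    if not wanted:
        return questions_data, []
    available = {q.get("task_id") for q in questions_data}
    missing = [t for t in wanted if t not in available]  # warn about these
    wanted_set = set(wanted)
    filtered = [q for q in questions_data if q.get("task_id") in wanted_set]
    return filtered, missing
```

When `filtered` is non-empty, a debug run would process only those entries and ignore the question limit, as the changelog notes.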
## [2026-01-12] [Bug Fix] [COMPLETED] Calculator Threading Issue

**Problem:** The calculator tool fails with `ValueError: signal only works in main thread of the main interpreter` when running in Gradio's ThreadPoolExecutor context.

**Root Cause:** `signal.alarm()` only works in the main thread. Our agent uses `ThreadPoolExecutor` for concurrent processing (max_workers=5).

**Solution:** Made timeout protection optional - catch ValueError/AttributeError and disable the timeout with a warning when not in the main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).

**Modified Files:**
- **src/tools/calculator.py** (~15 lines modified)
  - `timeout()` context manager: try/except for `signal.alarm()` failure
  - Logs a warning when timeout protection is disabled
  - Gracefully handles Windows (AttributeError for SIGALRM)

---
## [2026-01-12] [Feature] [COMPLETED] System Error Field

**Problem:** The "Unable to answer" output was ambiguous - unclear whether it was a technical failure or an AI response. User requested a simpler distinction: system error vs AI answer.

**Solution:** Changed to a boolean `system_error: yes/no` field:
- `system_error: yes` - technical/system error from our code (don't submit)
- `system_error: no` - AI response (submit the answer, even if wrong)
- Added an `error_log` field with full error details for system errors

**Implementation:**
- `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
- Results table: "System Error" column (yes/no), "Error Log" column (when yes)
- JSON export: `system_error` field, `error_log` field (when system error)
- Submission logic: only submit when `system_error == "no"`

**Modified Files:**
- **app.py** (~30 lines modified)
  - `a_determine_status()`: returns a tuple instead of a string
  - `process_single_question()`: uses the new format, adds `error_log`
  - Results table: "System Error" + "Error Log" columns
  - `export_results_to_json()`: include `system_error` and `error_log`

---
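The `(is_error, error_log)` contract can be illustrated with a sketch. The real helper is `a_determine_status()` in app.py; the name, the marker strings, and the heuristics below are illustrative assumptions, only the tuple shape comes from the changelog:

```python
from typing import Optional, Tuple

# Hypothetical markers for answers produced by our own error paths.
SYSTEM_ERROR_MARKERS = ("AGENT ERROR", "Traceback")

def determine_status(answer: str, raised: Optional[Exception]) -> Tuple[bool, Optional[str]]:
    """Return (is_system_error, error_log).

    is_system_error=True: our code failed, don't submit, keep the log.
    is_system_error=False: a genuine AI response, submit it even if wrong.
    """
    if raised is not None:
        return True, f"{type(raised).__name__}: {raised}"
    if any(marker in answer for marker in SYSTEM_ERROR_MARKERS):
        return True, answer
    return False, None
```

The caller would then map `True` to `"yes"` / `False` to `"no"` for the results table and only POST answers with `system_error == "no"`.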
| 97 |
+
|
| 98 |
+
## [2026-01-12] [Refactoring] [COMPLETED] Fallback UI Removal
|
| 99 |
+
|
| 100 |
+
**Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
|
| 101 |
+
|
| 102 |
+
**Solution:** Removed all fallback-related UI elements:
|
| 103 |
+
- Removed `enable_fallback_checkbox` from Test Question tab
|
| 104 |
+
- Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
|
| 105 |
+
- Removed `enable_fallback` parameter from `test_single_question()` function
|
| 106 |
+
- Removed `enable_fallback` parameter from `run_and_submit_all()` function
|
| 107 |
+
- Removed `ENABLE_LLM_FALLBACK` environment variable setting
|
| 108 |
+
- Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
|
| 109 |
+
|
| 110 |
+
**Modified Files:**
|
| 111 |
+
- **app.py** (~20 lines removed)
|
| 112 |
+
- Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
|
| 113 |
+
- Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
|
| 114 |
+
- Updated `test_button.click()` inputs to remove checkbox reference
|
| 115 |
+
- Updated `run_button.click()` inputs to remove checkbox reference
|
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
|
| 120 |
+
|
| 121 |
+
**Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
|
| 122 |
+
- 4 providers to test for each feature
|
| 123 |
+
- Complex debugging with multiple code paths
|
| 124 |
+
- Longer, less clear error messages
|
| 125 |
+
- Adding complexity without clear benefit
|
| 126 |
+
|
| 127 |
+
**Solution:** Archive fallback mechanism, use single provider only
|
| 128 |
+
- Removed fallback provider loop (Gemini → HF → Groq → Claude)
|
| 129 |
+
- Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
|
| 130 |
+
- If provider fails, error is raised immediately
|
| 131 |
+
- Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
|
| 132 |
+
|
| 133 |
+
**Benefits:**
|
| 134 |
+
- ✅ Reduced code complexity
|
| 135 |
+
- ✅ Faster debugging (one code path)
|
| 136 |
+
- ✅ Clearer error messages
|
| 137 |
+
- ✅ No double work on features
|
| 138 |
+
|
| 139 |
+
**Modified Files:**
|
| 140 |
+
- **src/agent/llm_client.py** (~25 lines removed)
|
| 141 |
+
- Simplified `_call_with_fallback()`: Removed fallback logic
|
| 142 |
+
- **dev/dev_260112_02_fallback_archived.md** (NEW)
|
| 143 |
+
- Archived fallback code documentation
|
| 144 |
+
- Migration guide for restoration if needed
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Search Results Not Being Extracted
|
| 149 |
+
|
| 150 |
+
**Problem:** Score dropped from 5% → 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
|
| 151 |
+
|
| 152 |
+
**Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
|
| 153 |
+
```python
|
| 154 |
+
{"results": [...], "source": "tavily", "query": "...", "count": N}
|
| 155 |
+
```
|
| 156 |
+
|
| 157 |
+
**Solution:** Handle both dict formats in evidence extraction:
|
| 158 |
+
```python
|
| 159 |
+
if isinstance(result, dict):
|
| 160 |
+
if "answer" in result:
|
| 161 |
+
evidence.append(result["answer"]) # Vision tools
|
| 162 |
+
elif "results" in result:
|
| 163 |
+
# Format search results as readable text
|
| 164 |
+
results_list = result.get("results", [])
|
| 165 |
+
formatted = []
|
| 166 |
+
for r in results_list[:3]:
|
| 167 |
+
formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
|
| 168 |
+
evidence.append("\n\n".join(formatted)) # Search tools
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
**Modified Files:**
|
| 172 |
+
- **src/agent/graph.py** (~40 lines modified)
|
| 173 |
+
- Updated evidence extraction in primary path
|
| 174 |
+
- Updated evidence extraction in fallback path
|
| 175 |
+
|
| 176 |
+
**Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
|
| 177 |
+
|
| 178 |
+
**Summary of Fixes (Session 2026-01-12):**
|
| 179 |
+
1. ✅ File download from HF dataset (5/5 files)
|
| 180 |
+
2. ✅ Absolute paths from script location
|
| 181 |
+
3. ✅ Evidence formatting for vision tools (dict → answer)
|
| 182 |
+
4. ✅ Evidence formatting for search tools (dict → formatted text)
|
| 183 |
+
|
| 184 |
+
---
|
| 185 |
+
|
| 186 |
+
## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Dict Results Not Being Extracted
|
| 187 |
+
|
| 188 |
+
**Problem:** Chess vision question returned "Unable to answer" even though vision tool correctly extracted the chess position.
|
| 189 |
+
|
| 190 |
+
**Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
|
| 191 |
+
|
| 192 |
+
**Solution:** Extract 'answer' field from dict results before adding to evidence:
|
| 193 |
+
```python
|
| 194 |
+
# Before
|
| 195 |
+
evidence.append(f"[{tool_name}] {result}") # Dict → string representation
|
| 196 |
+
|
| 197 |
+
# After
|
| 198 |
+
if isinstance(result, dict) and "answer" in result:
|
| 199 |
+
evidence.append(result["answer"]) # Extract answer field
|
| 200 |
+
elif isinstance(result, str):
|
| 201 |
+
evidence.append(result)
|
| 202 |
+
```
|
| 203 |
+
|
| 204 |
+
**Modified Files:**
|
| 205 |
+
- **src/agent/graph.py** (~15 lines modified)
|
| 206 |
+
- Updated `execute_node()`: Extract 'answer' from dict results
|
| 207 |
+
- Fixed both primary and fallback execution paths
|
| 208 |
+
|
| 209 |
+
**Test Result:** Simple search questions now work. Chess question still fails due to vision tool extracting wrong turn indicator (w instead of b).
|
| 210 |
+
|
| 211 |
+
**Known Issue:** Vision tool extracts "w - - 0 1" (White's turn) but question asks for Black's move. Ground truth is "Rd5" (Black move), so FEN extraction may have error.
|
| 212 |
+
|
| 213 |
+
---
|
| 214 |
+
|
| 215 |
+
## [2026-01-12] [File Download Fix] [COMPLETED] Absolute Path Fix - Vision Tool Now Works
|
| 216 |
+
|
| 217 |
+
**Problem:** Chess vision question returned "Unable to answer" even though file was downloaded successfully.
|
| 218 |
+
|
| 219 |
+
**Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
|
| 220 |
+
|
| 221 |
+
**Solution:** Return absolute paths from `download_task_file()`
|
| 222 |
+
- Changed: `target_path = os.path.join(save_dir, file_name)`
|
| 223 |
+
- To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
|
| 224 |
+
- Now tools can find files regardless of working directory
|
| 225 |
+
|
| 226 |
+
**Modified Files:**
|
| 227 |
+
- **app.py** (~3 lines modified)
|
| 228 |
+
- Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
|
| 229 |
+
|
| 230 |
+
**Test Result:** Vision tool now works with absolute path - correctly analyzes chess position
|
| 231 |
+
|
| 232 |
+
---
|
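The one-line change can be illustrated in isolation (helper name and paths are examples, not the app.py code):

```python
import os

def to_absolute(save_dir: str, file_name: str) -> str:
    """Resolve a download target against the cwd at download time.

    A relative path like "_cache/gaia_files/x.png" breaks once a worker
    thread runs with a different working directory; resolving eagerly
    pins the path so later Path(...).exists() checks succeed.
    """
    return os.path.abspath(os.path.join(save_dir, file_name))
```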
| 233 |
+
|
| 234 |
+
## [2026-01-12] [File Download Fix] [COMPLETED] GAIA File API Dead End - Switch to HF Dataset
|
| 235 |
+
|
| 236 |
+
**Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
|
| 237 |
+
|
| 238 |
+
**Investigation:**
|
| 239 |
+
- Checked API spec: Endpoint exists with proper documentation
|
| 240 |
+
- Tested download: HTTP 404 "No file path associated with task_id"
|
| 241 |
+
- Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
|
| 242 |
+
- Confirmed via Swagger UI: Same 404 error
|
| 243 |
+
|
| 244 |
+
**Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
|
| 245 |
+
|
| 246 |
+
**Solution:** Switch from evaluation API to GAIA dataset download
|
| 247 |
+
- Use `huggingface_hub.hf_hub_download()` to fetch files
|
| 248 |
+
- Download to `_cache/gaia_files/` (runtime cache)
|
| 249 |
+
- File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
|
| 250 |
+
- Added cache checking (reuse downloaded files)
|
| 251 |
+
|
| 252 |
+
**Files with attachments (5/20 questions):**
|
| 253 |
+
- `cca530fc`: Chess position image (.png)
|
| 254 |
+
- `99c9cc74`: Pie recipe audio (.mp3)
|
| 255 |
+
- `f918266a`: Python code (.py)
|
| 256 |
+
- `1f975693`: Calculus audio (.mp3)
|
| 257 |
+
- `7bd855d8`: Menu sales Excel (.xlsx)
|
| 258 |
+
|
| 259 |
+
**Modified Files:**
|
| 260 |
+
- **app.py** (~70 lines modified)
|
| 261 |
+
- Updated `download_task_file()`: Changed from evaluation API to HF dataset download
|
| 262 |
+
- Changed signature: `download_task_file(task_id, file_name, save_dir)`
|
| 263 |
+
- Added `huggingface_hub` import with cache checking
|
| 264 |
+
- Default directory: `_cache/gaia_files/` (runtime cache, not git)
|
| 265 |
+
- Flat file structure: `_cache/gaia_files/{file_name}`
|
| 266 |
+
- **app.py** (~5 lines modified)
|
| 267 |
+
- Updated `process_single_question()`: Pass `file_name` to download function
|
| 268 |
+
|
| 269 |
+
**Known Limitations:**
|
| 270 |
+
- Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
|
| 271 |
+
- `.mp3` audio files still unsupported
|
| 272 |
+
- `.py` code execution still unsupported
|
| 273 |
+
|
| 274 |
+
**Next Steps:**
|
| 275 |
+
1. Test new download implementation
|
| 276 |
+
2. Expand tool support for .mp3 (audio transcription)
|
| 277 |
+
3. Expand tool support for .py (code execution)
|
| 278 |
+
|
| 279 |
+
---
|
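The download-with-cache approach described above can be sketched as follows. This is a hedged sketch, not the actual app.py code: the repo id `gaia-benchmark/GAIA` and the `2023/{split}/` layout are assumptions based on the public GAIA dataset (which is gated, so a valid HF token must be present in the environment for the network path to work):

```python
import shutil
from pathlib import Path
from typing import Optional

CACHE_DIR = Path("_cache/gaia_files")

def download_task_file(task_id: str, file_name: str, split: str = "validation") -> Optional[str]:
    """Fetch a GAIA attachment into a flat local cache, reusing prior downloads."""
    if not file_name:
        return None  # question has no attachment
    target = CACHE_DIR / file_name
    if target.exists():  # cache hit: reuse the previously downloaded file
        return str(target.resolve())
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Lazy import so the cache-hit path works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    src = hf_hub_download(
        repo_id="gaia-benchmark/GAIA",          # assumed dataset repo (gated)
        filename=f"2023/{split}/{file_name}",   # assumed path layout
        repo_type="dataset",
    )
    shutil.copy(src, target)
    return str(target.resolve())  # absolute path, per the earlier path fix
```

Returning `target.resolve()` ties in with the absolute-path fix above: downstream tools can open the file regardless of working directory.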
## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA

**Problem:** Need to validate HF vision works before complex GAIA evaluation.
**PLAN.md** (`@@ -531,3 +531,168 @@ If Phase 0 reveals HF Inference API doesn't support vision:` - Phase 7 appended after the existing closing steps)

2. Test simple vision API call with Phi-3.5-vision-instruct
3. Document working pattern or confirm API doesn't support vision
4. Decision gate: GO to Phase 1 or pivot to backup options
---

## Phase 7: GAIA File Attachment Support

**Goal:** Enable the agent to download and process file attachments from GAIA questions.

**Problem:**
- Current code ignores the `file_name` field in GAIA questions
- Files are not downloaded from the `GET /files/{task_id}` endpoint
- Vision/file-parsing tools fail with the placeholder `<provided_image_path>`
- ~40% of questions (8/20) fail due to missing file handling

### Root Cause

**GAIA Question Structure:**
```json
{
  "task_id": "abc123",
  "question": "What's in this image?",
  "file_name": "chess.png",     // NULL if no file
  "file_path": "/files/abc123"  // NULL if no file
}
```

**Current Code (app.py:249-290):**
```python
def process_single_question(agent, item, index, total):
    task_id = item.get("task_id")
    question_text = item.get("question")
    # ❌ MISSING: Check file_name
    # ❌ MISSING: Download file
    # ❌ MISSING: Pass file_path to agent

    submitted_answer = agent(question_text)  # No file handling
```

**Result:** The LLM generates `vision(image_path="<provided_image_path>")` → File not found error

### Solution Architecture

**Step 1: Add File Download Function**

```python
import requests
from pathlib import Path
from typing import Optional

def download_task_file(task_id: str, save_dir: str = "input/") -> Optional[str]:
    """Download file attached to a GAIA question.

    Args:
        task_id: Question's task_id
        save_dir: Directory to save file

    Returns:
        File path if downloaded, None if no file
    """
    api_url = "https://agents-course-unit4-scoring.hf.space"
    file_url = f"{api_url}/files/{task_id}"

    response = requests.get(file_url, timeout=30)
    response.raise_for_status()

    # Get extension from Content-Type header
    content_type = response.headers.get('Content-Type', '')
    extension_map = {
        'image/png': '.png',
        'image/jpeg': '.jpg',
        'application/pdf': '.pdf',
        'text/csv': '.csv',
        'application/json': '.json',
        'application/vnd.ms-excel': '.xls',
        'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': '.xlsx',
    }
    extension = extension_map.get(content_type, '.file')

    # Save file
    Path(save_dir).mkdir(exist_ok=True)
    file_path = f"{save_dir}{task_id}{extension}"
    with open(file_path, 'wb') as f:
        f.write(response.content)

    return file_path
```

**Step 2: Modify Question Processing**

```python
def process_single_question(agent, item, index, total):
    task_id = item.get("task_id")
    question_text = item.get("question")
    file_name = item.get("file_name")  # ✅ NEW

    # Download file if exists
    file_path = None
    if file_name:
        file_path = download_task_file(task_id)

    # Pass file info to agent
    submitted_answer = agent(question_text, file_path=file_path)  # ✅ NEW
```

**Step 3: Update LLM Context**

When file_path is provided, include it in the planning prompt:
```python
if file_path:
    question_context = f"Question: {question_text}\nAttached file: {file_path}"
else:
    question_context = question_text
```

### Implementation Steps

#### Step 7.1: Add File Download Function

- [ ] Create `download_task_file()` in app.py
- [ ] Handle Content-Type to extension mapping
- [ ] Handle 404 gracefully (no file for this task)
- [ ] Create input/ directory if it does not exist

#### Step 7.2: Modify Question Processing Loop

- [ ] Check `item.get("file_name")` in process_single_question
- [ ] Call download_task_file() if file_name exists
- [ ] Pass file_path to the agent invocation

#### Step 7.3: Update Agent to Handle file_path

- [ ] Modify the agent to accept an optional file_path parameter
- [ ] Include file info in the planning prompt
- [ ] Update tool selection to use real file paths

#### Step 7.4: Test File Handling

- [ ] Test with an image question (chess position)
- [ ] Test with a document question (Excel file)
- [ ] Verify no more `<provided_image_path>` errors

### Files to Modify

1. **app.py** (~80 lines added/modified)
   - Add download_task_file() function
   - Modify process_single_question() to handle files
   - Add input/ directory creation

2. **src/agent/graph.py** (~20 lines)
   - Update agent state to include file_path
   - Pass file info to the planning prompt

3. **.gitignore** (~2 lines)
   - Add input/ to ignore downloaded files

### Success Criteria

- [ ] Image questions: vision tool receives a real file path
- [ ] Document questions: parse_file tool receives a real file path
- [ ] No more `<provided_image_path>` errors
- [ ] Accuracy improves from 10% toward 30%+

### Expected Impact

| Before | After |
|--------|-------|
| 40% (8/20) fail with file errors | 0% file errors |
| Vision questions: all fail | Vision questions: can work |
| Document questions: all fail | Document questions: can work |
| Max accuracy: ~60% | Max accuracy: ~100% potential |
**README.md**

```diff
@@ -348,6 +348,8 @@ ENABLE_LLM_FALLBACK=false # Disable fallback for debugging single provider
 
 **Test Coverage:** 99 passing tests (~2min 40sec runtime)
 
+> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](docs/gaia_submission_guide.md) for distinction between Course and Official GAIA leaderboards.
+
 ## Workflow
 
 ### Dev Record Workflow
```
**app.py** (added lines in some hunks were lost in extraction and are marked `...`)

```diff
@@ -1,17 +1,18 @@
 import os
 import gradio as gr
 import requests
-import inspect
 import pandas as pd
 import logging
 import json
 import time
...
 from concurrent.futures import ThreadPoolExecutor, as_completed
 
 # Stage 1: Import GAIAAgent (LangGraph-based agent)
 from src.agent import GAIAAgent
 
 # Import ground truth comparison
...
 from src.utils.ground_truth import get_ground_truth
 
 # Configure logging
```
@@ -99,9 +100,14 @@ def export_results_to_json(
|
|
| 99 |
result_dict = {
|
| 100 |
"task_id": result.get("Task ID", "N/A"),
|
| 101 |
"question": result.get("Question", "N/A"),
|
|
|
|
| 102 |
"submitted_answer": result.get("Submitted Answer", "N/A"),
|
| 103 |
}
|
| 104 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
# Add correctness if available
|
| 106 |
if result.get("Correct?"):
|
| 107 |
result_dict["correct"] = (
|
|
@@ -201,7 +207,81 @@ def format_diagnostics(final_state: dict) -> str:
|
|
| 201 |
return "\n".join(diagnostics)
|
| 202 |
|
| 203 |
|
| 204 |
-
def
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 205 |
"""Test agent with a single question and return diagnostics."""
|
| 206 |
if not question or not question.strip():
|
| 207 |
return "Please enter a question.", "", check_api_keys()
|
|
@@ -209,11 +289,8 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
|
|
| 209 |
try:
|
| 210 |
# Set LLM provider from UI selection (overrides .env)
|
| 211 |
os.environ["LLM_PROVIDER"] = llm_provider.lower()
|
| 212 |
-
os.environ["ENABLE_LLM_FALLBACK"] = "true" if enable_fallback else "false"
|
| 213 |
|
| 214 |
-
logger.info(
|
| 215 |
-
f"UI Config: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
|
| 216 |
-
)
|
| 217 |
|
| 218 |
# Initialize agent
|
| 219 |
agent = GAIAAgent()
|
|
@@ -225,7 +302,7 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
|
|
| 225 |
final_state = agent.last_state or {}
|
| 226 |
|
| 227 |
# Format diagnostics with LLM provider info
|
| 228 |
-
provider_info = f"**LLM Provider:** {llm_provider}
|
| 229 |
diagnostics = provider_info + format_diagnostics(final_state)
|
| 230 |
api_status = check_api_keys()
|
| 231 |
|
|
@@ -246,12 +323,34 @@ def test_single_question(question: str, llm_provider: str, enable_fallback: bool
|
|
| 246 |
# Stage 6: Async processing with ThreadPoolExecutor
|
| 247 |
|
| 248 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 249 |
def process_single_question(agent, item, index, total):
|
| 250 |
"""Process single question with agent, return result with error handling.
|
|
|
|
| 251 |
|
| 252 |
Args:
|
| 253 |
agent: GAIAAgent instance
|
| 254 |
-
item: Question item dict with task_id and
|
| 255 |
index: Question index (0-based)
|
| 256 |
total: Total number of questions
|
| 257 |
|
|
@@ -260,40 +359,64 @@ def process_single_question(agent, item, index, total):

    """
    task_id = item.get("task_id")
    question_text = item.get("question")

    if not task_id or question_text is None:
        return {
            "task_id": task_id,
            "question": question_text,
-            "answer":
            "error": True,
        }

    try:
        logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
-
        logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")

        return {
            "task_id": task_id,
            "question": question_text,
            "answer": submitted_answer,
            "error": False,
        }
    except Exception as e:
        logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
        return {
            "task_id": task_id,
            "question": question_text,
-            "answer":
            "error": True,
        }


def run_and_submit_all(
    llm_provider: str,
-    enable_fallback: bool,
    question_limit: int = 0,
    profile: gr.OAuthProfile | None = None,
):
    """

@@ -302,8 +425,8 @@ def run_and_submit_all(

    Args:
        llm_provider: LLM provider to use
-        enable_fallback: Whether to enable fallback to other providers
        question_limit: Limit number of questions (0 = process all)
        profile: OAuth profile for HF login
    """
    # Start execution timer

@@ -325,10 +448,7 @@ def run_and_submit_all(

    # Set LLM provider from UI selection (overrides .env)
    os.environ["LLM_PROVIDER"] = llm_provider.lower()
-
-    logger.info(
-        f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}, ENABLE_LLM_FALLBACK={enable_fallback}"
-    )

    # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
    try:

@@ -366,6 +486,27 @@ def run_and_submit_all(

            f"DEBUG MODE: Processing only {limit} questions (set to 0 to process all)"
        )

        print(f"Processing {len(questions_data)} questions.")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching questions: {e}")

@@ -421,9 +562,16 @@ def run_and_submit_all(

        result_entry = {
            "Task ID": result["task_id"],
            "Question": result["question"],
-            "
        }

        # Add ground truth data if available
        if is_correct is not None:
            result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"

@@ -433,8 +581,8 @@ def run_and_submit_all(

        results_log.append(result_entry)

-        # Add to submission payload if no error
-        if
            answers_payload.append(
                {"task_id": result["task_id"], "submitted_answer": result["answer"]}
            )

@@ -575,11 +723,6 @@ with gr.Blocks() as demo:

            value="HuggingFace",
            info="Select which LLM to use for this test",
        )
-        enable_fallback_checkbox = gr.Checkbox(
-            label="Enable Fallback",
-            value=False,
-            info="If enabled, falls back to other providers on failure",
-        )

        test_button = gr.Button("Run Test", variant="primary")

@@ -601,7 +744,6 @@ with gr.Blocks() as demo:

        inputs=[
            test_question_input,
            llm_provider_dropdown,
-            enable_fallback_checkbox,
        ],
        outputs=[test_answer_output, test_diagnostics_output, test_api_status],
    )

@@ -632,11 +774,6 @@ with gr.Blocks() as demo:

            value="HuggingFace",
            info="Select which LLM to use for all questions",
        )
-        eval_enable_fallback_checkbox = gr.Checkbox(
-            label="Enable Fallback",
-            value=True,
-            info="Recommended: Enable fallback for production evaluation",
-        )
        eval_question_limit = gr.Number(
            label="Question Limit (Debug)",
            value=0,

@@ -646,6 +783,15 @@ with gr.Blocks() as demo:

            info="Limit questions for testing (0 = process all)",
        )

        run_button = gr.Button("Run Evaluation & Submit All Answers")

        status_output = gr.Textbox(

@@ -660,8 +806,8 @@ with gr.Blocks() as demo:

        fn=run_and_submit_all,
        inputs=[
            eval_llm_provider_dropdown,
-            eval_enable_fallback_checkbox,
            eval_question_limit,
        ],
        outputs=[status_output, results_table, export_output],
    )

import os
import gradio as gr
import requests
import pandas as pd
import logging
import json
import time
+from pathlib import Path
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stage 1: Import GAIAAgent (LangGraph-based agent)
from src.agent import GAIAAgent

# Import ground truth comparison
+
from src.utils.ground_truth import get_ground_truth

# Configure logging

        result_dict = {
            "task_id": result.get("Task ID", "N/A"),
            "question": result.get("Question", "N/A"),
+            "system_error": result.get("System Error", "no"),
            "submitted_answer": result.get("Submitted Answer", "N/A"),
        }

+        # Add error log if system error
+        if result.get("System Error") == "yes" and result.get("Error Log"):
+            result_dict["error_log"] = result.get("Error Log")
+
        # Add correctness if available
        if result.get("Correct?"):
            result_dict["correct"] = (

    return "\n".join(diagnostics)


+def download_task_file(
+    task_id: str, file_name: str, save_dir: str = "_cache/gaia_files/"
+):
+    """Download file attached to a GAIA question from the GAIA dataset on HuggingFace.
+
+    The evaluation API's /files/{task_id} endpoint returns 404 because files are not
+    hosted there. Files must be downloaded from the official GAIA dataset instead.
+
+    Files are cached in _cache/ directory (runtime cache, not in git).
+
+    Args:
+        task_id: Question's task_id (used for logging)
+        file_name: Original file name from API (e.g., "task_id.png")
+        save_dir: Directory to save file (created if not exists)
+
+    Returns:
+        File path if downloaded successfully, None if download failed
+    """
+    import shutil
+    from huggingface_hub import hf_hub_download
+    import tempfile
+
+    # GAIA dataset file structure: 2023/validation/{task_id}.{ext}
+    # Extract file extension from file_name
+    _, ext = os.path.splitext(file_name)
+    ext = ext.lower()
+
+    # Try validation set first (most questions are from validation)
+    repo_id = "gaia-benchmark/GAIA"
+    possible_paths = [
+        f"2023/validation/{task_id}{ext}",
+        f"2023/test/{task_id}{ext}",
+    ]
+
+    # Create save directory if not exists (relative to script location)
+    # Use script's directory as base to ensure paths work in all environments (local, HF Space)
+    script_dir = Path(__file__).parent.absolute()
+    cache_dir = script_dir / save_dir
+    cache_dir.mkdir(exist_ok=True, parents=True)
+    target_path = str(cache_dir / file_name)
+
+    # Check if file already exists in cache (use absolute path for check)
+    if os.path.exists(target_path):
+        logger.info(f"Using cached file for {task_id}: {target_path}")
+        return target_path
+
+    # Try each possible path
+    for dataset_path in possible_paths:
+        try:
+            logger.info(f"Attempting to download {dataset_path} from GAIA dataset...")
+
+            # Download to temp dir first to get the file
+            with tempfile.TemporaryDirectory() as temp_dir:
+                downloaded_path = hf_hub_download(
+                    repo_id=repo_id,
+                    filename=dataset_path,
+                    repo_type="dataset",
+                    local_dir=temp_dir,
+                )
+
+                # Copy file to target location (flat structure in cache)
+                shutil.copy(downloaded_path, target_path)
+
+            logger.info(f"Downloaded file for {task_id}: {target_path}")
+            return target_path
+
+        except Exception as e:
+            logger.debug(f"Path {dataset_path} not found: {e}")
+            continue
+
+    logger.warning(f"File not found in GAIA dataset for task {task_id}")
+    return None
+
+
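The path-candidate logic above is easy to check in isolation. A minimal sketch under the same assumptions (the helper name `candidate_paths` is ours, not part of the codebase):

```python
import os

def candidate_paths(task_id: str, file_name: str) -> list[str]:
    """Derive the GAIA dataset paths tried by download_task_file:
    2023/{split}/{task_id}{ext}, validation split before test split."""
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()  # API file names may carry upper-case extensions
    return [
        f"2023/validation/{task_id}{ext}",
        f"2023/test/{task_id}{ext}",
    ]
```

Note the extension is taken from the API's `file_name`, not from the `task_id`, so mixed-case suffixes like `.PNG` still resolve.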
+
def test_single_question(question: str, llm_provider: str):
|
| 285 |
"""Test agent with a single question and return diagnostics."""
|
| 286 |
if not question or not question.strip():
|
| 287 |
return "Please enter a question.", "", check_api_keys()
|
|
|
|
| 289 |
try:
|
| 290 |
# Set LLM provider from UI selection (overrides .env)
|
| 291 |
os.environ["LLM_PROVIDER"] = llm_provider.lower()
|
|
|
|
| 292 |
|
| 293 |
+
logger.info(f"UI Config: LLM_PROVIDER={llm_provider}")
|
|
|
|
|
|
|
| 294 |
|
| 295 |
# Initialize agent
|
| 296 |
agent = GAIAAgent()
|
|
|
|
| 302 |
final_state = agent.last_state or {}
|
| 303 |
|
| 304 |
# Format diagnostics with LLM provider info
|
| 305 |
+
provider_info = f"**LLM Provider:** {llm_provider}\n\n"
|
| 306 |
diagnostics = provider_info + format_diagnostics(final_state)
|
| 307 |
api_status = check_api_keys()
|
| 308 |
|
|
|
|
# Stage 6: Async processing with ThreadPoolExecutor


+def a_determine_status(answer: str) -> tuple[bool, str | None]:
+    """Determine if response is system error or AI answer.
+
+    Returns:
+        (is_system_error, error_log)
+        - is_system_error: True if system error, False if AI answer
+        - error_log: Full error message if system error, None otherwise
+    """
+    if not answer:
+        return True, "Empty answer"
+
+    answer_lower = answer.lower().strip()
+
+    # System/technical errors from our code
+    if answer_lower.startswith("error:") or "no evidence collected" in answer_lower:
+        return True, answer  # Full error message as log
+
+    # Everything else is an AI response (including "Unable to answer")
+    return False, None
+
+
def process_single_question(agent, item, index, total):
    """Process single question with agent, return result with error handling.
+    Supports file attachments - downloads files before processing.

    Args:
        agent: GAIAAgent instance
+        item: Question item dict with task_id, question, and optional file_name
        index: Question index (0-based)
        total: Total number of questions

    """
    task_id = item.get("task_id")
    question_text = item.get("question")
+    file_name = item.get("file_name")

    if not task_id or question_text is None:
+        answer = "ERROR: Missing task_id or question"
+        is_error, error_log = a_determine_status(answer)
        return {
            "task_id": task_id,
            "question": question_text,
+            "answer": answer,
+            "system_error": "yes" if is_error else "no",
+            "error_log": error_log,
            "error": True,
        }

+    # Download file if question has attachment
+    file_path = None
+    if file_name:
+        file_path = download_task_file(task_id, file_name)
+        if file_path:
+            logger.info(f"[{index + 1}/{total}] File downloaded: {file_path}")
+        else:
+            logger.warning(f"[{index + 1}/{total}] File expected but not downloaded")
+
    try:
        logger.info(f"[{index + 1}/{total}] Processing {task_id[:8]}...")
+
+        # Pass file_path to agent if available
+        submitted_answer = agent(question_text, file_path=file_path)
+
        logger.info(f"[{index + 1}/{total}] Completed {task_id[:8]}")

+        is_error, error_log = a_determine_status(submitted_answer)
        return {
            "task_id": task_id,
            "question": question_text,
            "answer": submitted_answer,
+            "system_error": "yes" if is_error else "no",
+            "error_log": error_log,
            "error": False,
        }
    except Exception as e:
        logger.error(f"[{index + 1}/{total}] Error {task_id[:8]}: {e}")
+        answer = f"ERROR: {str(e)}"
+        is_error, error_log = a_determine_status(answer)
        return {
            "task_id": task_id,
            "question": question_text,
+            "answer": answer,
+            "system_error": "yes" if is_error else "no",
+            "error_log": error_log,
            "error": True,
        }

def run_and_submit_all(
    llm_provider: str,
    question_limit: int = 0,
+    task_ids: str = "",
    profile: gr.OAuthProfile | None = None,
):
    """

    Args:
        llm_provider: LLM provider to use
        question_limit: Limit number of questions (0 = process all)
+        task_ids: Comma-separated task IDs to target (overrides question_limit)
        profile: OAuth profile for HF login
    """
    # Start execution timer

    # Set LLM provider from UI selection (overrides .env)
    os.environ["LLM_PROVIDER"] = llm_provider.lower()
+    logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")

    # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
    try:

            f"DEBUG MODE: Processing only {limit} questions (set to 0 to process all)"
        )

+        # Filter by specific task IDs if provided (overrides question limit)
+        if task_ids and task_ids.strip():
+            target_ids = [tid.strip() for tid in task_ids.split(",")]
+            original_count = len(questions_data)
+            questions_data = [
+                q for q in questions_data if q.get("task_id") in target_ids
+            ]
+            found_ids = [q.get("task_id") for q in questions_data]
+            missing_ids = set(target_ids) - set(found_ids)
+
+            if missing_ids:
+                logger.warning(f"Task IDs not found: {missing_ids}")
+
+            logger.warning(
+                f"DEBUG MODE: Targeted {len(questions_data)}/{original_count} questions by task_id"
+            )
+            print(
+                f"DEBUG MODE: Processing {len(questions_data)} targeted questions "
+                f"({len(missing_ids)} IDs not found: {missing_ids})"
+            )
+
        print(f"Processing {len(questions_data)} questions.")
    except requests.exceptions.RequestException as e:
        print(f"Error fetching questions: {e}")

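The targeting logic above is pure list manipulation, so it can be factored out and unit-tested. A sketch (the helper name `filter_by_task_ids` is illustrative, not from the codebase):

```python
def filter_by_task_ids(questions: list[dict], task_ids_csv: str) -> tuple[list[dict], set[str]]:
    """Keep only questions whose task_id appears in the comma-separated
    input; also report requested IDs that were not found."""
    target_ids = [tid.strip() for tid in task_ids_csv.split(",") if tid.strip()]
    filtered = [q for q in questions if q.get("task_id") in target_ids]
    missing = set(target_ids) - {q.get("task_id") for q in filtered}
    return filtered, missing
```

Reporting the missing IDs matters in practice: a silent empty result after a typo in one task ID would look like an API failure.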
        result_entry = {
            "Task ID": result["task_id"],
            "Question": result["question"],
+            "System Error": result.get("system_error", "no"),
+            "Submitted Answer": ""
+            if result.get("system_error") == "yes"
+            else result["answer"],
        }

+        # Add error log if system error
+        if result.get("system_error") == "yes" and result.get("error_log"):
+            result_entry["Error Log"] = result["error_log"]
+
        # Add ground truth data if available
        if is_correct is not None:
            result_entry["Correct?"] = "✅ Yes" if is_correct else "❌ No"

        results_log.append(result_entry)

+        # Add to submission payload if no system error
+        if result.get("system_error") == "no":
            answers_payload.append(
                {"task_id": result["task_id"], "submitted_answer": result["answer"]}
            )

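The submission gate above can be sketched as a small pure function over the per-question results (the name `build_answers_payload` is ours; the real code builds the list inline):

```python
def build_answers_payload(results: list[dict]) -> list[dict]:
    """Only answers without a system error are submitted; error rows stay
    in the results log but are withheld from the scoring payload."""
    return [
        {"task_id": r["task_id"], "submitted_answer": r["answer"]}
        for r in results
        if r.get("system_error") == "no"
    ]
```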
            value="HuggingFace",
            info="Select which LLM to use for this test",
        )

        test_button = gr.Button("Run Test", variant="primary")

        inputs=[
            test_question_input,
            llm_provider_dropdown,
        ],
        outputs=[test_answer_output, test_diagnostics_output, test_api_status],
    )

            value="HuggingFace",
            info="Select which LLM to use for all questions",
        )
        eval_question_limit = gr.Number(
            label="Question Limit (Debug)",
            value=0,

            info="Limit questions for testing (0 = process all)",
        )

+        with gr.Row():
+            eval_task_ids = gr.Textbox(
+                label="Target Task IDs (Debug)",
+                value="",
+                placeholder="task_id1, task_id2, ...",
+                info="Comma-separated task IDs to run (overrides question limit)",
+                lines=1,
+            )
+
        run_button = gr.Button("Run Evaluation & Submit All Answers")

        status_output = gr.Textbox(

        fn=run_and_submit_all,
        inputs=[
            eval_llm_provider_dropdown,
            eval_question_limit,
+            eval_task_ids,
        ],
        outputs=[status_output, results_table, export_output],
    )

@@ -0,0 +1,120 @@
# GAIA Submission Guide

## Two Different Leaderboards

### 1. Course Leaderboard (CURRENT - Course Assignment)

**API Endpoint:** `https://agents-course-unit4-scoring.hf.space`

**Purpose:** Hugging Face Agents Course Unit 4 assignment

**Dataset:** 20 questions from the GAIA validation set (level 1), filtered by tools/steps complexity

**Target Score:** 30% = **6/20 correct**

**API Routes:**
- `GET /questions` - Retrieve full list of evaluation questions
- `GET /random-question` - Fetch single random question
- `GET /files/{task_id}` - Download file associated with task
- `POST /submit` - Submit answers for scoring

**Submission Format:**
```json
{
  "username": "your-hf-username",
  "agent_code": "https://huggingface.co/spaces/your-username/your-space/tree/main",
  "answers": [
    {"task_id": "...", "submitted_answer": "..."}
  ]
}
```

**Scoring:** EXACT MATCH with ground truth
- Answer should be plain text, NO "FINAL ANSWER:" prefix
- Answer should be precise and well-formatted

**Debugging Features (Course-Specific):**
- ✅ "Target Task IDs" - Run specific questions for debugging
- ✅ "Question Limit" - Run first N questions for testing
- ✅ Course API is forgiving for development iteration

**Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/gaia-leaderboard

---

### 2. Official GAIA Leaderboard (FUTURE - Not Yet Implemented)

**Space:** https://huggingface.co/spaces/gaia-benchmark/leaderboard

**Purpose:** Official GAIA benchmark for the AI research community

**Dataset:** Full GAIA benchmark (450+ questions across 3 levels)

**Submission Format:** File upload (JSON) with model metadata
- Model name, family, parameters
- Complete answers for ALL questions
- Different evaluation process

**Status:** ⚠️ **FUTURE DEVELOPMENT** - Not implemented in this template

**Differences from Course:**

| Aspect | Course | Official GAIA |
|--------|--------|--------------|
| Dataset Size | 20 questions | 450+ questions |
| Submission Method | API POST | File upload |
| Question Filtering | Allowed for debugging | Must submit ALL |
| Scoring | Exact match | TBC (likely more flexible) |

**Documentation:** https://huggingface.co/datasets/gaia-benchmark/GAIA

---

## Implementation Notes

### Current Implementation Status

**✅ Implemented:**
- Course API integration (`/questions`, `/submit`, `/files/{task_id}`)
- Agent execution with LangGraph StateGraph
- OAuth login integration
- Debug features (Target Task IDs, Question Limit)
- Results export (JSON format)

**⚠️ Course Constraints:**
- Only 20 level 1 questions
- Exact match scoring (strict)
- Agent code must be public

**🔮 Future Work (Official GAIA):**
- File-based submission format
- Full 450+ question support
- Leaderboard-specific metadata
- Official evaluation pipeline

---

## Development Workflow

### For Course Assignment:

1. **Develop:** Use "Target Task IDs" to test specific questions
2. **Debug:** Use "Question Limit" for quick iteration
3. **Test:** Run full evaluation on all 20 questions
4. **Submit:** Course API evaluates exact match score
5. **Iterate:** Improve prompts, tools, reasoning

### For Official GAIA (Future):

1. **Generate:** Create submission JSON with all 450+ answers
2. **Format:** Follow official GAIA format requirements
3. **Upload:** Submit via gaia-benchmark/leaderboard Space
4. **Evaluate:** Official benchmark evaluation

---

## References

- **Course Documentation:** https://huggingface.co/learn/agents-course/en/unit4/hands-on
- **Course Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/gaia-leaderboard
- **Official GAIA Dataset:** https://huggingface.co/datasets/gaia-benchmark/GAIA
- **Official GAIA Leaderboard:** https://huggingface.co/spaces/gaia-benchmark/leaderboard
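The course submission body from the guide can be assembled with a few lines of Python. A sketch, assuming only the payload shape documented above (the helper name `build_submission` is ours):

```python
import json

API_URL = "https://agents-course-unit4-scoring.hf.space"

def build_submission(username: str, space_url: str, answers: list[dict]) -> str:
    """Serialize the JSON body expected by POST /submit on the course API."""
    body = {
        "username": username,
        "agent_code": space_url,
        "answers": answers,
    }
    # To submit for real: requests.post(f"{API_URL}/submit", json=body)
    return json.dumps(body)
```

Because scoring is exact match, whatever is placed in `submitted_answer` should already be the bare final answer, with no "FINAL ANSWER:" prefix or surrounding prose.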
@@ -15,6 +15,7 @@ Based on:

import logging
import os
from typing import TypedDict, List, Optional
from langgraph.graph import StateGraph, END
from src.config import Settings

@@ -100,9 +101,12 @@ def validate_environment() -> List[str]:

# ============================================================================


-def fallback_tool_selection(
    """
    MVP Fallback: Simple keyword-based tool selection when LLM fails.

    This is a temporary hack to get basic functionality working.
    Uses simple keyword matching to select tools.

@@ -110,6 +114,7 @@ def fallback_tool_selection(question: str, plan: str) -> List[dict]:

    Args:
        question: The user question
        plan: The execution plan

    Returns:
        List of tool calls with basic parameters

@@ -147,17 +152,37 @@ def fallback_tool_selection(question: str, plan: str) -> List[dict]:

        })
        logger.info(f"[fallback_tool_selection] Added calculator tool with expression: {expression}")

-    # File tool:
-
-
-

    # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
    image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
    if any(keyword in combined for keyword in image_keywords):
-
-

    if not tool_calls:
        logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")

@@ -256,7 +281,10 @@ def execute_node(state: AgentState) -> AgentState:

    # Stage 3: Use LLM function calling to select tools and extract parameters
    logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
    tool_calls = select_tools_with_function_calling(
-        question=state["question"],
    )

    # Validate tool_calls result

@@ -264,13 +292,17 @@ def execute_node(state: AgentState) -> AgentState:

        logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
        state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
        # MVP HACK: Use fallback keyword-based tool selection
-        tool_calls = fallback_tool_selection(
        logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
    elif not isinstance(tool_calls, list):
        logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
        state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
        # MVP HACK: Use fallback
-        tool_calls = fallback_tool_selection(
    else:
        logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
        logger.debug(f"[execute_node] Tool calls: {tool_calls}")

@@ -305,8 +337,32 @@ def execute_node(state: AgentState) -> AgentState:

        }
    )

-    # Extract evidence
-

    except Exception as tool_error:
        logger.error(f"[execute_node] ✗ Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)

@@ -342,7 +398,9 @@ def execute_node(state: AgentState) -> AgentState:

    if not tool_calls:
        logger.info(f"[execute_node] Attempting fallback after exception...")
        try:
-            tool_calls = fallback_tool_selection(
            logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")

            # Try to execute fallback tools

@@ -367,7 +425,28 @@ def execute_node(state: AgentState) -> AgentState:

                "result": result,
                "status": "success"
            })
-            evidence
            logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
        except Exception as tool_error:
            logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")

@@ -504,22 +583,26 @@ class GAIAAgent:

        self.last_state = None  # Store last execution state for diagnostics
        print("GAIAAgent initialized successfully")

-    def __call__(self, question: str) -> str:
        """
        Process question and return answer.

        Args:
            question: GAIA question text

        Returns:
            Factoid answer string
        """
        print(f"GAIAAgent processing question (first 50 chars): {question[:50]}...")

        # Initialize state
        initial_state: AgentState = {
            "question": question,
-            "file_paths": None,
            "plan": None,
            "tool_calls": [],
            "tool_results": [],

import logging
import os
+from pathlib import Path
from typing import TypedDict, List, Optional
from langgraph.graph import StateGraph, END
from src.config import Settings

# ============================================================================


+def fallback_tool_selection(
+    question: str, plan: str, file_paths: Optional[List[str]] = None
+) -> List[dict]:
    """
    MVP Fallback: Simple keyword-based tool selection when LLM fails.
+    Enhanced to use actual file paths when available.

    This is a temporary hack to get basic functionality working.
    Uses simple keyword matching to select tools.

    Args:
        question: The user question
        plan: The execution plan
+        file_paths: Optional list of downloaded file paths

    Returns:
        List of tool calls with basic parameters

        })
        logger.info(f"[fallback_tool_selection] Added calculator tool with expression: {expression}")

+    # File tool: if file_paths available, use them
+    if file_paths:
+        for file_path in file_paths:
+            # Determine file type and appropriate tool
+            file_ext = Path(file_path).suffix.lower()
+            if file_ext in ['.png', '.jpg', '.jpeg']:
+                tool_calls.append({
+                    "tool": "vision",
+                    "params": {"image_path": file_path}
+                })
+                logger.info(f"[fallback_tool_selection] Added vision tool for image: {file_path}")
+            elif file_ext in ['.pdf', '.xlsx', '.xls', '.csv', '.json', '.txt', '.docx', '.doc']:
+                tool_calls.append({
+                    "tool": "parse_file",
+                    "params": {"file_path": file_path}
+                })
+                logger.info(f"[fallback_tool_selection] Added parse_file tool for: {file_path}")
+    else:
+        # Keyword-based file detection (legacy)
+        file_keywords = ["file", "parse", "read", "csv", "json", "txt", "document"]
+        if any(keyword in combined for keyword in file_keywords):
+            logger.warning("[fallback_tool_selection] File operation detected but no file_paths available")

    # Image tool: keywords like "image", "picture", "photo", "analyze", "vision"
    image_keywords = ["image", "picture", "photo", "analyze image", "vision"]
    if any(keyword in combined for keyword in image_keywords):
+        if file_paths:
+            # Already handled above in file_paths check
+            pass
+        else:
+            logger.warning("[fallback_tool_selection] Image operation detected but no file_paths available")

    if not tool_calls:
        logger.warning("[fallback_tool_selection] No tools selected by fallback - adding default search")

|
     # Stage 3: Use LLM function calling to select tools and extract parameters
     logger.info(f"[execute_node] Calling select_tools_with_function_calling()...")
     tool_calls = select_tools_with_function_calling(
+        question=state["question"],
+        plan=state["plan"],
+        available_tools=TOOLS,
+        file_paths=state.get("file_paths"),
     )
 
     # Validate tool_calls result
 
         logger.warning(f"[execute_node] ⚠ LLM returned empty tool_calls list - using fallback")
         state["errors"].append("Tool selection returned no tools - using fallback keyword matching")
         # MVP HACK: Use fallback keyword-based tool selection
+        tool_calls = fallback_tool_selection(
+            state["question"], state["plan"], state.get("file_paths")
+        )
         logger.info(f"[execute_node] Fallback returned {len(tool_calls)} tool(s)")
     elif not isinstance(tool_calls, list):
         logger.error(f"[execute_node] ✗ Invalid tool_calls type: {type(tool_calls)} - using fallback")
         state["errors"].append(f"Tool selection returned invalid type: {type(tool_calls)} - using fallback")
         # MVP HACK: Use fallback
+        tool_calls = fallback_tool_selection(
+            state["question"], state["plan"], state.get("file_paths")
+        )
     else:
         logger.info(f"[execute_node] ✓ LLM selected {len(tool_calls)} tool(s)")
         logger.debug(f"[execute_node] Tool calls: {tool_calls}")
                 }
             )
 
+            # Extract evidence - handle different result formats
+            if isinstance(result, dict):
+                # Vision tool returns {"answer": "..."}
+                if "answer" in result:
+                    evidence.append(result["answer"])
+                # Search tools return {"results": [...], "source": "...", "query": "..."}
+                elif "results" in result:
+                    # Format search results as readable text
+                    results_list = result.get("results", [])
+                    if results_list:
+                        # Take first 3 results and format them
+                        formatted = []
+                        for r in results_list[:3]:
+                            title = r.get("title", "")[:100]
+                            url = r.get("url", "")[:100]
+                            snippet = r.get("snippet", "")[:200]
+                            formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+                        evidence.append("\n\n".join(formatted))
+                    else:
+                        evidence.append(str(result))
+                else:
+                    evidence.append(str(result))
+            elif isinstance(result, str):
+                evidence.append(result)
+            else:
+                evidence.append(str(result))
 
         except Exception as tool_error:
             logger.error(f"[execute_node] ✗ Tool {tool_name} failed: {type(tool_error).__name__}: {str(tool_error)}", exc_info=True)
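The evidence-formatting branch added above can be read as a pure function over the two tool result contracts it names ({"answer": ...} from vision, {"results": [...]} from search). A minimal sketch, with `extract_evidence` as a hypothetical name for illustration only:

```python
def extract_evidence(result) -> str:
    """Flatten a tool result into a readable evidence string."""
    if isinstance(result, dict):
        if "answer" in result:          # vision-style result
            return result["answer"]
        if "results" in result:         # search-style result
            results_list = result.get("results", [])
            if results_list:
                formatted = []
                for r in results_list[:3]:  # cap at the first 3 hits
                    formatted.append(
                        f"Title: {r.get('title', '')[:100]}\n"
                        f"URL: {r.get('url', '')[:100]}\n"
                        f"Snippet: {r.get('snippet', '')[:200]}"
                    )
                return "\n\n".join(formatted)
        return str(result)              # unknown dict shape
    return result if isinstance(result, str) else str(result)
```

Truncating title/url/snippet keeps the evidence list bounded regardless of how verbose a search backend is.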
         if not tool_calls:
             logger.info(f"[execute_node] Attempting fallback after exception...")
             try:
+                tool_calls = fallback_tool_selection(
+                    state["question"], state.get("plan", ""), state.get("file_paths")
+                )
                 logger.info(f"[execute_node] Fallback after exception returned {len(tool_calls)} tool(s)")
 
                 # Try to execute fallback tools
 
                         "result": result,
                         "status": "success"
                     })
+                    # Extract evidence - handle different result formats
+                    if isinstance(result, dict):
+                        if "answer" in result:
+                            evidence.append(result["answer"])
+                        elif "results" in result:
+                            results_list = result.get("results", [])
+                            if results_list:
+                                formatted = []
+                                for r in results_list[:3]:
+                                    title = r.get("title", "")[:100]
+                                    url = r.get("url", "")[:100]
+                                    snippet = r.get("snippet", "")[:200]
+                                    formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
+                                evidence.append("\n\n".join(formatted))
+                            else:
+                                evidence.append(str(result))
+                        else:
+                            evidence.append(str(result))
+                    elif isinstance(result, str):
+                        evidence.append(result)
+                    else:
+                        evidence.append(str(result))
                     logger.info(f"[execute_node] Fallback tool {tool_name} executed successfully")
                 except Exception as tool_error:
                     logger.error(f"[execute_node] Fallback tool {tool_name} failed: {tool_error}")
         self.last_state = None  # Store last execution state for diagnostics
         print("GAIAAgent initialized successfully")
 
+    def __call__(self, question: str, file_path: Optional[str] = None) -> str:
         """
         Process question and return answer.
+        Supports optional file attachment for file-based questions.
 
         Args:
             question: GAIA question text
+            file_path: Optional path to downloaded file attachment
 
         Returns:
             Factoid answer string
         """
         print(f"GAIAAgent processing question (first 50 chars): {question[:50]}...")
+        if file_path:
+            print(f"GAIAAgent processing file: {file_path}")
 
         # Initialize state
         initial_state: AgentState = {
             "question": question,
+            "file_paths": [file_path] if file_path else None,
             "plan": None,
             "tool_calls": [],
             "tool_results": [],
@@ -158,7 +158,10 @@ def _get_provider_function(function_name: str, provider: str) -> Callable:
 
 def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
     """
-    Call LLM function with configured provider
 
     Args:
         function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
 
@@ -168,55 +171,28 @@ def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
         Result from LLM call
 
     Raises:
-        Exception: If
     """
     # Read config at runtime for UI flexibility
     primary_provider = os.getenv("LLM_PROVIDER", "gemini").lower()
-    enable_fallback = os.getenv("ENABLE_LLM_FALLBACK", "false").lower() == "true"
 
-    #
-
-    # Try primary provider
     try:
         primary_func = _get_provider_function(function_name, primary_provider)
-        logger.info(f"[{function_name}] Using
         return retry_with_backoff(lambda: primary_func(*args, **kwargs))
     except Exception as primary_error:
-        logger.
         )
 
-        # If fallback disabled, raise immediately
-        if not enable_fallback:
-            logger.error(f"[{function_name}] Fallback disabled. Failing fast.")
-            raise Exception(
-                f"{function_name} failed with {primary_provider}: {primary_error}. "
-                f"Fallback disabled (ENABLE_LLM_FALLBACK=false)"
-            )
-
-        # Try fallback providers in order
-        errors = {primary_provider: primary_error}
-        for fallback_provider in fallback_providers:
-            try:
-                fallback_func = _get_provider_function(function_name, fallback_provider)
-                logger.info(
-                    f"[{function_name}] Trying fallback provider: {fallback_provider}"
-                )
-                return retry_with_backoff(lambda: fallback_func(*args, **kwargs))
-            except Exception as fallback_error:
-                errors[fallback_provider] = fallback_error
-                logger.warning(
-                    f"[{function_name}] Fallback provider {fallback_provider} failed: {fallback_error}"
-                )
-                continue
-
-        # All providers failed
-        error_summary = ", ".join([f"{k}: {v}" for k, v in errors.items()])
-        logger.error(f"[{function_name}] All providers failed. {error_summary}")
-        raise Exception(f"{function_name} failed with all providers. {error_summary}")
-
 
 # ============================================================================
 # Client Initialization
@@ -560,7 +536,7 @@ def plan_question(
 
 
 def select_tools_claude(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Claude function calling to select tools and extract parameters."""
     client = create_claude_client()
 
@@ -580,15 +556,28 @@ def select_tools_claude(
         }
     )
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
 
@@ -633,7 +622,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_gemini(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Gemini function calling to select tools and extract parameters."""
     model = create_gemini_client()
 
@@ -665,15 +654,28 @@ def select_tools_gemini(
         )
     )
 
     prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}
 
@@ -718,7 +720,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_hf(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
     client = create_hf_client()
 
@@ -748,15 +750,28 @@ def select_tools_hf(
 
         tools.append(tool_schema)
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
 
@@ -766,7 +781,7 @@ Plan:
 Select and call the tools needed according to the plan. Use exact parameter names from tool schemas."""
 
     logger.info(
-        f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools"
     )
 
     messages = [
 
@@ -807,7 +822,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_groq(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """Use Groq with function calling to select tools and extract parameters."""
     client = create_groq_client()
 
@@ -837,15 +852,28 @@ def select_tools_groq(
 
         tools.append(tool_schema)
 
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
-- "Analyze the image at example.com/pic.jpg" → vision(
-- "What's in the uploaded Excel file?" → parse_file(file_path="
 
 Execute the plan step by step. Extract correct parameters from the question.
 
 Plan:
 {plan}"""
 
@@ -900,7 +928,7 @@ Select and call the tools needed according to the plan. Use exact parameter name
 
 
 def select_tools_with_function_calling(
-    question: str, plan: str, available_tools: Dict[str, Dict]
 ) -> List[Dict[str, Any]]:
     """
     Use LLM function calling to dynamically select tools and extract parameters.
 
@@ -913,11 +941,12 @@ def select_tools_with_function_calling(
         question: GAIA question text
         plan: Execution plan from planning phase
         available_tools: Tool registry
 
     Returns:
         List of tool calls with extracted parameters
     """
-    return _call_with_fallback("select_tools", question, plan, available_tools)
 
 
 # ============================================================================
 
 def _call_with_fallback(function_name: str, *args, **kwargs) -> Any:
     """
+    Call LLM function with configured provider.
+
+    NOTE: Fallback mechanism has been archived to reduce complexity.
+    Only the primary provider is used. If it fails, the error is raised directly.
 
     Args:
         function_name: Base function name ("plan_question", "select_tools", "synthesize_answer")
 
         Result from LLM call
 
     Raises:
+        Exception: If primary provider fails
     """
     # Read config at runtime for UI flexibility
     primary_provider = os.getenv("LLM_PROVIDER", "gemini").lower()
 
+    # ============================================================================
+    # ARCHIVED: Fallback mechanism removed to reduce complexity
+    # Original fallback code was at: dev/dev_260112_02_fallback_archived.md
+    # To restore: Check git history or archived dev file
+    # ============================================================================
 
+    # Try primary provider only (no fallback)
     try:
         primary_func = _get_provider_function(function_name, primary_provider)
+        logger.info(f"[{function_name}] Using provider: {primary_provider}")
         return retry_with_backoff(lambda: primary_func(*args, **kwargs))
     except Exception as primary_error:
+        logger.error(f"[{function_name}] Provider {primary_provider} failed: {primary_error}")
+        raise Exception(
+            f"{function_name} failed with {primary_provider}: {primary_error}"
+        )
 
 
 # ============================================================================
 # Client Initialization
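The primary-only path above still wraps each provider call in `retry_with_backoff`. A minimal sketch of that pattern, under stated assumptions (attempt count, exponential delays, and the function name itself are illustrative; the repo's actual `retry_with_backoff` lives elsewhere and may differ):

```python
import time

def retry_with_backoff(fn, attempts: int = 3, base_delay: float = 0.0):
    """Call fn(), retrying with exponential backoff; re-raise after the last attempt."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            time.sleep(base_delay * (2 ** i))  # 0, base, 2*base, ...
    raise Exception(f"all {attempts} attempts failed: {last_err}")
```

With fallback archived, this retry loop is the only resilience layer left around the provider call, which is exactly why the commit raises the primary error directly once retries are exhausted.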
 
 
 def select_tools_claude(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Claude function calling to select tools and extract parameters."""
     client = create_claude_client()
 
         }
     )
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 
 def select_tools_gemini(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Gemini function calling to select tools and extract parameters."""
     model = create_gemini_client()
 
         )
     )
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}
 
 
 def select_tools_hf(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use HuggingFace Inference API with function calling to select tools and extract parameters."""
     client = create_hf_client()
 
         tools.append(tool_schema)
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 Select and call the tools needed according to the plan. Use exact parameter names from tool schemas."""
 
     logger.info(
+        f"[select_tools_hf] Calling HuggingFace with function calling for {len(tools)} tools, file_paths={file_paths}"
     )
 
     messages = [
 
 
 def select_tools_groq(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """Use Groq with function calling to select tools and extract parameters."""
     client = create_groq_client()
 
         tools.append(tool_schema)
 
+    # File context for tool selection
+    file_context = ""
+    if file_paths:
+        file_context = f"""
+
+IMPORTANT: These files are available for this question:
+{chr(10).join(f"- {fp}" for fp in file_paths)}
+
+When selecting tools, use the ACTUAL file paths listed above. Do NOT use placeholder paths like "<provided_path>" or "path_to_chess_image.jpg".
+For vision tools with images: vision(image_path="<actual_file_path>")
+For file parsing tools: parse_file(file_path="<actual_file_path>")"""
+
     system_prompt = f"""You are a tool selection expert. Based on the question and execution plan, select appropriate tools with correct parameters.
 
 Few-shot examples:
 - "How many albums did The Beatles release?" → web_search(query="Beatles discography number of albums")
 - "What is 25 * 37 + 100?" → calculator(expression="25 * 37 + 100")
+- "Analyze the image at example.com/pic.jpg" → vision(image_path="example.com/pic.jpg")
+- "What's in the uploaded Excel file?" → parse_file(file_path="actual_file.xlsx")
 
 Execute the plan step by step. Extract correct parameters from the question.
+Use actual file paths when files are provided.{file_context}
 
 Plan:
 {plan}"""
 
 
 def select_tools_with_function_calling(
+    question: str, plan: str, available_tools: Dict[str, Dict], file_paths: Optional[List[str]] = None
 ) -> List[Dict[str, Any]]:
     """
     Use LLM function calling to dynamically select tools and extract parameters.
 
         question: GAIA question text
         plan: Execution plan from planning phase
         available_tools: Tool registry
+        file_paths: Optional list of downloaded file paths for file-based questions
 
     Returns:
         List of tool calls with extracted parameters
     """
+    return _call_with_fallback("select_tools", question, plan, available_tools, file_paths)
 
 
 # ============================================================================
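The same file-context block is duplicated verbatim in all four provider functions above. A sketch of how it could be factored into one helper (the helper name `build_file_context` is hypothetical; the wording mirrors the prompt text in the diff):

```python
from typing import List, Optional

def build_file_context(file_paths: Optional[List[str]]) -> str:
    """Render the shared file-availability block appended to tool-selection prompts."""
    if not file_paths:
        return ""
    listing = "\n".join(f"- {fp}" for fp in file_paths)
    return (
        "\n\nIMPORTANT: These files are available for this question:\n"
        f"{listing}\n\n"
        "When selecting tools, use the ACTUAL file paths listed above. "
        "Do NOT use placeholder paths."
    )
```

Deduplicating this way would keep the anti-placeholder instruction identical across Claude, Gemini, HF, and Groq prompts as the wording evolves.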
@@ -93,20 +93,33 @@ def timeout(seconds: int):
 
     Raises:
         TimeoutError: If execution exceeds timeout
     """
     def timeout_handler(signum, frame):
         raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
 
-    old_handler = signal.signal(signal.SIGALRM, timeout_handler)
-    signal.alarm(seconds)
-
 
     try:
         yield
     finally:
         # Restore old handler and cancel alarm
-        signal.alarm(0)
-        signal.signal(signal.SIGALRM, old_handler)
 
 
 # ============================================================================
 
     Raises:
         TimeoutError: If execution exceeds timeout
+
+    Note:
+        signal.alarm() only works in main thread. In threaded contexts
+        (Gradio, ThreadPoolExecutor), timeout protection is disabled.
     """
     def timeout_handler(signum, frame):
         raise TimeoutError(f"Evaluation exceeded {seconds} second timeout")
 
+    try:
+        # Set signal handler (only works in main thread)
+        old_handler = signal.signal(signal.SIGALRM, timeout_handler)
+        signal.alarm(seconds)
+        _alarm_set = True
+    except (ValueError, AttributeError):
+        # ValueError: signal.alarm() in non-main thread
+        # AttributeError: signal.SIGALRM not available (Windows)
+        logger.warning(f"Timeout protection disabled (threading/Windows limitation)")
+        _alarm_set = False
+        old_handler = None
 
     try:
         yield
     finally:
         # Restore old handler and cancel alarm
+        if _alarm_set and old_handler is not None:
+            signal.alarm(0)
+            signal.signal(signal.SIGALRM, old_handler)
 
 
 # ============================================================================
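The calculator fix above catches the `ValueError` that `signal.signal()` raises outside the main thread. An equivalent stand-alone sketch that checks the thread explicitly instead of catching (POSIX-only for the alarm path; `safe_timeout` is an illustrative name, and the diff's logger call is omitted here):

```python
import signal
import threading
from contextlib import contextmanager

@contextmanager
def safe_timeout(seconds: int):
    """Yield True if alarm-based timeout protection is active, False otherwise."""
    def handler(signum, frame):
        raise TimeoutError(f"exceeded {seconds}s timeout")

    alarm_set = False
    old_handler = None
    # signal.signal/alarm are only legal in the main thread, and SIGALRM
    # does not exist on Windows, so degrade gracefully in both cases.
    if threading.current_thread() is threading.main_thread() and hasattr(signal, "SIGALRM"):
        old_handler = signal.signal(signal.SIGALRM, handler)
        signal.alarm(seconds)
        alarm_set = True
    try:
        yield alarm_set
    finally:
        if alarm_set:
            signal.alarm(0)                           # cancel pending alarm
            signal.signal(signal.SIGALRM, old_handler)  # restore previous handler
```

Either style (check-first or catch) leaves worker threads — the case Gradio hits — running without a timeout, which is the documented trade-off of this fix.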
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""
+Quick test script for specific GAIA questions.
+Use this to verify fixes without running full evaluation.
+
+Usage:
+    uv run python test/test_quick_fixes.py
+"""
+
+import os
+import sys
+
+# Add project root to path
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from src.agent.graph import GAIAAgent
+from dotenv import load_dotenv
+
+# Load environment variables
+load_dotenv()
+
+# ============================================================================
+# CONFIG - Questions to test
+# ============================================================================
+
+TEST_QUESTIONS = [
+    {
+        "task_id": "2d83110e-a098-4ebb-9987-066c06fa42d0",
+        "name": "Reverse sentence (calculator threading fix)",
+        "question": ".rewsna eht sa \"tfel\" drow eht fo etisoppo eht etirw ,ecnetnes siht dnatsrednu uoy fI",
+        "expected": "Right",
+    },
+    {
+        "task_id": "6f37996b-2ac7-44b0-8e68-6d28256631b4",
+        "name": "Table commutativity (LLM issue - table in question)",
+        "question": '''Given this table defining * on the set S = {a, b, c, d, e}
+
+|*|a|b|c|d|e|
+|---|---|---|---|
+|a|a|b|c|b|d|
+|b|b|c|a|e|c|
+|c|c|a|b|b|a|
+|d|b|e|b|e|d|
+|e|d|b|a|d|c|
+
+provide the subset of S involved in any possible counter-examples that prove * is not commutative. Provide your answer as a comma separated list of the elements in the set in alphabetical order.''',
+        "expected": "b, e",
+    },
+]
+
+# ============================================================================
+
+
+def test_question(agent: GAIAAgent, test_case: dict) -> dict:
+    """Test a single question and return result."""
+    task_id = test_case["task_id"]
+    question = test_case["question"]
+    expected = test_case.get("expected", "N/A")
+
+    print(f"\n{'='*60}")
+    print(f"Testing: {test_case['name']}")
+    print(f"Task ID: {task_id}")
+    print(f"Expected: {expected}")
+    print(f"{'='*60}")
+
+    try:
+        answer = agent(question, file_path=None)
+
+        # Check if answer matches expected
+        is_correct = answer.strip().lower() == expected.lower().strip()
+
+        result = {
+            "task_id": task_id,
+            "name": test_case["name"],
+            "question": question[:100] + "..." if len(question) > 100 else question,
+            "expected": expected,
+            "answer": answer,
+            "correct": is_correct,
+            "status": "success",
+        }
+
+        # Determine system error
+        if not answer:
+            result["system_error"] = "yes"
+        elif answer.lower().startswith("error:") or "no evidence collected" in answer.lower():
+            result["system_error"] = "yes"
+            result["error_log"] = answer
+        else:
+            result["system_error"] = "no"
+
+    except Exception as e:
+        result = {
+            "task_id": task_id,
+            "name": test_case["name"],
+            "question": question[:100] + "..." if len(question) > 100 else question,
+            "expected": expected,
+            "answer": f"ERROR: {str(e)}",
+            "correct": False,
+            "status": "error",
+            "system_error": "yes",
+            "error_log": str(e),
+        }
+
+    # Print result
+    status_icon = "✅" if result["correct"] else "❌" if result["system_error"] == "no" else "⚠️"
+    print(f"\n{status_icon} Result: {result['answer'][:100]}")
+    if result["system_error"] == "yes":
+        print(f"  System Error: Yes")
+        if result.get("error_log"):
+            print(f"  Error: {result['error_log'][:100]}")
+
+    return result
+
+
+def main():
+    """Run quick tests on specific questions."""
+    print("\n" + "="*60)
+    print("GAIA Quick Test - Verify Fixes")
+    print("="*60)
+
+    # Check LLM provider
+    llm_provider = os.getenv("LLM_PROVIDER", "gemini")
+    print(f"\nLLM Provider: {llm_provider}")
+
+    # Initialize agent
+    print("\nInitializing agent...")
+    try:
+        agent = GAIAAgent()
+        print("✅ Agent initialized")
+    except Exception as e:
+        print(f"❌ Agent initialization failed: {e}")
+        return
+
+    # Run tests
+    results = []
+    for test_case in TEST_QUESTIONS:
+        result = test_question(agent, test_case)
+        results.append(result)
+
+    # Summary
+    print(f"\n{'='*60}")
+    print("SUMMARY")
+    print(f"{'='*60}")
+
+    success_count = sum(1 for r in results if r["correct"])
+    error_count = sum(1 for r in results if r["system_error"] == "yes")
+    ai_fail_count = sum(1 for r in results if r["system_error"] == "no" and not r["correct"])
+
+    print(f"\nTotal: {len(results)}")
+    print(f"✅ Correct: {success_count}")
+    print(f"⚠️  System Errors: {error_count}")
+    print(f"❌ AI Wrong: {ai_fail_count}")
+
+    # Detailed results
+    print(f"\nDetailed Results:")
+    for r in results:
+        status = "✅" if r["correct"] else "⚠️" if r["system_error"] == "yes" else "❌"
+        print(f"  {status} {r['name']}: {r['answer'][:50]}{'...' if len(r['answer']) > 50 else ''}")
+
+
+if __name__ == "__main__":
+    main()
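The script's three-way triage (correct / system error / AI wrong) can be isolated as a pure function, which makes the boolean yes/no system-error field from this commit easy to unit-test without an agent. A sketch under stated assumptions (`triage` is a hypothetical name; the heuristics mirror `test_question` above):

```python
def triage(answer: str, expected: str) -> dict:
    """Classify one run: exact-match grading plus the system-error heuristic."""
    correct = answer.strip().lower() == expected.strip().lower()
    # System error: empty answer, explicit error prefix, or no evidence collected.
    if not answer or answer.lower().startswith("error:") or "no evidence collected" in answer.lower():
        system_error = "yes"
    else:
        system_error = "no"
    return {"correct": correct, "system_error": system_error}
```

An answer can be both non-erroneous and wrong ("AI Wrong" in the summary), which is exactly the bucket the summary counts with `system_error == "no" and not correct`.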