mangubee Claude committed on
Commit b68b317 · 1 Parent(s): a1d5de5

feat: add transcript caching and upgrade synthesis model


- Add save_transcript_to_cache() to save transcripts for debugging
- Upgrade LLM: Qwen 2.5 → openai/gpt-oss-120b (Scaleway)
- Document HF provider suffix behavior (auto-routing is bad practice)
- Model iteration: Qwen 2.5 → Llama 3.3 → gpt-oss-120b

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (3)
  1. CHANGELOG.md +249 -69
  2. src/agent/llm_client.py +3 -2
  3. src/tools/youtube.py +37 -0
CHANGELOG.md CHANGED
@@ -1,5 +1,134 @@
1
  # Session Changelog
2
 
3
  ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
4
 
5
  **Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
@@ -7,13 +136,15 @@
7
  **Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
8
 
9
  **Modified Files:**
 
10
  - **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
11
  - **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
12
- - **src/tools/__init__.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
13
  - **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
14
  - **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
15
 
16
  **Key Technical Decisions:**
 
17
  - **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
18
  - **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
19
  - **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
@@ -21,15 +152,11 @@
21
  - **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
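The primary/fallback decision above can be sketched with the two fetchers injected as callables, so the routing is testable without network access. This is an illustration only: the real `youtube_transcript` tool in src/tools/youtube.py calls youtube-transcript-api and Whisper directly rather than taking them as parameters.

```python
# Sketch of the transcript routing described above. The callables stand in for
# youtube-transcript-api (fast path) and yt-dlp + Whisper (slow fallback);
# injecting them is an illustration choice, not how src/tools/youtube.py is written.
from typing import Callable, Tuple

def get_transcript(
    video_id: str,
    fetch_api: Callable[[str], str],            # youtube-transcript-api, ~1-3 s
    transcribe_fallback: Callable[[str], str],  # yt-dlp + Whisper, 30 s - 2 min
) -> Tuple[str, str]:
    """Return (text, source), where source is 'api' or 'whisper'."""
    try:
        return fetch_api(video_id), "api"
    except Exception:
        # The API path fails on roughly 8% of videos (no captions, region
        # locks), per the success rate above; fall back to audio transcription.
        return transcribe_fallback(video_id), "whisper"
```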
22
 
23
  **Expected Impact:**
 
24
  - Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
25
  - Score: 10% → 20% (2/20 → 4/20 correct)
26
  - **Target:** 30% requirement (6/20) still needs two more correct answers
27
 
28
- **Next Steps:**
29
- - Test on question #3 (bird species)
30
- - Run full evaluation
31
- - If successful, implement Phase 2 (MP3 audio support)
32
-
33
  ---
34
 
35
  ## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
@@ -40,43 +167,43 @@
40
 
41
  ### Fixed (Course API Contract - DO NOT CHANGE)
42
 
43
- | Aspect | Value | Cannot Change |
44
- |--------|-------|----------------|
45
- | **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
46
- | **Questions Route** | `GET /questions` | ❌ |
47
- | **Submit Route** | `POST /submit` | ❌ |
48
- | **Number of Questions** | **20** (always 20) | ❌ |
49
- | **Question Source** | GAIA validation set, level 1 | ❌ |
50
- | **Randomness** | **NO - Fixed set** | ❌ |
51
- | **Difficulty** | All level 1 (easiest) | ❌ |
52
- | **Filter Criteria** | By tools/steps complexity | ❌ |
53
- | **Scoring** | EXACT MATCH | ❌ |
54
- | **Target Score** | 30% = 6/20 correct | ❌ |
55
 
56
  ### The 20 Questions (ALWAYS the Same)
57
 
58
- | # | Full Task ID | Description | Tools Required |
59
- |---|--------------|-------------|----------------|
60
- | 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
61
- | 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
62
- | 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
63
- | 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
64
- | 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
65
- | 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
66
- | 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
67
- | 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
68
- | 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
69
- | 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
70
- | 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
71
- | 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
72
- | 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
73
- | 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
74
- | 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
75
- | 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
76
- | 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
77
- | 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
78
- | 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
79
- | 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |
80
 
81
  **NOT random** - same 20 questions every submission!
82
 
@@ -96,12 +223,12 @@ submission_data = {
96
 
97
  ### Our Additions (SAFE to Modify)
98
 
99
- | Feature | Purpose | Required? |
100
- |---------|---------|-----------|
101
- | Question Limit | Debug: run first N | ✅ Optional |
102
- | Target Task IDs | Debug: run specific | ✅ Optional |
103
- | ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
104
- | System Error Field | UX: error tracking | ✅ Optional |
105
  | File Download (HF) | Feature: support files | ✅ Optional |
106
 
107
  ### Key Learnings
@@ -121,23 +248,27 @@ submission_data = {
121
  **Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.
122
 
123
  **Process:**
 
124
  1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
125
  2. Removed git-specific files (`.git/` folder, `.gitattributes`)
126
  3. Copied to project as `_template_original/` (static reference, no git)
127
  4. Cleaned up temporary clone from Downloads
128
 
129
  **Why Static Reference:**
 
130
  - No `.git/` folder → won't interfere with project's git
131
  - No `.gitattributes` → clean file comparison
132
  - Pure reference material for diff/comparison
133
  - Can see exactly what changed from original
134
 
135
  **Template Original Contents:**
 
136
  - `app.py` (8777 bytes - original)
137
  - `README.md` (400 bytes - original)
138
  - `requirements.txt` (15 bytes - original)
139
 
140
  **Comparison Commands:**
 
141
  ```bash
142
  # Compare file sizes
143
  ls -lh _template_original/app.py app.py
@@ -150,7 +281,8 @@ wc -l app.py _template_original/app.py
150
  ```
151
 
152
  **Created Files:**
153
- - **_template_original/** (NEW) - Static reference to original template (3 files)
 
154
 
155
  ---
156
 
@@ -159,6 +291,7 @@ wc -l app.py _template_original/app.py
159
  **Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.
160
 
161
  **Actions Taken:**
 
162
  1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
163
  2. Updated local git remote to point to new URL
164
  3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)
@@ -166,12 +299,14 @@ wc -l app.py _template_original/app.py
166
  5. Pushed commits to renamed Space: `c86df49..41ac444`
167
 
168
  **Key Learnings:**
 
169
  - Local folder name ≠ git repo identity (can rename locally without affecting remote)
170
  - Git remote URL determines push destination (updated to `agentbee`)
171
  - HuggingFace Space name is independent of local folder name
172
  - All work preserved through rename process
173
 
174
  **Current State:**
 
175
  - Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
176
  - Remote: `mangubee/agentbee` (renamed on HuggingFace)
177
  - Sync: ✅ All changes pushed
@@ -187,6 +322,7 @@ wc -l app.py _template_original/app.py
187
  **Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.
188
 
189
  **Solution:** Created `docs/gaia_submission_guide.md` documenting:
 
190
  - **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
191
  - **Official GAIA Leaderboard** (future): 450+ questions, different submission format
192
  - API routes, submission formats, scoring differences
@@ -202,9 +338,11 @@ wc -l app.py _template_original/app.py
202
  | Submission | JSON POST | File upload |
203
 
204
  **Created Files:**
 
205
  - **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards
206
 
207
  **Modified Files:**
 
208
  - **README.md** - Added note linking to submission guide
209
 
210
  ---
@@ -216,6 +354,7 @@ wc -l app.py _template_original/app.py
216
  **Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.
217
 
218
  **Implementation:**
 
219
  - Added `eval_task_ids` textbox in UI (line 763-770)
220
  - Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
221
  - Filtering logic: Parses comma-separated IDs, filters `questions_data`
@@ -223,11 +362,13 @@ wc -l app.py _template_original/app.py
223
  - Overrides question_limit when provided
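The filtering logic above can be sketched as a small pure function; the name `filter_questions` is illustrative, and the actual code in `run_and_submit_all()` may be inlined differently.

```python
# Illustrative sketch of the task-ID debug filter described above; explicit
# IDs override question_limit, matching the behavior documented in this entry.
from typing import Optional

def filter_questions(
    questions_data: list,
    task_ids: str = "",
    question_limit: Optional[int] = None,
) -> list:
    """Parse comma-separated task IDs and keep only matching questions."""
    ids = [t.strip() for t in task_ids.split(",") if t.strip()]
    if ids:  # explicit IDs win over question_limit
        wanted = set(ids)
        return [q for q in questions_data if q.get("task_id") in wanted]
    if question_limit:
        return questions_data[:question_limit]
    return questions_data
```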
224
 
225
  **Usage:**
 
226
  ```
227
  Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
228
  ```
229
 
230
  **Modified Files:**
 
231
  - **app.py** (~30 lines added)
232
  - UI: `eval_task_ids` textbox
233
  - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic
@@ -244,6 +385,7 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
244
  **Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).
245
 
246
  **Modified Files:**
 
247
  - **src/tools/calculator.py** (~15 lines modified)
248
  - `timeout()` context manager: Try/except for signal.alarm() failure
249
  - Logs warning when timeout protection disabled
@@ -256,17 +398,20 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
256
  **Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.
257
 
258
  **Solution:** Changed to boolean `system_error: yes/no` field:
 
259
  - `system_error: yes` - Technical/system error from our code (don't submit)
260
  - `system_error: no` - AI response (submit answer, even if wrong)
261
  - Added `error_log` field with full error details for system errors
262
 
263
  **Implementation:**
 
264
  - `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
265
  - Results table: "System Error" column (yes/no), "Error Log" column (when yes)
266
  - JSON export: `system_error` field, `error_log` field (when system error)
267
  - Submission logic: Only submit when `system_error == "no"`
268
 
269
  **Modified Files:**
 
270
  - **app.py** (~30 lines modified)
271
  - `a_determine_status()`: Returns tuple instead of string
272
  - `process_single_question()`: Uses new format, adds `error_log`
@@ -280,6 +425,7 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
280
  **Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
281
 
282
  **Solution:** Removed all fallback-related UI elements:
 
283
  - Removed `enable_fallback_checkbox` from Test Question tab
284
  - Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
285
  - Removed `enable_fallback` parameter from `test_single_question()` function
@@ -288,6 +434,7 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
288
  - Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
289
 
290
  **Modified Files:**
 
291
  - **app.py** (~20 lines removed)
292
  - Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
293
  - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
@@ -299,24 +446,28 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
299
  ## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
300
 
301
  **Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
 
302
  - 4 providers to test for each feature
303
  - Complex debugging with multiple code paths
304
  - Longer, less clear error messages
305
  - Adding complexity without clear benefit
306
 
307
  **Solution:** Archive fallback mechanism, use single provider only
 
308
  - Removed fallback provider loop (Gemini → HF → Groq → Claude)
309
  - Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
310
  - If provider fails, error is raised immediately
311
  - Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
312
 
313
  **Benefits:**
 
314
  - ✅ Reduced code complexity
315
  - ✅ Faster debugging (one code path)
316
  - ✅ Clearer error messages
317
  - ✅ No double work on features
318
 
319
  **Modified Files:**
 
320
  - **src/agent/llm_client.py** (~25 lines removed)
321
  - Simplified `_call_with_fallback()`: Removed fallback logic
322
  - **dev/dev_260112_02_fallback_archived.md** (NEW)
@@ -330,11 +481,13 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
330
  **Problem:** Score dropped from 5% → 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
331
 
332
  **Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
 
333
  ```python
334
  {"results": [...], "source": "tavily", "query": "...", "count": N}
335
  ```
336
 
337
  **Solution:** Handle both dict formats in evidence extraction:
 
338
  ```python
339
  if isinstance(result, dict):
340
  if "answer" in result:
@@ -349,6 +502,7 @@ if isinstance(result, dict):
349
  ```
350
 
351
  **Modified Files:**
 
352
  - **src/agent/graph.py** (~40 lines modified)
353
  - Updated evidence extraction in primary path
354
  - Updated evidence extraction in fallback path
@@ -356,6 +510,7 @@ if isinstance(result, dict):
356
  **Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
357
 
358
  **Summary of Fixes (Session 2026-01-12):**
 
359
  1. ✅ File download from HF dataset (5/5 files)
360
  2. ✅ Absolute paths from script location
361
  3. ✅ Evidence formatting for vision tools (dict → answer)
@@ -370,6 +525,7 @@ if isinstance(result, dict):
370
  **Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
371
 
372
  **Solution:** Extract 'answer' field from dict results before adding to evidence:
 
373
  ```python
374
  # Before
375
  evidence.append(f"[{tool_name}] {result}") # Dict → string representation
@@ -382,6 +538,7 @@ elif isinstance(result, str):
382
  ```
383
 
384
  **Modified Files:**
 
385
  - **src/agent/graph.py** (~15 lines modified)
386
  - Updated `execute_node()`: Extract 'answer' from dict results
387
  - Fixed both primary and fallback execution paths
@@ -399,11 +556,13 @@ elif isinstance(result, str):
399
  **Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
400
 
401
  **Solution:** Return absolute paths from `download_task_file()`
 
402
  - Changed: `target_path = os.path.join(save_dir, file_name)`
403
  - To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
404
  - Now tools can find files regardless of working directory
405
 
406
  **Modified Files:**
 
407
  - **app.py** (~3 lines modified)
408
  - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
409
 
@@ -416,6 +575,7 @@ elif isinstance(result, str):
416
  **Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
417
 
418
  **Investigation:**
 
419
  - Checked API spec: Endpoint exists with proper documentation
420
  - Tested download: HTTP 404 "No file path associated with task_id"
421
  - Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
@@ -424,12 +584,14 @@ elif isinstance(result, str):
424
  **Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
425
 
426
  **Solution:** Switch from evaluation API to GAIA dataset download
 
427
  - Use `huggingface_hub.hf_hub_download()` to fetch files
428
  - Download to `_cache/gaia_files/` (runtime cache)
429
  - File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
430
  - Added cache checking (reuse downloaded files)
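The path mapping and cache check above can be sketched as follows. The repo id `gaia-benchmark/GAIA` and the gated-access requirement are assumptions about the dataset location; the actual `download_task_file()` in app.py may differ.

```python
# Sketch of the dataset download described above; the 2023/validation/{task_id}.{ext}
# layout comes from this entry.
import os
import shutil

def gaia_repo_path(task_id: str, file_name: str, split: str = "validation") -> str:
    """Map a task to its file path inside the GAIA dataset repo."""
    ext = os.path.splitext(file_name)[1]
    return f"2023/{split}/{task_id}{ext}"

def download_task_file(task_id: str, file_name: str, save_dir: str = "_cache/gaia_files") -> str:
    ext = os.path.splitext(file_name)[1]
    target = os.path.abspath(os.path.join(save_dir, f"{task_id}{ext}"))
    if os.path.exists(target):  # cache check: reuse downloaded files
        return target
    os.makedirs(save_dir, exist_ok=True)
    # Gated dataset: requires an accepted licence and an HF token.
    from huggingface_hub import hf_hub_download
    cached = hf_hub_download(
        repo_id="gaia-benchmark/GAIA",
        repo_type="dataset",
        filename=gaia_repo_path(task_id, file_name),
    )
    shutil.copy(cached, target)
    return target  # absolute path, so tools work regardless of cwd
```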
431
 
432
  **Files with attachments (5/20 questions):**
 
433
  - `cca530fc`: Chess position image (.png)
434
  - `99c9cc74`: Pie recipe audio (.mp3)
435
  - `f918266a`: Python code (.py)
@@ -437,6 +599,7 @@ elif isinstance(result, str):
437
  - `7bd855d8`: Menu sales Excel (.xlsx)
438
 
439
  **Modified Files:**
 
440
  - **app.py** (~70 lines modified)
441
  - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
442
  - Changed signature: `download_task_file(task_id, file_name, save_dir)`
@@ -447,30 +610,34 @@ elif isinstance(result, str):
447
  - Updated `process_single_question()`: Pass `file_name` to download function
448
 
449
  **Known Limitations:**
 
450
  - Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
451
  - `.mp3` audio files still unsupported
452
  - `.py` code execution still unsupported
453
 
454
  **Next Steps:**
 
455
  1. Test new download implementation
456
  2. Expand tool support for .mp3 (audio transcription)
457
  3. Expand tool support for .py (code execution)
458
 
459
  ---
460
 
461
- ## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA
462
 
463
  **Problem:** Need to validate HF vision works before complex GAIA evaluation.
464
 
465
  **Solution:** Single smoke test with simple red square image.
466
 
467
  **Result:** ✅ PASSED
 
468
  - Model: `google/gemma-3-27b-it:scaleway`
469
  - Answer: "The image is a solid, uniform field of red color..."
470
  - Provider routing: Working correctly
471
  - Settings integration: Fixed
472
 
473
  **Modified Files:**
 
474
  - **src/config/settings.py** (~5 lines added)
475
  - Added `HF_TOKEN` and `HF_VISION_MODEL` config
476
  - Added `hf_token` and `hf_vision_model` to Settings class
@@ -480,6 +647,7 @@ elif isinstance(result, str):
480
  - Tests basic image description
481
 
482
  **Bug Fixes:**
 
483
  - Removed unsupported `timeout` parameter from `chat_completion()`
484
 
485
  **Next Steps:** Phase 3 - GAIA evaluation with HF vision
@@ -491,11 +659,13 @@ elif isinstance(result, str):
491
  **Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.
492
 
493
  **Solution:**
 
494
  - Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
495
  - Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
496
  - Each provider fails independently (NO fallback chains during testing)
497
 
498
  **Modified Files:**
 
499
  - **src/tools/vision.py** (~120 lines added/modified)
500
  - Added `analyze_image_hf()` function with retry logic
501
  - Updated `analyze_image()` routing with provider selection
@@ -505,14 +675,15 @@ elif isinstance(result, str):
505
 
506
  **Validated Models (Phase 0 Extended Testing):**
507
 
508
- | Rank | Model | Provider | Speed | Notes |
509
- |------|-------|----------|-------|-------|
510
- | 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
511
- | 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
512
- | 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
513
- | 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
514
 
515
  **Failed Models (not vision-capable):**
 
516
  - `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
517
  - `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
518
  - `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")
@@ -531,17 +702,20 @@ elif isinstance(result, str):
531
  **Test Results:**
532
 
533
  **Working Models:**
 
534
  - `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
535
  - `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
536
  - `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand
537
 
538
  **Failed Models:**
 
539
  - `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
540
  - `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
541
  - `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
542
  - `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request
543
 
544
  **Output Files:**
 
545
  - `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
546
  - `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
547
  - `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test
@@ -601,11 +775,11 @@ elif isinstance(result, str):
601
 
602
  **Critical Discovery - Large Image Handling:**
603
 
604
- | Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
605
- |-------|-------------------|---------------------|----------------|
606
- | aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
607
- | Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
608
- | ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
609
 
610
  **API Behavior:**
611
 
@@ -693,28 +867,33 @@ elif isinstance(result, str):
693
  **Solution - Plan Corrections Applied:**
694
 
695
  1. **Added Phase 0: API Validation (CRITICAL)**
 
696
  - Test HF Inference API with vision models BEFORE implementation
697
  - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
698
  - Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
699
  - Time saved: Prevents 2-3 hours implementing non-functional code
700
 
701
  2. **Removed Fallback Logic from Testing**
 
702
  - Each provider fails independently with clear error message
703
  - NO fallback chains (HF → Gemini → Claude) during testing
704
  - Philosophy: Build capability knowledge, don't hide problems
705
  - Log exact failure reasons for debugging
706
 
707
  3. **Added Smoke Tests (Phase 2)**
 
708
  - 4 tests before GAIA: description, OCR, counting, single GAIA question
709
  - Decision gate: ≥3/4 must pass before full evaluation
710
  - Prevents debugging chess positions when basic integration broken
711
 
712
  4. **Added Decision Gates**
 
713
  - Gate 1 (Phase 0): API validation → GO/NO-GO
714
  - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
715
  - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate
716
 
717
  5. **Added Backup Strategy Documentation**
 
718
  - Option C: HF Spaces deployment (custom endpoint)
719
  - Option D: Local transformers library (no API)
720
  - Option E: Hybrid (HF text + Gemini/Claude vision)
@@ -741,14 +920,14 @@ elif isinstance(result, str):
741
 
742
  **Key Changes Summary:**
743
 
744
- | Before | After |
745
- |--------|-------|
746
- | Jump to implementation | Phase 0: Validate API first |
747
- | Fallback chains | No fallbacks, fail independently |
748
- | Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
749
- | Direct to GAIA | Smoke tests → GAIA |
750
- | No backup plan | 3 backup options documented |
751
- | Single success criteria | Per-phase criteria + decision gates |
752
 
753
  **Benefits:**
754
 
@@ -859,10 +1038,11 @@ def analyze_image(image_path: str, question: Optional[str] = None) -> Dict:
859
  **Modified Files:**
860
 
861
  - **app.py** (~10 lines modified)
 
862
  - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
863
  - Changed: `exports/` → `_cache/`
864
- - Updated docstring: "All environments: Saves to ./_cache/gaia_results_TIMESTAMP.json"
865
- - Updated comment: "Save to _cache/ folder (internal runtime storage, not accessible via HF UI)"
866
 
867
  - **.gitignore** (~3 lines added)
868
  - Added `_cache/` to ignore list
 
1
  # Session Changelog
2
 
3
+ ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation
4
+
5
+ **Discovery:** HuggingFace Provider Suffix Behavior - Auto-Routing is Bad Practice
6
+
7
+ **Finding:** Models WITHOUT `:provider` suffix work via HF auto-routing, but this is unreliable.
8
+
9
+ **Test Result:**
10
+ ```python
11
+ # Without provider - WORKS but uses HF default routing
12
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct" # ✅ Works, but...
13
+ # Response: "Test successful."
14
+
15
+ # With explicit provider - RECOMMENDED
16
+ HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway" # ✅ Reliable
17
+ ```
18
+
19
+ **Why Auto-Routing is Bad Practice:**
20
+
21
+ | Issue | Impact |
22
+ |-------|--------|
23
+ | **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together) |
24
+ | **Inconsistent latency** | 2s one run, 20s next run (different provider selected) |
25
+ | **No cost control** | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
26
+ | **Debugging nightmare** | Can't reproduce issues when provider is unknown |
27
+ | **Silent failures** | Provider might be down, HF retries with different one |
28
+
29
+ **Best Practice: ALWAYS specify provider**
30
+
31
+ ```python
32
+ # BAD - Unreliable
33
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
34
+
35
+ # GOOD - Explicit, predictable
36
+ HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"
37
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct:cerebras"
38
+ HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
39
+ ```
40
+
41
+ **Available Providers for Text Models:**
42
+ - `:scaleway` - Fast, reliable (recommended for Llama)
43
+ - `:cerebras` - Very fast (recommended for Qwen)
44
+ - `:novita` - Fast, reputable
45
+ - `:together` - Reliable
46
+ - `:sambanova` - Fast but expensive
47
+
48
+ **Action Taken:** Updated code to always use explicit `:provider` suffix
49
+
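Honoring the explicit suffix can be sketched by splitting the spec and passing the provider to the client. This assumes a huggingface_hub version with inference-provider routing (the `provider` parameter on `InferenceClient`); the helper name is illustrative, not from the repo.

```python
# Sketch: parse "org/model:provider" and route to that provider explicitly,
# instead of relying on HF auto-routing.
import os

def split_model_spec(spec: str):
    """'org/model:provider' -> (model, provider); provider is None when absent."""
    model, _, provider = spec.partition(":")
    return model, provider or None

if __name__ == "__main__":
    from huggingface_hub import InferenceClient
    model, provider = split_model_spec("meta-llama/Llama-3.3-70B-Instruct:scaleway")
    client = InferenceClient(model=model, provider=provider, token=os.environ["HF_TOKEN"])
    resp = client.chat_completion(messages=[{"role": "user", "content": "ping"}])
    print(resp.choices[0].message.content)
```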
50
+ ---
51
+
52
+ ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration
53
+
54
+ **Model Changes:**
55
+ 1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
56
+ 2. Llama 3.3 70B (Scaleway) → Failed synthesis
57
+ 3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing
58
+
59
+ **openai/gpt-oss-120b:**
60
+ - OpenAI's 120B-parameter open-weight model
61
+ - Strong reasoning capability
62
+ - Optimized for function calling and tool use
63
+
64
+ ---
65
+
66
+ ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation (Original)
67
+
68
+ **Problem:** Qwen 2.5 72B fails synthesis despite having complete transcript evidence (738 chars).
69
+
70
+ **Root Cause Analysis:**
71
+
72
+ - Transcript contains all 3 species: "giant petrel", "emperor", "adelie" (Whisper error: "deli")
73
+ - Qwen 2.5 cannot resolve transcription errors ("deli" → "adelie penguin")
74
+ - Qwen 2.5 weak at entity extraction + counting from noisy text
75
+ - Returns "Unable to answer" instead of reasoning through ambiguity
76
+
77
+ **Transcript Quality Assessment:**
78
+
79
+ - **NOT clear enough for current LLM** - requires:
80
+ 1. Error tolerance ("deli" → "adelie")
81
+ 2. World knowledge (Antarctic bird species)
82
+ 3. Entity extraction from narrative text
83
+ 4. Temporal reasoning ("simultaneously" = same scene)
84
+
85
+ **Answer from transcript:** 3 species (giant petrel, emperor penguin, adelie penguin)
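As an aside (not part of the repo), the "deli" → "adelie" repair the LLM failed to make is mechanical fuzzy matching, which is one way to sanity-check a cached transcript during debugging:

```python
# Illustrative only: map a noisy Whisper token back to a known species name
# with stdlib difflib. The species list comes from this entry's analysis.
import difflib

SPECIES = ["giant petrel", "emperor penguin", "adelie penguin"]

def closest_species(token: str):
    """Match a (possibly garbled) token against species first words."""
    first_words = {s.split()[0]: s for s in SPECIES}
    match = difflib.get_close_matches(token.lower(), list(first_words), n=1, cutoff=0.6)
    return first_words[match[0]] if match else None
```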
86
+
87
+ **Solution:** Upgrade to Llama 3.3 70B Instruct (Scaleway provider)
88
+
89
+ - Better reasoning and instruction following
90
+ - Stronger entity extraction from noisy context
91
+ - Better at handling transcription ambiguities
92
+
93
+ **Modified Files:**
94
+
95
+ - **src/agent/llm_client.py** (line 37) - Model: Qwen 2.5 → Llama 3.3 70B
96
+
97
+ ---
98
+
99
+ ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Transcript Caching for Debugging
100
+
101
+ **Problem:** Transcription works (738 chars from Whisper) but LLM returns "Unable to answer". Need to inspect raw transcript to debug synthesis failure.
102
+
103
+ **Solution:** Added `save_transcript_to_cache()` function to save transcripts to `_cache/{video_id}_transcript.txt` for both API and Whisper paths.
104
+
105
+ **Modified Files:**
106
+
107
+ - **src/tools/youtube.py** (+30 lines)
108
+ - Added `save_transcript_to_cache()` function (lines 55-79)
109
+ - Calls after successful API transcript retrieval (line 164)
110
+ - Calls after successful Whisper transcription (line 317)
111
+ - File format includes metadata: video_id, source, length, timestamp
112
+
113
+ **File Format:**
114
+
115
+ ```
116
+ # YouTube Transcript
117
+ # Video ID: L1vXCYZAYYM
118
+ # Source: whisper
119
+ # Length: 738 characters
120
+ # Generated: 2026-01-13T02:27:...
121
+
122
+ <transcript text>
123
+ ```
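The cache writer can be sketched as below; this mirrors the file format shown above, but the actual `save_transcript_to_cache()` in src/tools/youtube.py (lines 55-79) may differ.

```python
# Sketch: write transcript plus a small metadata header to
# _cache/{video_id}_transcript.txt, matching the format documented above.
from datetime import datetime, timezone
from pathlib import Path

CACHE_DIR = Path("_cache")

def save_transcript_to_cache(video_id: str, transcript: str, source: str) -> Path:
    """Persist a transcript for debugging; returns the cache file path."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{video_id}_transcript.txt"
    header = (
        "# YouTube Transcript\n"
        f"# Video ID: {video_id}\n"
        f"# Source: {source}\n"          # 'api' or 'whisper'
        f"# Length: {len(transcript)} characters\n"
        f"# Generated: {datetime.now(timezone.utc).isoformat()}\n\n"
    )
    path.write_text(header + transcript, encoding="utf-8")
    return path
```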
124
+
125
+ **Next Steps:**
126
+
127
+ - Test on question #3 (bird species) - inspect cached transcript
128
+ - Debug LLM synthesis failure if transcript contains correct answer
129
+
130
+ ---
131
+
132
  ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
133
 
134
  **Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
 
136
  **Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
137
 
138
  **Modified Files:**
139
+
140
  - **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
141
  - **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
142
+ - **src/tools/__init__.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
143
  - **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
144
  - **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
145
 
146
  **Key Technical Decisions:**
147
+
148
  - **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
149
  - **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
150
  - **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
 
152
  - **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
153
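The primary/fallback flow above can be sketched as follows (a minimal sketch, not the project's actual code: `get_captions` and `whisper_fallback` are injected stand-ins for `YouTubeTranscriptApi.get_transcript` and the project's `transcribe_from_audio`; the returned dict mirrors the tool's format):

```python
def fetch_transcript(video_id, get_captions, whisper_fallback):
    """Try the fast caption API first; fall back to Whisper on any failure.

    get_captions / whisper_fallback are injected stand-ins so the
    orchestration can be shown without network access.
    """
    try:
        # Primary path: instant caption fetch (1-3 s when captions exist)
        segments = get_captions(video_id)  # list of {"text": ...} caption chunks
        text = " ".join(seg["text"] for seg in segments)
        return {"text": text, "video_id": video_id, "source": "api"}
    except Exception:
        # Slow path: yt-dlp audio download + Whisper (30s-2min, GPU on Spaces)
        result = whisper_fallback(video_id)
        return {"text": result["text"], "video_id": video_id, "source": "whisper"}
```

The `source` field ("api" vs "whisper") is what lets the cached transcript header record which path produced the text.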
 
  **Expected Impact:**
+
  - Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
  - Score: 10% → 20% (2/20 → 4/20 correct)
  - **Target progress:** Two more correct answers needed to reach the 30% requirement (6/20)

  ---

  ## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable

  ### Fixed (Course API Contract - DO NOT CHANGE)

+ | Aspect | Value | Cannot Change |
+ | ----------------------- | -------------------------------------- | ------------- |
+ | **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
+ | **Questions Route** | `GET /questions` | ❌ |
+ | **Submit Route** | `POST /submit` | ❌ |
+ | **Number of Questions** | **20** (always 20) | ❌ |
+ | **Question Source** | GAIA validation set, level 1 | ❌ |
+ | **Randomness** | **NO - Fixed set** | ❌ |
+ | **Difficulty** | All level 1 (easiest) | ❌ |
+ | **Filter Criteria** | By tools/steps complexity | ❌ |
+ | **Scoring** | EXACT MATCH | ❌ |
+ | **Target Score** | 30% = 6/20 correct | ❌ |

  ### The 20 Questions (ALWAYS the Same)

+ | # | Full Task ID | Description | Tools Required |
+ | --- | -------------------------------------- | ------------------------------ | ---------------- |
+ | 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
+ | 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
+ | 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
+ | 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
+ | 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
+ | 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
+ | 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
+ | 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
+ | 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
+ | 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
+ | 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
+ | 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
+ | 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
+ | 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
+ | 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
+ | 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
+ | 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
+ | 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
+ | 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
+ | 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |

  **NOT random** - same 20 questions every submission!

  ### Our Additions (SAFE to Modify)

+ | Feature | Purpose | Required? |
+ | ------------------ | ---------------------- | ----------- |
+ | Question Limit | Debug: run first N | ✅ Optional |
+ | Target Task IDs | Debug: run specific | ✅ Optional |
+ | ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
+ | System Error Field | UX: error tracking | ✅ Optional |
  | File Download (HF) | Feature: support files | ✅ Optional |
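The ThreadPoolExecutor addition in the table can be sketched as (names illustrative; the real `run_and_submit_all` does more bookkeeping per question):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_all(questions, answer_fn, workers=4):
    """Evaluate questions concurrently while keeping results in input order.

    pool.map preserves ordering, so result i always matches question i.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(answer_fn, questions))
```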
 
  ### Key Learnings

  **Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.

  **Process:**
+
  1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
  2. Removed git-specific files (`.git/` folder, `.gitattributes`)
  3. Copied to project as `_template_original/` (static reference, no git)
  4. Cleaned up temporary clone from Downloads

  **Why Static Reference:**
+
  - No `.git/` folder → won't interfere with project's git
  - No `.gitattributes` → clean file comparison
  - Pure reference material for diff/comparison
  - Can see exactly what changed from original

  **Template Original Contents:**
+
  - `app.py` (8777 bytes - original)
  - `README.md` (400 bytes - original)
  - `requirements.txt` (15 bytes - original)

  **Comparison Commands:**
+
  ```bash
  # Compare file sizes
  ls -lh _template_original/app.py app.py

  ```

  **Created Files:**
+
+ - **\_template_original/** (NEW) - Static reference to original template (3 files)

  ---

  **Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.

  **Actions Taken:**
+
  1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
  2. Updated local git remote to point to new URL
  3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)

  5. Pushed commits to renamed Space: `c86df49..41ac444`

  **Key Learnings:**
+
  - Local folder name ≠ git repo identity (can rename locally without affecting remote)
  - Git remote URL determines push destination (updated to `agentbee`)
  - HuggingFace Space name is independent of local folder name
  - All work preserved through rename process

  **Current State:**
+
  - Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
  - Remote: `mangubee/agentbee` (renamed on HuggingFace)
  - Sync: ✅ All changes pushed
  **Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.

  **Solution:** Created `docs/gaia_submission_guide.md` documenting:
+
  - **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
  - **Official GAIA Leaderboard** (future): 450+ questions, different submission format
  - API routes, submission formats, scoring differences

  | Submission | JSON POST | File upload |

  **Created Files:**
+
  - **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards

  **Modified Files:**
+
  - **README.md** - Added note linking to submission guide

  ---

  **Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.

  **Implementation:**
+
  - Added `eval_task_ids` textbox in UI (line 763-770)
  - Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
  - Filtering logic: Parses comma-separated IDs, filters `questions_data`
  - Overrides question_limit when provided

  **Usage:**
+
  ```
  Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
  ```
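The parse-and-filter step can be sketched as (a sketch under the assumption that each entry in `questions_data` carries a `task_id` key, as the course API returns):

```python
def filter_questions(questions_data, task_ids=""):
    """Keep only the requested task IDs; empty input means run everything."""
    # Split on commas and strip whitespace so "a, b" and "a,b" both work
    wanted = {tid.strip() for tid in task_ids.split(",") if tid.strip()}
    if not wanted:
        return questions_data  # no filter -> full evaluation
    return [q for q in questions_data if q["task_id"] in wanted]
```

A non-empty filter naturally overrides any question limit, since only the matching questions survive.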
 
  **Modified Files:**
+
  - **app.py** (~30 lines added)
    - UI: `eval_task_ids` textbox
    - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic

  **Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).

  **Modified Files:**
+
  - **src/tools/calculator.py** (~15 lines modified)
    - `timeout()` context manager: Try/except for signal.alarm() failure
    - Logs warning when timeout protection disabled
 
  **Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.

  **Solution:** Changed to boolean `system_error: yes/no` field:
+
  - `system_error: yes` - Technical/system error from our code (don't submit)
  - `system_error: no` - AI response (submit answer, even if wrong)
  - Added `error_log` field with full error details for system errors

  **Implementation:**
+
  - `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
  - Results table: "System Error" column (yes/no), "Error Log" column (when yes)
  - JSON export: `system_error` field, `error_log` field (when system error)
  - Submission logic: Only submit when `system_error == "no"`
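The tuple contract can be sketched as (illustrative sketch; `determine_status` here is a stand-in for the real `a_determine_status`, which inspects more failure shapes):

```python
def determine_status(answer, exc=None):
    """Return (is_system_error, error_log) per the contract above."""
    if exc is not None:
        # system_error: yes -> record details, skip submission
        return True, f"{type(exc).__name__}: {exc}"
    # system_error: no -> submit the AI answer, even if it is wrong
    return False, None
```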
 
  **Modified Files:**
+
  - **app.py** (~30 lines modified)
    - `a_determine_status()`: Returns tuple instead of string
    - `process_single_question()`: Uses new format, adds `error_log`

  **Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py

  **Solution:** Removed all fallback-related UI elements:
+
  - Removed `enable_fallback_checkbox` from Test Question tab
  - Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
  - Removed `enable_fallback` parameter from `test_single_question()` function
  - Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")

  **Modified Files:**
+
  - **app.py** (~20 lines removed)
    - Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
    - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)

  ## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived

  **Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) created double work:
+
  - 4 providers to test for each feature
  - Complex debugging with multiple code paths
  - Longer, less clear error messages
  - Adding complexity without clear benefit

  **Solution:** Archive fallback mechanism, use single provider only
+
  - Removed fallback provider loop (Gemini → HF → Groq → Claude)
  - Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
  - If provider fails, error is raised immediately
  - Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`

  **Benefits:**
+
  - ✅ Reduced code complexity
  - ✅ Faster debugging (one code path)
  - ✅ Clearer error messages
  - ✅ No double work on features

  **Modified Files:**
+
  - **src/agent/llm_client.py** (~25 lines removed)
    - Simplified `_call_with_fallback()`: Removed fallback logic
  - **dev/dev_260112_02_fallback_archived.md** (NEW)

  **Problem:** Score dropped from 5% to 0% after the first evidence fix. Evidence was showing the dict's string representation: `{'results': [{'title': '...', ...}]`

  **Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return a different dict structure with a `"results"` key:
+
  ```python
  {"results": [...], "source": "tavily", "query": "...", "count": N}
  ```

  **Solution:** Handle both dict formats in evidence extraction:
+
  ```python
  if isinstance(result, dict):
      if "answer" in result:

  ```
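Put together, the dual-format handling can be sketched as (self-contained sketch; the `title`/`content` keys inside each search result are assumptions, not confirmed field names):

```python
def format_evidence(tool_name, result):
    """Render a tool result as plain text for the synthesis prompt."""
    if isinstance(result, dict):
        if "answer" in result:        # vision tools: {'answer': ..., 'model': ...}
            return f"[{tool_name}] {result['answer']}"
        if "results" in result:       # search tools: {'results': [...], 'source': ...}
            items = [f"- {r.get('title', '')}: {r.get('content', '')}"
                     for r in result["results"]]
            return f"[{tool_name}]\n" + "\n".join(items)
    return f"[{tool_name}] {result}"  # plain strings pass through unchanged
```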
 
  **Modified Files:**
+
  - **src/agent/graph.py** (~40 lines modified)
    - Updated evidence extraction in primary path
    - Updated evidence extraction in fallback path

  **Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).

  **Summary of Fixes (Session 2026-01-12):**
+
  1. ✅ File download from HF dataset (5/5 files)
  2. ✅ Absolute paths from script location
  3. ✅ Evidence formatting for vision tools (dict → answer)

  **Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.

  **Solution:** Extract 'answer' field from dict results before adding to evidence:
+
  ```python
  # Before
  evidence.append(f"[{tool_name}] {result}")  # Dict → string representation

  ```

  **Modified Files:**
+
  - **src/agent/graph.py** (~15 lines modified)
    - Updated `execute_node()`: Extract 'answer' from dict results
    - Fixed both primary and fallback execution paths

  **Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.

  **Solution:** Return absolute paths from `download_task_file()`
+
  - Changed: `target_path = os.path.join(save_dir, file_name)`
  - To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
  - Now tools can find files regardless of working directory
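A quick illustration of the before/after behavior (paths only, no download; the file name reuses the chess-image task ID prefix from the question list):

```python
import os

save_dir = "_cache/gaia_files"
rel_path = os.path.join(save_dir, "cca530fc.png")
# Resolved against the cwd at download time, so later cwd changes are harmless
abs_path = os.path.abspath(os.path.join(save_dir, "cca530fc.png"))
```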
 
  **Modified Files:**
+
  - **app.py** (~3 lines modified)
    - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`

  **Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.

  **Investigation:**
+
  - Checked API spec: Endpoint exists with proper documentation
  - Tested download: HTTP 404 "No file path associated with task_id"
  - Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files

  **Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.

  **Solution:** Switch from evaluation API to GAIA dataset download
+
  - Use `huggingface_hub.hf_hub_download()` to fetch files
  - Download to `_cache/gaia_files/` (runtime cache)
  - File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
  - Added cache checking (reuse downloaded files)
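The repo-internal path can be built like this (a sketch; the returned string is what would be passed as `filename=` to `hf_hub_download(repo_type="dataset", ...)`, assuming the dataset lives at the usual `gaia-benchmark/GAIA` repo - the download call itself is omitted here):

```python
from pathlib import Path

def gaia_repo_path(task_id, file_name, split="validation"):
    """Map a task's attachment to its path inside the GAIA dataset repo,
    following the 2023/{split}/{task_id}.{ext} layout documented above."""
    ext = Path(file_name).suffix  # keep the attachment's original extension
    return f"2023/{split}/{task_id}{ext}"
```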
 
  **Files with attachments (5/20 questions):**
+
  - `cca530fc`: Chess position image (.png)
  - `99c9cc74`: Pie recipe audio (.mp3)
  - `f918266a`: Python code (.py)
  - `7bd855d8`: Menu sales Excel (.xlsx)

  **Modified Files:**
+
  - **app.py** (~70 lines modified)
    - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
    - Changed signature: `download_task_file(task_id, file_name, save_dir)`
    - Updated `process_single_question()`: Pass `file_name` to download function

  **Known Limitations:**
+
  - Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
  - `.mp3` audio files still unsupported
  - `.py` code execution still unsupported

  **Next Steps:**
+
  1. Test new download implementation
  2. Expand tool support for .mp3 (audio transcription)
  3. Expand tool support for .py (code execution)

  ---

+ ## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision LLM Validated - Ready for GAIA

  **Problem:** Need to validate HF vision works before complex GAIA evaluation.

  **Solution:** Single smoke test with simple red square image.

  **Result:** ✅ PASSED
+
  - Model: `google/gemma-3-27b-it:scaleway`
  - Answer: "The image is a solid, uniform field of red color..."
  - Provider routing: Working correctly
  - Settings integration: Fixed

  **Modified Files:**
+
  - **src/config/settings.py** (~5 lines added)
    - Added `HF_TOKEN` and `HF_VISION_MODEL` config
    - Added `hf_token` and `hf_vision_model` to Settings class

  - Tests basic image description

  **Bug Fixes:**
+
  - Removed unsupported `timeout` parameter from `chat_completion()`

  **Next Steps:** Phase 3 - GAIA evaluation with HF vision
  **Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.

  **Solution:**
+
  - Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
  - Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
  - Each provider fails independently (NO fallback chains during testing)
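The routing fix boils down to a dispatch on `LLM_PROVIDER` (sketch with injected handlers; the real `analyze_image` calls provider-specific functions such as `analyze_image_hf`):

```python
import os

def route_vision_call(image_path, question, handlers):
    """Dispatch to the provider selected via LLM_PROVIDER.

    No fallback chain: an unknown provider (or a failing handler)
    raises immediately with a clear error.
    """
    provider = os.getenv("LLM_PROVIDER", "hf").lower()
    if provider not in handlers:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")
    return handlers[provider](image_path, question)
```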
 
  **Modified Files:**
+
  - **src/tools/vision.py** (~120 lines added/modified)
    - Added `analyze_image_hf()` function with retry logic
    - Updated `analyze_image()` routing with provider selection

  **Validated Models (Phase 0 Extended Testing):**

+ | Rank | Model | Provider | Speed | Notes |
+ | ---- | -------------------------------- | -------- | ----- | ------------------------------ |
+ | 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
+ | 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
+ | 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
+ | 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |

  **Failed Models (not vision-capable):**
+
  - `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
  - `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
  - `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")

  **Test Results:**

  **Working Models:**
+
  - `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
  - `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
  - `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand

  **Failed Models:**
+
  - `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
  - `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
  - `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
  - `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request

  **Output Files:**
+
  - `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
  - `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
  - `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test

  **Critical Discovery - Large Image Handling:**

+ | Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
+ | -------------------- | ----------------- | ------------------- | ---------------------------- |
+ | aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
+ | Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
+ | ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |

  **API Behavior:**

  **Solution - Plan Corrections Applied:**

  1. **Added Phase 0: API Validation (CRITICAL)**
+
     - Test HF Inference API with vision models BEFORE implementation
     - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
     - Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
     - Time saved: Prevents 2-3 hours implementing non-functional code

  2. **Removed Fallback Logic from Testing**
+
     - Each provider fails independently with clear error message
     - NO fallback chains (HF → Gemini → Claude) during testing
     - Philosophy: Build capability knowledge, don't hide problems
     - Log exact failure reasons for debugging

  3. **Added Smoke Tests (Phase 2)**
+
     - 4 tests before GAIA: description, OCR, counting, single GAIA question
     - Decision gate: ≥3/4 must pass before full evaluation
     - Prevents debugging chess positions when basic integration broken

  4. **Added Decision Gates**
+
     - Gate 1 (Phase 0): API validation → GO/NO-GO
     - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
     - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate

  5. **Added Backup Strategy Documentation**
+
     - Option C: HF Spaces deployment (custom endpoint)
     - Option D: Local transformers library (no API)
     - Option E: Hybrid (HF text + Gemini/Claude vision)

  **Key Changes Summary:**

+ | Before | After |
+ | ----------------------------- | ----------------------------------- |
+ | Jump to implementation | Phase 0: Validate API first |
+ | Fallback chains | No fallbacks, fail independently |
+ | Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
+ | Direct to GAIA | Smoke tests → GAIA |
+ | No backup plan | 3 backup options documented |
+ | Single success criteria | Per-phase criteria + decision gates |

  **Benefits:**

  **Modified Files:**

  - **app.py** (~10 lines modified)
+
    - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
    - Changed: `exports/` → `_cache/`
+   - Updated docstring: "All environments: Saves to ./\_cache/gaia_results_TIMESTAMP.json"
+   - Updated comment: "Save to \_cache/ folder (internal runtime storage, not accessible via HF UI)"

  - **.gitignore** (~3 lines added)
    - Added `_cache/` to ignore list
src/agent/llm_client.py CHANGED
@@ -34,8 +34,9 @@ CLAUDE_MODEL = "claude-sonnet-4-5-20250929"
  GEMINI_MODEL = "gemini-2.0-flash-exp"

  # HuggingFace Configuration
- HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # Excellent for function calling and reasoning
- # Alternatives: "meta-llama/Llama-3.1-70B-Instruct", "NousResearch/Hermes-3-Llama-3.1-70B"
+ HF_MODEL = "openai/gpt-oss-120b:scaleway"  # OpenAI's 120B open-weight model, strong reasoning
+ # Previous: "meta-llama/Llama-3.3-70B-Instruct:scaleway" (failed synthesis)
+ # Previous: "Qwen/Qwen2.5-72B-Instruct" (weaker at handling transcription errors)

  # Groq Configuration
  GROQ_MODEL = "openai/gpt-oss-120b"
src/tools/youtube.py CHANGED
@@ -48,6 +48,37 @@ CLEANUP_TEMP_FILES = True
  logger = logging.getLogger(__name__)


+ # ============================================================================
+ # Transcript Cache
+ # ============================================================================
+
+ def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
+     """
+     Save transcript to _cache folder for debugging.
+
+     Args:
+         video_id: YouTube video ID
+         text: Transcript text
+         source: "api" or "whisper"
+     """
+     try:
+         cache_dir = Path("_cache")
+         cache_dir.mkdir(exist_ok=True)
+
+         cache_file = cache_dir / f"{video_id}_transcript.txt"
+         with open(cache_file, "w", encoding="utf-8") as f:
+             f.write(f"# YouTube Transcript\n")
+             f.write(f"# Video ID: {video_id}\n")
+             f.write(f"# Source: {source}\n")
+             f.write(f"# Length: {len(text)} characters\n")
+             f.write(f"# Generated: {__import__('datetime').datetime.now().isoformat()}\n")
+             f.write(f"\n{text}\n")
+
+         logger.info(f"Transcript saved to cache: {cache_file}")
+     except Exception as e:
+         logger.warning(f"Failed to save transcript to cache: {e}")
+
+
  # ============================================================================
  # YouTube URL Parser
  # =============================================================================
@@ -129,6 +160,9 @@ def get_youtube_transcript(video_id: str) -> Dict[str, Any]:

      logger.info(f"Transcript fetched: {len(text)} characters")

+     # Save to cache for debugging
+     save_transcript_to_cache(video_id, text, "api")
+
      return {
          "text": text,
          "video_id": video_id,
@@ -279,6 +313,9 @@ def transcribe_from_audio(video_url: str) -> Dict[str, Any]:
          logger.warning(f"Failed to cleanup temp file: {e}")

      if result["success"]:
+         # Save to cache for debugging
+         save_transcript_to_cache(video_id, result["text"], "whisper")
+
          return {
              "text": result["text"],
              "video_id": video_id,