mangubee Claude committed on
Commit b68b317 · 1 Parent(s): a1d5de5

feat: add transcript caching and upgrade synthesis model


- Add save_transcript_to_cache() to save transcripts for debugging
- Upgrade LLM: Qwen 2.5 → openai/gpt-oss-120b (Scaleway)
- Document HF provider suffix behavior (auto-routing is bad practice)
- Model iteration: Qwen 2.5 → Llama 3.3 → gpt-oss-120b

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (3)
  1. CHANGELOG.md +249 -69
  2. src/agent/llm_client.py +3 -2
  3. src/tools/youtube.py +37 -0
CHANGELOG.md CHANGED
@@ -1,5 +1,134 @@
1
  # Session Changelog
2
 
3
  ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
4
 
5
  **Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
@@ -7,13 +136,15 @@
7
  **Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
8
 
9
  **Modified Files:**
 
10
  - **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
11
  - **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
12
- - **src/tools/__init__.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
13
  - **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
14
  - **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
15
 
16
  **Key Technical Decisions:**
 
17
  - **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
18
  - **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
19
  - **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
@@ -21,15 +152,11 @@
21
  - **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
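The primary/fallback decision above can be sketched with the two fetchers injected as callables, so the routing is testable without network access. This is an illustration only: the real `youtube_transcript` tool in src/tools/youtube.py calls youtube-transcript-api and Whisper directly rather than taking them as parameters.

```python
# Sketch of the transcript routing described above. The callables stand in for
# youtube-transcript-api (fast path) and yt-dlp + Whisper (slow fallback);
# injecting them is an illustration choice, not how src/tools/youtube.py is written.
from typing import Callable, Tuple

def get_transcript(
    video_id: str,
    fetch_api: Callable[[str], str],            # youtube-transcript-api, ~1-3 s
    transcribe_fallback: Callable[[str], str],  # yt-dlp + Whisper, 30 s - 2 min
) -> Tuple[str, str]:
    """Return (text, source), where source is 'api' or 'whisper'."""
    try:
        return fetch_api(video_id), "api"
    except Exception:
        # The API path fails on roughly 8% of videos (no captions, region
        # locks), per the success rate above; fall back to audio transcription.
        return transcribe_fallback(video_id), "whisper"
```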
22
 
23
  **Expected Impact:**
 
24
  - Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
25
  - Score: 10% → 20% (2/20 → 4/20 correct)
26
  - **Target:** 30% requirement (6/20) still needs two more correct answers
27
 
28
- **Next Steps:**
29
- - Test on question #3 (bird species)
30
- - Run full evaluation
31
- - If successful, implement Phase 2 (MP3 audio support)
32
-
33
  ---
34
 
35
  ## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
@@ -40,43 +167,43 @@
40
 
41
  ### Fixed (Course API Contract - DO NOT CHANGE)
42
 
43
- | Aspect | Value | Cannot Change |
44
- |--------|-------|----------------|
45
- | **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
46
- | **Questions Route** | `GET /questions` | ❌ |
47
- | **Submit Route** | `POST /submit` | ❌ |
48
- | **Number of Questions** | **20** (always 20) | ❌ |
49
- | **Question Source** | GAIA validation set, level 1 | ❌ |
50
- | **Randomness** | **NO - Fixed set** | ❌ |
51
- | **Difficulty** | All level 1 (easiest) | ❌ |
52
- | **Filter Criteria** | By tools/steps complexity | ❌ |
53
- | **Scoring** | EXACT MATCH | ❌ |
54
- | **Target Score** | 30% = 6/20 correct | ❌ |
55
 
56
  ### The 20 Questions (ALWAYS the Same)
57
 
58
- | # | Full Task ID | Description | Tools Required |
59
- |---|--------------|-------------|----------------|
60
- | 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
61
- | 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
62
- | 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
63
- | 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
64
- | 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
65
- | 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
66
- | 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
67
- | 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
68
- | 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
69
- | 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
70
- | 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
71
- | 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
72
- | 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
73
- | 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
74
- | 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
75
- | 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
76
- | 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
77
- | 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
78
- | 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
79
- | 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |
80
 
81
  **NOT random** - same 20 questions every submission!
82
 
@@ -96,12 +223,12 @@ submission_data = {
96
 
97
  ### Our Additions (SAFE to Modify)
98
 
99
- | Feature | Purpose | Required? |
100
- |---------|---------|-----------|
101
- | Question Limit | Debug: run first N | ✅ Optional |
102
- | Target Task IDs | Debug: run specific | ✅ Optional |
103
- | ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
104
- | System Error Field | UX: error tracking | ✅ Optional |
105
  | File Download (HF) | Feature: support files | ✅ Optional |
106
 
107
  ### Key Learnings
@@ -121,23 +248,27 @@ submission_data = {
121
  **Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.
122
 
123
  **Process:**
 
124
  1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
125
  2. Removed git-specific files (`.git/` folder, `.gitattributes`)
126
  3. Copied to project as `_template_original/` (static reference, no git)
127
  4. Cleaned up temporary clone from Downloads
128
 
129
  **Why Static Reference:**
 
130
  - No `.git/` folder → won't interfere with project's git
131
  - No `.gitattributes` → clean file comparison
132
  - Pure reference material for diff/comparison
133
  - Can see exactly what changed from original
134
 
135
  **Template Original Contents:**
 
136
  - `app.py` (8777 bytes - original)
137
  - `README.md` (400 bytes - original)
138
  - `requirements.txt` (15 bytes - original)
139
 
140
  **Comparison Commands:**
 
141
  ```bash
142
  # Compare file sizes
143
  ls -lh _template_original/app.py app.py
@@ -150,7 +281,8 @@ wc -l app.py _template_original/app.py
150
  ```
151
 
152
  **Created Files:**
153
- - **_template_original/** (NEW) - Static reference to original template (3 files)
 
154
 
155
  ---
156
 
@@ -159,6 +291,7 @@ wc -l app.py _template_original/app.py
159
  **Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.
160
 
161
  **Actions Taken:**
 
162
  1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
163
  2. Updated local git remote to point to new URL
164
  3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)
@@ -166,12 +299,14 @@ wc -l app.py _template_original/app.py
166
  5. Pushed commits to renamed Space: `c86df49..41ac444`
167
 
168
  **Key Learnings:**
 
169
  - Local folder name ≠ git repo identity (can rename locally without affecting remote)
170
  - Git remote URL determines push destination (updated to `agentbee`)
171
  - HuggingFace Space name is independent of local folder name
172
  - All work preserved through rename process
173
 
174
  **Current State:**
 
175
  - Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
176
  - Remote: `mangubee/agentbee` (renamed on HuggingFace)
177
  - Sync: ✅ All changes pushed
@@ -187,6 +322,7 @@ wc -l app.py _template_original/app.py
187
  **Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.
188
 
189
  **Solution:** Created `docs/gaia_submission_guide.md` documenting:
 
190
  - **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
191
  - **Official GAIA Leaderboard** (future): 450+ questions, different submission format
192
  - API routes, submission formats, scoring differences
@@ -202,9 +338,11 @@ wc -l app.py _template_original/app.py
202
  | Submission | JSON POST | File upload |
203
 
204
  **Created Files:**
 
205
  - **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards
206
 
207
  **Modified Files:**
 
208
  - **README.md** - Added note linking to submission guide
209
 
210
  ---
@@ -216,6 +354,7 @@ wc -l app.py _template_original/app.py
216
  **Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.
217
 
218
  **Implementation:**
 
219
  - Added `eval_task_ids` textbox in UI (line 763-770)
220
  - Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
221
  - Filtering logic: Parses comma-separated IDs, filters `questions_data`
@@ -223,11 +362,13 @@ wc -l app.py _template_original/app.py
223
  - Overrides question_limit when provided
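The filtering logic above can be sketched as a small pure function; the name `filter_questions` is illustrative, and the actual code in `run_and_submit_all()` may be inlined differently.

```python
# Illustrative sketch of the task-ID debug filter described above; explicit
# IDs override question_limit, matching the behavior documented in this entry.
from typing import Optional

def filter_questions(
    questions_data: list,
    task_ids: str = "",
    question_limit: Optional[int] = None,
) -> list:
    """Parse comma-separated task IDs and keep only matching questions."""
    ids = [t.strip() for t in task_ids.split(",") if t.strip()]
    if ids:  # explicit IDs win over question_limit
        wanted = set(ids)
        return [q for q in questions_data if q.get("task_id") in wanted]
    if question_limit:
        return questions_data[:question_limit]
    return questions_data
```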
224
 
225
  **Usage:**
 
226
  ```
227
  Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
228
  ```
229
 
230
  **Modified Files:**
 
231
  - **app.py** (~30 lines added)
232
  - UI: `eval_task_ids` textbox
233
  - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic
@@ -244,6 +385,7 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
244
  **Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).
245
 
246
  **Modified Files:**
 
247
  - **src/tools/calculator.py** (~15 lines modified)
248
  - `timeout()` context manager: Try/except for signal.alarm() failure
249
  - Logs warning when timeout protection disabled
@@ -256,17 +398,20 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
256
  **Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.
257
 
258
  **Solution:** Changed to boolean `system_error: yes/no` field:
 
259
  - `system_error: yes` - Technical/system error from our code (don't submit)
260
  - `system_error: no` - AI response (submit answer, even if wrong)
261
  - Added `error_log` field with full error details for system errors
262
 
263
  **Implementation:**
 
264
  - `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
265
  - Results table: "System Error" column (yes/no), "Error Log" column (when yes)
266
  - JSON export: `system_error` field, `error_log` field (when system error)
267
  - Submission logic: Only submit when `system_error == "no"`
268
 
269
  **Modified Files:**
 
270
  - **app.py** (~30 lines modified)
271
  - `a_determine_status()`: Returns tuple instead of string
272
  - `process_single_question()`: Uses new format, adds `error_log`
@@ -280,6 +425,7 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
280
  **Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
281
 
282
  **Solution:** Removed all fallback-related UI elements:
 
283
  - Removed `enable_fallback_checkbox` from Test Question tab
284
  - Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
285
  - Removed `enable_fallback` parameter from `test_single_question()` function
@@ -288,6 +434,7 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
288
  - Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
289
 
290
  **Modified Files:**
 
291
  - **app.py** (~20 lines removed)
292
  - Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
293
  - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
@@ -299,24 +446,28 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
299
  ## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
300
 
301
  **Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
 
302
  - 4 providers to test for each feature
303
  - Complex debugging with multiple code paths
304
  - Longer, less clear error messages
305
  - Adding complexity without clear benefit
306
 
307
  **Solution:** Archive fallback mechanism, use single provider only
 
308
  - Removed fallback provider loop (Gemini → HF → Groq → Claude)
309
  - Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
310
  - If provider fails, error is raised immediately
311
  - Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
312
 
313
  **Benefits:**
 
314
  - ✅ Reduced code complexity
315
  - ✅ Faster debugging (one code path)
316
  - ✅ Clearer error messages
317
  - ✅ No double work on features
318
 
319
  **Modified Files:**
 
320
  - **src/agent/llm_client.py** (~25 lines removed)
321
  - Simplified `_call_with_fallback()`: Removed fallback logic
322
  - **dev/dev_260112_02_fallback_archived.md** (NEW)
@@ -330,11 +481,13 @@ Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b
330
  **Problem:** Score dropped from 5% → 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
331
 
332
  **Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
 
333
  ```python
334
  {"results": [...], "source": "tavily", "query": "...", "count": N}
335
  ```
336
 
337
  **Solution:** Handle both dict formats in evidence extraction:
 
338
  ```python
339
  if isinstance(result, dict):
340
  if "answer" in result:
@@ -349,6 +502,7 @@ if isinstance(result, dict):
349
  ```
350
 
351
  **Modified Files:**
 
352
  - **src/agent/graph.py** (~40 lines modified)
353
  - Updated evidence extraction in primary path
354
  - Updated evidence extraction in fallback path
@@ -356,6 +510,7 @@ if isinstance(result, dict):
356
  **Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
357
 
358
  **Summary of Fixes (Session 2026-01-12):**
 
359
  1. ✅ File download from HF dataset (5/5 files)
360
  2. ✅ Absolute paths from script location
361
  3. ✅ Evidence formatting for vision tools (dict → answer)
@@ -370,6 +525,7 @@ if isinstance(result, dict):
370
  **Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
371
 
372
  **Solution:** Extract 'answer' field from dict results before adding to evidence:
 
373
  ```python
374
  # Before
375
  evidence.append(f"[{tool_name}] {result}") # Dict → string representation
@@ -382,6 +538,7 @@ elif isinstance(result, str):
382
  ```
383
 
384
  **Modified Files:**
 
385
  - **src/agent/graph.py** (~15 lines modified)
386
  - Updated `execute_node()`: Extract 'answer' from dict results
387
  - Fixed both primary and fallback execution paths
@@ -399,11 +556,13 @@ elif isinstance(result, str):
399
  **Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
400
 
401
  **Solution:** Return absolute paths from `download_task_file()`
 
402
  - Changed: `target_path = os.path.join(save_dir, file_name)`
403
  - To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
404
  - Now tools can find files regardless of working directory
405
 
406
  **Modified Files:**
 
407
  - **app.py** (~3 lines modified)
408
  - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
409
 
@@ -416,6 +575,7 @@ elif isinstance(result, str):
416
  **Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
417
 
418
  **Investigation:**
 
419
  - Checked API spec: Endpoint exists with proper documentation
420
  - Tested download: HTTP 404 "No file path associated with task_id"
421
  - Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
@@ -424,12 +584,14 @@ elif isinstance(result, str):
424
  **Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
425
 
426
  **Solution:** Switch from evaluation API to GAIA dataset download
 
427
  - Use `huggingface_hub.hf_hub_download()` to fetch files
428
  - Download to `_cache/gaia_files/` (runtime cache)
429
  - File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
430
  - Added cache checking (reuse downloaded files)
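The path mapping and cache check above can be sketched as follows. The repo id `gaia-benchmark/GAIA` and the gated-access requirement are assumptions about the dataset location; the actual `download_task_file()` in app.py may differ.

```python
# Sketch of the dataset download described above; the 2023/validation/{task_id}.{ext}
# layout comes from this entry.
import os
import shutil

def gaia_repo_path(task_id: str, file_name: str, split: str = "validation") -> str:
    """Map a task to its file path inside the GAIA dataset repo."""
    ext = os.path.splitext(file_name)[1]
    return f"2023/{split}/{task_id}{ext}"

def download_task_file(task_id: str, file_name: str, save_dir: str = "_cache/gaia_files") -> str:
    ext = os.path.splitext(file_name)[1]
    target = os.path.abspath(os.path.join(save_dir, f"{task_id}{ext}"))
    if os.path.exists(target):  # cache check: reuse downloaded files
        return target
    os.makedirs(save_dir, exist_ok=True)
    # Gated dataset: requires an accepted licence and an HF token.
    from huggingface_hub import hf_hub_download
    cached = hf_hub_download(
        repo_id="gaia-benchmark/GAIA",
        repo_type="dataset",
        filename=gaia_repo_path(task_id, file_name),
    )
    shutil.copy(cached, target)
    return target  # absolute path, so tools work regardless of cwd
```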
431
 
432
  **Files with attachments (5/20 questions):**
 
433
  - `cca530fc`: Chess position image (.png)
434
  - `99c9cc74`: Pie recipe audio (.mp3)
435
  - `f918266a`: Python code (.py)
@@ -437,6 +599,7 @@ elif isinstance(result, str):
437
  - `7bd855d8`: Menu sales Excel (.xlsx)
438
 
439
  **Modified Files:**
 
440
  - **app.py** (~70 lines modified)
441
  - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
442
  - Changed signature: `download_task_file(task_id, file_name, save_dir)`
@@ -447,30 +610,34 @@ elif isinstance(result, str):
447
  - Updated `process_single_question()`: Pass `file_name` to download function
448
 
449
  **Known Limitations:**
 
450
  - Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
451
  - `.mp3` audio files still unsupported
452
  - `.py` code execution still unsupported
453
 
454
  **Next Steps:**
 
455
  1. Test new download implementation
456
  2. Expand tool support for .mp3 (audio transcription)
457
  3. Expand tool support for .py (code execution)
458
 
459
  ---
460
 
461
- ## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA
462
 
463
  **Problem:** Need to validate HF vision works before complex GAIA evaluation.
464
 
465
  **Solution:** Single smoke test with simple red square image.
466
 
467
  **Result:** ✅ PASSED
 
468
  - Model: `google/gemma-3-27b-it:scaleway`
469
  - Answer: "The image is a solid, uniform field of red color..."
470
  - Provider routing: Working correctly
471
  - Settings integration: Fixed
472
 
473
  **Modified Files:**
 
474
  - **src/config/settings.py** (~5 lines added)
475
  - Added `HF_TOKEN` and `HF_VISION_MODEL` config
476
  - Added `hf_token` and `hf_vision_model` to Settings class
@@ -480,6 +647,7 @@ elif isinstance(result, str):
480
  - Tests basic image description
481
 
482
  **Bug Fixes:**
 
483
  - Removed unsupported `timeout` parameter from `chat_completion()`
484
 
485
  **Next Steps:** Phase 3 - GAIA evaluation with HF vision
@@ -491,11 +659,13 @@ elif isinstance(result, str):
491
  **Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.
492
 
493
  **Solution:**
 
494
  - Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
495
  - Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
496
  - Each provider fails independently (NO fallback chains during testing)
497
 
498
  **Modified Files:**
 
499
  - **src/tools/vision.py** (~120 lines added/modified)
500
  - Added `analyze_image_hf()` function with retry logic
501
  - Updated `analyze_image()` routing with provider selection
@@ -505,14 +675,15 @@ elif isinstance(result, str):
505
 
506
  **Validated Models (Phase 0 Extended Testing):**
507
 
508
- | Rank | Model | Provider | Speed | Notes |
509
- |------|-------|----------|-------|-------|
510
- | 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
511
- | 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
512
- | 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
513
- | 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
514
 
515
  **Failed Models (not vision-capable):**
 
516
  - `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
517
  - `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
518
  - `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")
@@ -531,17 +702,20 @@ elif isinstance(result, str):
531
  **Test Results:**
532
 
533
  **Working Models:**
 
534
  - `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
535
  - `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
536
  - `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand
537
 
538
  **Failed Models:**
 
539
  - `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
540
  - `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
541
  - `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
542
  - `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request
543
 
544
  **Output Files:**
 
545
  - `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
546
  - `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
547
  - `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test
@@ -601,11 +775,11 @@ elif isinstance(result, str):
601
 
602
  **Critical Discovery - Large Image Handling:**
603
 
604
- | Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
605
- |-------|-------------------|---------------------|----------------|
606
- | aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
607
- | Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
608
- | ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
609
 
610
  **API Behavior:**
611
 
@@ -693,28 +867,33 @@ elif isinstance(result, str):
693
  **Solution - Plan Corrections Applied:**
694
 
695
  1. **Added Phase 0: API Validation (CRITICAL)**
 
696
  - Test HF Inference API with vision models BEFORE implementation
697
  - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
698
  - Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
699
  - Time saved: Prevents 2-3 hours implementing non-functional code
700
 
701
  2. **Removed Fallback Logic from Testing**
 
702
  - Each provider fails independently with clear error message
703
  - NO fallback chains (HF → Gemini → Claude) during testing
704
  - Philosophy: Build capability knowledge, don't hide problems
705
  - Log exact failure reasons for debugging
706
 
707
  3. **Added Smoke Tests (Phase 2)**
 
708
  - 4 tests before GAIA: description, OCR, counting, single GAIA question
709
  - Decision gate: ≥3/4 must pass before full evaluation
710
  - Prevents debugging chess positions when basic integration broken
711
 
712
  4. **Added Decision Gates**
 
713
  - Gate 1 (Phase 0): API validation → GO/NO-GO
714
  - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
715
  - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate
716
 
717
  5. **Added Backup Strategy Documentation**
 
718
  - Option C: HF Spaces deployment (custom endpoint)
719
  - Option D: Local transformers library (no API)
720
  - Option E: Hybrid (HF text + Gemini/Claude vision)
@@ -741,14 +920,14 @@ elif isinstance(result, str):
741
 
742
  **Key Changes Summary:**
743
 
744
- | Before | After |
745
- |--------|-------|
746
- | Jump to implementation | Phase 0: Validate API first |
747
- | Fallback chains | No fallbacks, fail independently |
748
- | Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
749
- | Direct to GAIA | Smoke tests → GAIA |
750
- | No backup plan | 3 backup options documented |
751
- | Single success criteria | Per-phase criteria + decision gates |
752
 
753
  **Benefits:**
754
 
@@ -859,10 +1038,11 @@ def analyze_image(image_path: str, question: Optional[str] = None) -> Dict:
859
  **Modified Files:**
860
 
861
  - **app.py** (~10 lines modified)
 
862
  - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
863
  - Changed: `exports/` → `_cache/`
864
- - Updated docstring: "All environments: Saves to ./_cache/gaia_results_TIMESTAMP.json"
865
- - Updated comment: "Save to _cache/ folder (internal runtime storage, not accessible via HF UI)"
866
 
867
  - **.gitignore** (~3 lines added)
868
  - Added `_cache/` to ignore list
 
1
  # Session Changelog
2
 
3
+ ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation
4
+
5
+ **Discovery:** HuggingFace Provider Suffix Behavior - Auto-Routing is Bad Practice
6
+
7
+ **Finding:** Models WITHOUT `:provider` suffix work via HF auto-routing, but this is unreliable.
8
+
9
+ **Test Result:**
10
+ ```python
11
+ # Without provider - WORKS but uses HF default routing
12
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct" # ✅ Works, but...
13
+ # Response: "Test successful."
14
+
15
+ # With explicit provider - RECOMMENDED
16
+ HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway" # ✅ Reliable
17
+ ```
18
+
19
+ **Why Auto-Routing is Bad Practice:**
20
+
21
+ | Issue | Impact |
22
+ |-------|--------|
23
+ | **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together) |
24
+ | **Inconsistent latency** | 2s one run, 20s next run (different provider selected) |
25
+ | **No cost control** | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
26
+ | **Debugging nightmare** | Can't reproduce issues when provider is unknown |
27
+ | **Silent failures** | Provider might be down, HF retries with different one |
28
+
29
+ **Best Practice: ALWAYS specify provider**
30
+
31
+ ```python
32
+ # BAD - Unreliable
33
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
34
+
35
+ # GOOD - Explicit, predictable
36
+ HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"
37
+ HF_MODEL = "Qwen/Qwen2.5-72B-Instruct:cerebras"
38
+ HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
39
+ ```
40
+
41
+ **Available Providers for Text Models:**
42
+ - `:scaleway` - Fast, reliable (recommended for Llama)
43
+ - `:cerebras` - Very fast (recommended for Qwen)
44
+ - `:novita` - Fast, reputable
45
+ - `:together` - Reliable
46
+ - `:sambanova` - Fast but expensive
47
+
48
+ **Action Taken:** Updated code to always use explicit `:provider` suffix
49
+
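Honoring the explicit suffix can be sketched by splitting the spec and passing the provider to the client. This assumes a huggingface_hub version with inference-provider routing (the `provider` parameter on `InferenceClient`); the helper name is illustrative, not from the repo.

```python
# Sketch: parse "org/model:provider" and route to that provider explicitly,
# instead of relying on HF auto-routing.
import os

def split_model_spec(spec: str):
    """'org/model:provider' -> (model, provider); provider is None when absent."""
    model, _, provider = spec.partition(":")
    return model, provider or None

if __name__ == "__main__":
    from huggingface_hub import InferenceClient
    model, provider = split_model_spec("meta-llama/Llama-3.3-70B-Instruct:scaleway")
    client = InferenceClient(model=model, provider=provider, token=os.environ["HF_TOKEN"])
    resp = client.chat_completion(messages=[{"role": "user", "content": "ping"}])
    print(resp.choices[0].message.content)
```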
50
+ ---
51
+
52
+ ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration
53
+
54
+ **Model Changes:**
55
+ 1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
56
+ 2. Llama 3.3 70B (Scaleway) → Failed synthesis
57
+ 3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing
58
+
59
+ **openai/gpt-oss-120b:**
60
+ - OpenAI's 120B-parameter open-weight model
61
+ - Strong reasoning capability
62
+ - Optimized for function calling and tool use
63
+
64
+ ---
65
+
66
+ ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation (Original)
67
+
68
+ **Problem:** Qwen 2.5 72B fails synthesis despite having complete transcript evidence (738 chars).
69
+
70
+ **Root Cause Analysis:**
71
+
72
+ - Transcript contains all 3 species: "giant petrel", "emperor", "adelie" (Whisper error: "deli")
73
+ - Qwen 2.5 cannot resolve transcription errors ("deli" → "adelie penguin")
74
+ - Qwen 2.5 weak at entity extraction + counting from noisy text
75
+ - Returns "Unable to answer" instead of reasoning through ambiguity
76
+
77
+ **Transcript Quality Assessment:**
78
+
79
+ - **NOT clear enough for current LLM** - requires:
80
+ 1. Error tolerance ("deli" → "adelie")
81
+ 2. World knowledge (Antarctic bird species)
82
+ 3. Entity extraction from narrative text
83
+ 4. Temporal reasoning ("simultaneously" = same scene)
84
+
85
+ **Answer from transcript:** 3 species (giant petrel, emperor penguin, adelie penguin)
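As an aside (not part of the repo), the "deli" → "adelie" repair the LLM failed to make is mechanical fuzzy matching, which is one way to sanity-check a cached transcript during debugging:

```python
# Illustrative only: map a noisy Whisper token back to a known species name
# with stdlib difflib. The species list comes from this entry's analysis.
import difflib

SPECIES = ["giant petrel", "emperor penguin", "adelie penguin"]

def closest_species(token: str):
    """Match a (possibly garbled) token against species first words."""
    first_words = {s.split()[0]: s for s in SPECIES}
    match = difflib.get_close_matches(token.lower(), list(first_words), n=1, cutoff=0.6)
    return first_words[match[0]] if match else None
```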
86
+
87
+ **Solution:** Upgrade to Llama 3.3 70B Instruct (Scaleway provider)
88
+
89
+ - Better reasoning and instruction following
90
+ - Stronger entity extraction from noisy context
91
+ - Better at handling transcription ambiguities
92
+
93
+ **Modified Files:**
94
+
95
+ - **src/agent/llm_client.py** (line 37) - Model: Qwen 2.5 → Llama 3.3 70B
96
+
97
+ ---
98
+
99
+ ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Transcript Caching for Debugging
100
+
101
+ **Problem:** Transcription works (738 chars from Whisper) but LLM returns "Unable to answer". Need to inspect raw transcript to debug synthesis failure.
102
+
103
+ **Solution:** Added `save_transcript_to_cache()` function to save transcripts to `_cache/{video_id}_transcript.txt` for both API and Whisper paths.
104
+
105
+ **Modified Files:**
106
+
107
+ - **src/tools/youtube.py** (+30 lines)
108
+ - Added `save_transcript_to_cache()` function (lines 55-79)
109
+ - Calls after successful API transcript retrieval (line 164)
110
+ - Calls after successful Whisper transcription (line 317)
111
+ - File format includes metadata: video_id, source, length, timestamp
112
+
113
+ **File Format:**
114
+
115
+ ```
116
+ # YouTube Transcript
117
+ # Video ID: L1vXCYZAYYM
118
+ # Source: whisper
119
+ # Length: 738 characters
120
+ # Generated: 2026-01-13T02:27:...
121
+
122
+ <transcript text>
123
+ ```
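The cache writer can be sketched as below; this mirrors the file format shown above, but the actual `save_transcript_to_cache()` in src/tools/youtube.py (lines 55-79) may differ.

```python
# Sketch: write transcript plus a small metadata header to
# _cache/{video_id}_transcript.txt, matching the format documented above.
from datetime import datetime, timezone
from pathlib import Path

CACHE_DIR = Path("_cache")

def save_transcript_to_cache(video_id: str, transcript: str, source: str) -> Path:
    """Persist a transcript for debugging; returns the cache file path."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{video_id}_transcript.txt"
    header = (
        "# YouTube Transcript\n"
        f"# Video ID: {video_id}\n"
        f"# Source: {source}\n"          # 'api' or 'whisper'
        f"# Length: {len(transcript)} characters\n"
        f"# Generated: {datetime.now(timezone.utc).isoformat()}\n\n"
    )
    path.write_text(header + transcript, encoding="utf-8")
    return path
```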
124
+
125
+ **Next Steps:**
126
+
127
+ - Test on question #3 (bird species) - inspect cached transcript
128
+ - Debug LLM synthesis failure if transcript contains correct answer
129
+
130
+ ---
131
+
132
  ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
133
 
134
  **Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
 
136
  **Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
137
 
138
  **Modified Files:**
139
+
140
  - **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
141
  - **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
142
+ - **src/tools/__init__.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
143
  - **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
144
  - **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
145
 
146
  **Key Technical Decisions:**
147
+
148
  - **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
149
  - **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
150
  - **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
 
152
  - **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
153
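The primary/fallback flow above can be sketched as follows (a minimal sketch, not the project's actual code: `get_captions` and `whisper_fallback` are injected stand-ins for `YouTubeTranscriptApi.get_transcript` and the project's `transcribe_from_audio`; the returned dict mirrors the tool's format):

```python
def fetch_transcript(video_id, get_captions, whisper_fallback):
    """Try the fast caption API first; fall back to Whisper on any failure.

    get_captions / whisper_fallback are injected stand-ins so the
    orchestration can be shown without network access.
    """
    try:
        # Primary path: instant caption fetch (1-3 s when captions exist)
        segments = get_captions(video_id)  # list of {"text": ...} caption chunks
        text = " ".join(seg["text"] for seg in segments)
        return {"text": text, "video_id": video_id, "source": "api"}
    except Exception:
        # Slow path: yt-dlp audio download + Whisper (30s-2min, GPU on Spaces)
        result = whisper_fallback(video_id)
        return {"text": result["text"], "video_id": video_id, "source": "whisper"}
```

The `source` field ("api" vs "whisper") is what lets the cached transcript header record which path produced the text.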
 
  **Expected Impact:**
+
  - Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
  - Score: 10% → 20% (2/20 → 4/20 correct)
  - **Target progress:** Two more correct answers needed to reach the 30% requirement (6/20)

  ---

  ## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable

  ### Fixed (Course API Contract - DO NOT CHANGE)

+ | Aspect | Value | Cannot Change |
+ | ----------------------- | -------------------------------------- | ------------- |
+ | **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
+ | **Questions Route** | `GET /questions` | ❌ |
+ | **Submit Route** | `POST /submit` | ❌ |
+ | **Number of Questions** | **20** (always 20) | ❌ |
+ | **Question Source** | GAIA validation set, level 1 | ❌ |
+ | **Randomness** | **NO - Fixed set** | ❌ |
+ | **Difficulty** | All level 1 (easiest) | ❌ |
+ | **Filter Criteria** | By tools/steps complexity | ❌ |
+ | **Scoring** | EXACT MATCH | ❌ |
+ | **Target Score** | 30% = 6/20 correct | ❌ |

  ### The 20 Questions (ALWAYS the Same)

+ | # | Full Task ID | Description | Tools Required |
+ | --- | -------------------------------------- | ------------------------------ | ---------------- |
+ | 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
+ | 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
+ | 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
+ | 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
+ | 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
+ | 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
+ | 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
+ | 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
+ | 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
+ | 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
+ | 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
+ | 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
+ | 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
+ | 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
+ | 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
+ | 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
+ | 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
+ | 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
+ | 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
+ | 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |

  **NOT random** - same 20 questions every submission!

  ### Our Additions (SAFE to Modify)

+ | Feature | Purpose | Required? |
+ | ------------------ | ---------------------- | ----------- |
+ | Question Limit | Debug: run first N | ✅ Optional |
+ | Target Task IDs | Debug: run specific | ✅ Optional |
+ | ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
+ | System Error Field | UX: error tracking | ✅ Optional |
  | File Download (HF) | Feature: support files | ✅ Optional |
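The ThreadPoolExecutor addition in the table can be sketched as (names illustrative; the real `run_and_submit_all` does more bookkeeping per question):

```python
from concurrent.futures import ThreadPoolExecutor

def answer_all(questions, answer_fn, workers=4):
    """Evaluate questions concurrently while keeping results in input order.

    pool.map preserves ordering, so result i always matches question i.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(answer_fn, questions))
```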
 
  ### Key Learnings

  **Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.

  **Process:**
+
  1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
  2. Removed git-specific files (`.git/` folder, `.gitattributes`)
  3. Copied to project as `_template_original/` (static reference, no git)
  4. Cleaned up temporary clone from Downloads

  **Why Static Reference:**
+
  - No `.git/` folder → won't interfere with project's git
  - No `.gitattributes` → clean file comparison
  - Pure reference material for diff/comparison
  - Can see exactly what changed from original

  **Template Original Contents:**
+
  - `app.py` (8777 bytes - original)
  - `README.md` (400 bytes - original)
  - `requirements.txt` (15 bytes - original)

  **Comparison Commands:**
+
  ```bash
  # Compare file sizes
  ls -lh _template_original/app.py app.py

  ```

  **Created Files:**
+
+ - **\_template_original/** (NEW) - Static reference to original template (3 files)

  ---

  **Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.

  **Actions Taken:**
+
  1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
  2. Updated local git remote to point to new URL
  3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)

  5. Pushed commits to renamed Space: `c86df49..41ac444`

  **Key Learnings:**
+
  - Local folder name ≠ git repo identity (can rename locally without affecting remote)
  - Git remote URL determines push destination (updated to `agentbee`)
  - HuggingFace Space name is independent of local folder name
  - All work preserved through rename process

  **Current State:**
+
  - Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
  - Remote: `mangubee/agentbee` (renamed on HuggingFace)
  - Sync: ✅ All changes pushed
  **Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.

  **Solution:** Created `docs/gaia_submission_guide.md` documenting:
+
  - **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
  - **Official GAIA Leaderboard** (future): 450+ questions, different submission format
  - API routes, submission formats, scoring differences

  | Submission | JSON POST | File upload |

  **Created Files:**
+
  - **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards

  **Modified Files:**
+
  - **README.md** - Added note linking to submission guide

  ---

  **Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.

  **Implementation:**
+
  - Added `eval_task_ids` textbox in UI (line 763-770)
  - Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
  - Filtering logic: Parses comma-separated IDs, filters `questions_data`
  - Overrides question_limit when provided

  **Usage:**
+
  ```
  Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
  ```
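The parse-and-filter step can be sketched as (a sketch under the assumption that each entry in `questions_data` carries a `task_id` key, as the course API returns):

```python
def filter_questions(questions_data, task_ids=""):
    """Keep only the requested task IDs; empty input means run everything."""
    # Split on commas and strip whitespace so "a, b" and "a,b" both work
    wanted = {tid.strip() for tid in task_ids.split(",") if tid.strip()}
    if not wanted:
        return questions_data  # no filter -> full evaluation
    return [q for q in questions_data if q["task_id"] in wanted]
```

A non-empty filter naturally overrides any question limit, since only the matching questions survive.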
 
  **Modified Files:**
+
  - **app.py** (~30 lines added)
    - UI: `eval_task_ids` textbox
    - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic

  **Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).

  **Modified Files:**
+
  - **src/tools/calculator.py** (~15 lines modified)
    - `timeout()` context manager: Try/except for signal.alarm() failure
    - Logs warning when timeout protection disabled
 
  **Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.

  **Solution:** Changed to boolean `system_error: yes/no` field:
+
  - `system_error: yes` - Technical/system error from our code (don't submit)
  - `system_error: no` - AI response (submit answer, even if wrong)
  - Added `error_log` field with full error details for system errors

  **Implementation:**
+
  - `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
  - Results table: "System Error" column (yes/no), "Error Log" column (when yes)
  - JSON export: `system_error` field, `error_log` field (when system error)
  - Submission logic: Only submit when `system_error == "no"`
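The tuple contract can be sketched as (illustrative sketch; `determine_status` here is a stand-in for the real `a_determine_status`, which inspects more failure shapes):

```python
def determine_status(answer, exc=None):
    """Return (is_system_error, error_log) per the contract above."""
    if exc is not None:
        # system_error: yes -> record details, skip submission
        return True, f"{type(exc).__name__}: {exc}"
    # system_error: no -> submit the AI answer, even if it is wrong
    return False, None
```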
 
  **Modified Files:**
+
  - **app.py** (~30 lines modified)
    - `a_determine_status()`: Returns tuple instead of string
    - `process_single_question()`: Uses new format, adds `error_log`

  **Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py

  **Solution:** Removed all fallback-related UI elements:
+
  - Removed `enable_fallback_checkbox` from Test Question tab
  - Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
  - Removed `enable_fallback` parameter from `test_single_question()` function
  - Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")

  **Modified Files:**
+
  - **app.py** (~20 lines removed)
    - Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
    - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)

  ## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived

  **Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) created double work:
+
  - 4 providers to test for each feature
  - Complex debugging with multiple code paths
  - Longer, less clear error messages
  - Adding complexity without clear benefit

  **Solution:** Archive fallback mechanism, use single provider only
+
  - Removed fallback provider loop (Gemini → HF → Groq → Claude)
  - Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
  - If provider fails, error is raised immediately
  - Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`

  **Benefits:**
+
  - ✅ Reduced code complexity
  - ✅ Faster debugging (one code path)
  - ✅ Clearer error messages
  - ✅ No double work on features

  **Modified Files:**
+
  - **src/agent/llm_client.py** (~25 lines removed)
    - Simplified `_call_with_fallback()`: Removed fallback logic
  - **dev/dev_260112_02_fallback_archived.md** (NEW)

  **Problem:** Score dropped from 5% to 0% after the first evidence fix. Evidence was showing the dict's string representation: `{'results': [{'title': '...', ...}]`

  **Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return a different dict structure with a `"results"` key:
+
  ```python
  {"results": [...], "source": "tavily", "query": "...", "count": N}
  ```

  **Solution:** Handle both dict formats in evidence extraction:
+
  ```python
  if isinstance(result, dict):
      if "answer" in result:

  ```
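Put together, the dual-format handling can be sketched as (self-contained sketch; the `title`/`content` keys inside each search result are assumptions, not confirmed field names):

```python
def format_evidence(tool_name, result):
    """Render a tool result as plain text for the synthesis prompt."""
    if isinstance(result, dict):
        if "answer" in result:        # vision tools: {'answer': ..., 'model': ...}
            return f"[{tool_name}] {result['answer']}"
        if "results" in result:       # search tools: {'results': [...], 'source': ...}
            items = [f"- {r.get('title', '')}: {r.get('content', '')}"
                     for r in result["results"]]
            return f"[{tool_name}]\n" + "\n".join(items)
    return f"[{tool_name}] {result}"  # plain strings pass through unchanged
```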
 
  **Modified Files:**
+
  - **src/agent/graph.py** (~40 lines modified)
    - Updated evidence extraction in primary path
    - Updated evidence extraction in fallback path

  **Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).

  **Summary of Fixes (Session 2026-01-12):**
+
  1. ✅ File download from HF dataset (5/5 files)
  2. ✅ Absolute paths from script location
  3. ✅ Evidence formatting for vision tools (dict → answer)

  **Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.

  **Solution:** Extract 'answer' field from dict results before adding to evidence:
+
  ```python
  # Before
  evidence.append(f"[{tool_name}] {result}")  # Dict → string representation

  ```

  **Modified Files:**
+
  - **src/agent/graph.py** (~15 lines modified)
    - Updated `execute_node()`: Extract 'answer' from dict results
    - Fixed both primary and fallback execution paths

  **Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.

  **Solution:** Return absolute paths from `download_task_file()`
+
  - Changed: `target_path = os.path.join(save_dir, file_name)`
  - To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
  - Now tools can find files regardless of working directory
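A quick illustration of the before/after behavior (paths only, no download; the file name reuses the chess-image task ID prefix from the question list):

```python
import os

save_dir = "_cache/gaia_files"
rel_path = os.path.join(save_dir, "cca530fc.png")
# Resolved against the cwd at download time, so later cwd changes are harmless
abs_path = os.path.abspath(os.path.join(save_dir, "cca530fc.png"))
```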
 
  **Modified Files:**
+
  - **app.py** (~3 lines modified)
    - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`

  **Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.

  **Investigation:**
+
  - Checked API spec: Endpoint exists with proper documentation
  - Tested download: HTTP 404 "No file path associated with task_id"
  - Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files

  **Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.

  **Solution:** Switch from evaluation API to GAIA dataset download
+
  - Use `huggingface_hub.hf_hub_download()` to fetch files
  - Download to `_cache/gaia_files/` (runtime cache)
  - File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
  - Added cache checking (reuse downloaded files)
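The repo-internal path can be built like this (a sketch; the returned string is what would be passed as `filename=` to `hf_hub_download(repo_type="dataset", ...)`, assuming the dataset lives at the usual `gaia-benchmark/GAIA` repo - the download call itself is omitted here):

```python
from pathlib import Path

def gaia_repo_path(task_id, file_name, split="validation"):
    """Map a task's attachment to its path inside the GAIA dataset repo,
    following the 2023/{split}/{task_id}.{ext} layout documented above."""
    ext = Path(file_name).suffix  # keep the attachment's original extension
    return f"2023/{split}/{task_id}{ext}"
```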
 
  **Files with attachments (5/20 questions):**
+
  - `cca530fc`: Chess position image (.png)
  - `99c9cc74`: Pie recipe audio (.mp3)
  - `f918266a`: Python code (.py)
  - `7bd855d8`: Menu sales Excel (.xlsx)

  **Modified Files:**
+
  - **app.py** (~70 lines modified)
    - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
    - Changed signature: `download_task_file(task_id, file_name, save_dir)`
    - Updated `process_single_question()`: Pass `file_name` to download function

  **Known Limitations:**
+
  - Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
  - `.mp3` audio files still unsupported
  - `.py` code execution still unsupported

  **Next Steps:**
+
  1. Test new download implementation
  2. Expand tool support for .mp3 (audio transcription)
  3. Expand tool support for .py (code execution)

  ---

+ ## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision LLM Validated - Ready for GAIA

  **Problem:** Need to validate HF vision works before complex GAIA evaluation.

  **Solution:** Single smoke test with simple red square image.

  **Result:** ✅ PASSED
+
  - Model: `google/gemma-3-27b-it:scaleway`
  - Answer: "The image is a solid, uniform field of red color..."
  - Provider routing: Working correctly
  - Settings integration: Fixed

  **Modified Files:**
+
  - **src/config/settings.py** (~5 lines added)
    - Added `HF_TOKEN` and `HF_VISION_MODEL` config
    - Added `hf_token` and `hf_vision_model` to Settings class

  - Tests basic image description

  **Bug Fixes:**
+
  - Removed unsupported `timeout` parameter from `chat_completion()`

  **Next Steps:** Phase 3 - GAIA evaluation with HF vision
  **Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.

  **Solution:**
+
  - Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
  - Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
  - Each provider fails independently (NO fallback chains during testing)
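The routing fix boils down to a dispatch on `LLM_PROVIDER` (sketch with injected handlers; the real `analyze_image` calls provider-specific functions such as `analyze_image_hf`):

```python
import os

def route_vision_call(image_path, question, handlers):
    """Dispatch to the provider selected via LLM_PROVIDER.

    No fallback chain: an unknown provider (or a failing handler)
    raises immediately with a clear error.
    """
    provider = os.getenv("LLM_PROVIDER", "hf").lower()
    if provider not in handlers:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider}")
    return handlers[provider](image_path, question)
```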
 
  **Modified Files:**
+
  - **src/tools/vision.py** (~120 lines added/modified)
    - Added `analyze_image_hf()` function with retry logic
    - Updated `analyze_image()` routing with provider selection

  **Validated Models (Phase 0 Extended Testing):**

+ | Rank | Model | Provider | Speed | Notes |
+ | ---- | -------------------------------- | -------- | ----- | ------------------------------ |
+ | 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
+ | 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
+ | 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
+ | 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |

  **Failed Models (not vision-capable):**
+
  - `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
  - `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
  - `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")

  **Test Results:**

  **Working Models:**
+
  - `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
  - `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
  - `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand

  **Failed Models:**
+
  - `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
  - `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
  - `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
  - `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request

  **Output Files:**
+
  - `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
  - `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
  - `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test

  **Critical Discovery - Large Image Handling:**

+ | Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
+ | -------------------- | ----------------- | ------------------- | ---------------------------- |
+ | aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
+ | Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
+ | ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |

  **API Behavior:**

  **Solution - Plan Corrections Applied:**

  1. **Added Phase 0: API Validation (CRITICAL)**
+
     - Test HF Inference API with vision models BEFORE implementation
     - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
     - Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
     - Time saved: Prevents 2-3 hours implementing non-functional code

  2. **Removed Fallback Logic from Testing**
+
     - Each provider fails independently with clear error message
     - NO fallback chains (HF → Gemini → Claude) during testing
     - Philosophy: Build capability knowledge, don't hide problems
     - Log exact failure reasons for debugging

  3. **Added Smoke Tests (Phase 2)**
+
     - 4 tests before GAIA: description, OCR, counting, single GAIA question
     - Decision gate: ≥3/4 must pass before full evaluation
     - Prevents debugging chess positions when basic integration broken

  4. **Added Decision Gates**
+
     - Gate 1 (Phase 0): API validation → GO/NO-GO
     - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
     - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate

  5. **Added Backup Strategy Documentation**
+
     - Option C: HF Spaces deployment (custom endpoint)
     - Option D: Local transformers library (no API)
     - Option E: Hybrid (HF text + Gemini/Claude vision)

  **Key Changes Summary:**

+ | Before | After |
+ | ----------------------------- | ----------------------------------- |
+ | Jump to implementation | Phase 0: Validate API first |
+ | Fallback chains | No fallbacks, fail independently |
+ | Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
+ | Direct to GAIA | Smoke tests → GAIA |
+ | No backup plan | 3 backup options documented |
+ | Single success criteria | Per-phase criteria + decision gates |

  **Benefits:**

  **Modified Files:**

  - **app.py** (~10 lines modified)
+
    - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
    - Changed: `exports/` → `_cache/`
+   - Updated docstring: "All environments: Saves to ./\_cache/gaia_results_TIMESTAMP.json"
+   - Updated comment: "Save to \_cache/ folder (internal runtime storage, not accessible via HF UI)"

  - **.gitignore** (~3 lines added)
    - Added `_cache/` to ignore list
src/agent/llm_client.py CHANGED
@@ -34,8 +34,9 @@ CLAUDE_MODEL = "claude-sonnet-4-5-20250929"
  GEMINI_MODEL = "gemini-2.0-flash-exp"

  # HuggingFace Configuration
- HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # Excellent for function calling and reasoning
- # Alternatives: "meta-llama/Llama-3.1-70B-Instruct", "NousResearch/Hermes-3-Llama-3.1-70B"
+ HF_MODEL = "openai/gpt-oss-120b:scaleway"  # OpenAI's 120B open-weight model, strong reasoning
+ # Previous: "meta-llama/Llama-3.3-70B-Instruct:scaleway" (failed synthesis)
+ # Previous: "Qwen/Qwen2.5-72B-Instruct" (weaker at handling transcription errors)

  # Groq Configuration
  GROQ_MODEL = "openai/gpt-oss-120b"
src/tools/youtube.py CHANGED
@@ -48,6 +48,37 @@ CLEANUP_TEMP_FILES = True
  logger = logging.getLogger(__name__)


+ # ============================================================================
+ # Transcript Cache
+ # ============================================================================
+
+ def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
+     """
+     Save transcript to _cache folder for debugging.
+
+     Args:
+         video_id: YouTube video ID
+         text: Transcript text
+         source: "api" or "whisper"
+     """
+     try:
+         cache_dir = Path("_cache")
+         cache_dir.mkdir(exist_ok=True)
+
+         cache_file = cache_dir / f"{video_id}_transcript.txt"
+         with open(cache_file, "w", encoding="utf-8") as f:
+             f.write(f"# YouTube Transcript\n")
+             f.write(f"# Video ID: {video_id}\n")
+             f.write(f"# Source: {source}\n")
+             f.write(f"# Length: {len(text)} characters\n")
+             f.write(f"# Generated: {__import__('datetime').datetime.now().isoformat()}\n")
+             f.write(f"\n{text}\n")
+
+         logger.info(f"Transcript saved to cache: {cache_file}")
+     except Exception as e:
+         logger.warning(f"Failed to save transcript to cache: {e}")
+
+
  # ============================================================================
  # YouTube URL Parser
  # =============================================================================
@@ -129,6 +160,9 @@ def get_youtube_transcript(video_id: str) -> Dict[str, Any]:

      logger.info(f"Transcript fetched: {len(text)} characters")

+     # Save to cache for debugging
+     save_transcript_to_cache(video_id, text, "api")
+
      return {
          "text": text,
          "video_id": video_id,
@@ -279,6 +313,9 @@ def transcribe_from_audio(video_url: str) -> Dict[str, Any]:
          logger.warning(f"Failed to cleanup temp file: {e}")

      if result["success"]:
+         # Save to cache for debugging
+         save_transcript_to_cache(video_id, result["text"], "whisper")
+
          return {
              "text": result["text"],
              "video_id": video_id,