mangubee Claude committed
Commit 0d77f39 · 1 Parent(s): 2577d6f

feat: phase1 planning and video processing research

- Rewrite PLAN.md: focus on system error fixes (current 10% → 30% target)
- Add brainstorming_phase1_youtube.md: YouTube transcript approach research
- Add _template_original/: static reference for comparison
- Update CHANGELOG.md: course test setup analysis, 20 fixed questions documented
- Research findings:
* Transcript-first approach: 1-3s vs frame extraction 1-10min total
* Frame extraction is fast (5-20s), vision processing is slow
* Whisper open-source on ZeroGPU: 5-10x speedup, free
* Unified Phase 1+2 architecture: single transcribe_audio() function

Co-Authored-By: Claude <noreply@anthropic.com>
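The transcript-first finding above can be sketched as a small helper. This is illustrative only — the helper names are hypothetical, and `youtube-transcript-api` is the library the plan proposes for the fast (~1-3s) path:

```python
# Sketch of the transcript-first approach (helper names are illustrative).
def transcript_to_text(entries):
    """Join caption entries ({'text', 'start', 'duration'}) into one string."""
    return " ".join(e["text"].strip() for e in entries if e["text"].strip())

def fetch_youtube_transcript(video_id):
    """Fast path (~1-3s): pull existing captions instead of extracting video frames."""
    # requires: pip install youtube-transcript-api
    from youtube_transcript_api import YouTubeTranscriptApi
    return transcript_to_text(YouTubeTranscriptApi.get_transcript(video_id))
```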

CHANGELOG.md CHANGED
@@ -1,5 +1,153 @@
  # Session Changelog
 
+ ## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
+
+ **Purpose:** Understand which parts of the template are FIXED (course API contract) vs CAN MODIFY (our improvements).
+
+ **Critical Finding:** The course API has a FIXED test setup - questions are NOT random.
+
+ ### Fixed (Course API Contract - DO NOT CHANGE)
+
+ | Aspect | Value | Changeable? |
+ |--------|-------|-------------|
+ | **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
+ | **Questions Route** | `GET /questions` | ❌ |
+ | **Submit Route** | `POST /submit` | ❌ |
+ | **Number of Questions** | **20** (always 20) | ❌ |
+ | **Question Source** | GAIA validation set, level 1 | ❌ |
+ | **Randomness** | **NO - fixed set** | ❌ |
+ | **Difficulty** | All level 1 (easiest) | ❌ |
+ | **Filter Criteria** | By tools/steps complexity | ❌ |
+ | **Scoring** | EXACT MATCH | ❌ |
+ | **Target Score** | 30% = 6/20 correct | ❌ |
+
+ ### The 20 Questions (ALWAYS the Same)
+
+ | # | Full Task ID | Description | Tools Required |
+ |---|--------------|-------------|----------------|
+ | 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
+ | 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
+ | 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
+ | 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
+ | 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
+ | 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
+ | 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
+ | 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
+ | 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
+ | 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
+ | 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
+ | 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
+ | 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
+ | 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
+ | 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
+ | 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
+ | 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
+ | 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
+ | 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
+ | 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |
+
+ **NOT random** - same 20 questions every submission!
+
+ ### Template Contract (MUST Preserve)
+
+ ```python
+ # REQUIRED - Do NOT change
+ questions_url = f"{api_url}/questions"  # Fixed route
+ submit_url = f"{api_url}/submit"        # Fixed route
+
+ submission_data = {
+     "username": username,
+     "agent_code": agent_code,
+     "answers": answers_payload,  # Fixed format
+ }
+ ```
+
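Exercising that fixed contract might look like the sketch below. `build_submission` is a hypothetical helper, and the per-answer `submitted_answer` field name is an assumption about the course payload shape, not something confirmed here:

```python
def build_submission(username, agent_code, answers):
    # Build the fixed payload shape; "submitted_answer" is assumed to be
    # the per-answer field name the course API expects.
    return {
        "username": username,
        "agent_code": agent_code,
        "answers": [
            {"task_id": task_id, "submitted_answer": answer}
            for task_id, answer in answers.items()
        ],
    }

def submit(api_url, payload):
    # POST to the fixed /submit route (requires: pip install requests)
    import requests
    resp = requests.post(f"{api_url}/submit", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()
```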
+ ### Our Additions (SAFE to Modify)
+
+ | Feature | Purpose | Required? |
+ |---------|---------|-----------|
+ | Question Limit | Debug: run first N questions | ✅ Optional |
+ | Target Task IDs | Debug: run specific questions | ✅ Optional |
+ | ThreadPoolExecutor | Speed: concurrent processing | ✅ Optional |
+ | System Error Field | UX: error tracking | ✅ Optional |
+ | File Download (HF) | Feature: file attachment support | ✅ Optional |
+
+ ### Key Learnings
+
+ 1. **Question set is FIXED** - not random, always the same 20
+ 2. **API routes are FIXED** - cannot change endpoints
+ 3. **Submission format is FIXED** - must match exactly
+ 4. **Our additions are OPTIONAL** - debug/feature extras we added
+ 5. **Original template is 8777 bytes** - ours is 32722 bytes (~4x larger)
+
+ **Reference:** `_template_original/app.py` for the original structure
+
+ ---
+
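The Question Limit and Target Task IDs debug helpers from the table above could be sketched like this (an illustrative helper; the real app.py wiring may differ):

```python
def filter_questions(questions, limit=None, target_task_ids=None):
    """Debug helpers: run only a subset of the fixed 20 questions."""
    if target_task_ids:
        wanted = set(target_task_ids)
        questions = [q for q in questions if q.get("task_id") in wanted]
    if limit is not None:
        questions = questions[:limit]
    return questions
```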
+ ## [2026-01-12] [Infrastructure] [COMPLETED] Original Template Reference Added
+
+ **Purpose:** Compare current work with the original template to understand changes and avoid breaking the template structure.
+
+ **Process:**
+ 1. Cloned the original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
+ 2. Removed git-specific files (`.git/` folder, `.gitattributes`)
+ 3. Copied into the project as `_template_original/` (static reference, no git)
+ 4. Cleaned up the temporary clone from Downloads
+
+ **Why Static Reference:**
+ - No `.git/` folder → won't interfere with the project's git
+ - No `.gitattributes` → clean file comparison
+ - Pure reference material for diff/comparison
+ - Can see exactly what changed from the original
+
+ **Template Original Contents:**
+ - `app.py` (8777 bytes - original)
+ - `README.md` (400 bytes - original)
+ - `requirements.txt` (15 bytes - original)
+
+ **Comparison Commands:**
+ ```bash
+ # Compare file sizes
+ ls -lh _template_original/app.py app.py
+
+ # See differences
+ diff _template_original/app.py app.py
+
+ # Count lines added
+ wc -l app.py _template_original/app.py
+ ```
+
+ **Created Files:**
+ - **_template_original/** (NEW) - static reference to the original template (3 files)
+
+ ---
+
+ ## [2026-01-12] [Infrastructure] [COMPLETED] HuggingFace Space Renamed
+
+ **Context:** User wanted to compare current work with the original template, so the current Space had to be renamed to free up the `Final_Assignment_Template` name.
+
+ **Actions Taken:**
+ 1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
+ 2. Updated local git remote to point to the new URL
+ 3. Committed all of today's changes (system error field, calculator fix, target task IDs, docs)
+ 4. Pulled from remote (sync after rename - already up to date)
+ 5. Pushed commits to the renamed Space: `c86df49..41ac444`
+
+ **Key Learnings:**
+ - Local folder name ≠ git repo identity (can rename locally without affecting the remote)
+ - The git remote URL determines the push destination (updated to `agentbee`)
+ - The HuggingFace Space name is independent of the local folder name
+ - All work preserved through the rename process
+
+ **Current State:**
+ - Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
+ - Remote: `mangubee/agentbee` (renamed on HuggingFace)
+ - Sync: ✅ all changes pushed
+ - Git: all commits synced
+ - Template: `_template_original/` added for comparison
+
+ ---
+
  ## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification

  **Problem:** Confusion about which leaderboard we're submitting to. Mistakenly thought we needed to submit to official GAIA, but we're actually implementing the course assignment API.
PLAN.md CHANGED
@@ -1,698 +1,338 @@
- # Implementation Plan - LLM Selection Routing & HuggingFace Vision Support
-
- **Date:** 2026-01-06
- **Status:** Planning

  ## Objective

- Fix LLM selection routing so UI provider selection propagates to ALL tools (planning, tool selection, synthesis, AND vision). Enable vision capability using HuggingFace multimodal models.
-
- ## Current Problems
-
- 1. **Vision tool ignores UI selection** - Hardcoded Gemini → Claude fallback
- 2. **No HuggingFace vision support** - HF Inference API integration missing multimodal capability
- 3. **Inconsistent routing** - Planning/tool selection respect UI, vision doesn't
-
- ## Solution Architecture
-
- ### Part 1: Fix LLM Selection Routing
-
- **Goal:** When user selects "HuggingFace" in UI, ALL agent components use HuggingFace LLM
-
- **Changes needed:**
-
- 1. **Vision tool (src/tools/vision.py):**
-    - Add `analyze_image_hf()` function for HuggingFace multimodal models
-    - Modify `analyze_image()` to check `os.getenv("LLM_PROVIDER")`
-    - Route to correct provider: `gemini`, `huggingface`, `groq`, `claude`
-    - Respect `ENABLE_LLM_FALLBACK` setting
-
- 2. **Ensure consistency:**
-    - Planning: Already respects `LLM_PROVIDER`
-    - Tool selection: Already respects `LLM_PROVIDER`
-    - Synthesis: Already respects `LLM_PROVIDER`
-    - Vision: **NEEDS FIX** - Add routing logic
-
- ### Part 2: HuggingFace Vision Capability
-
- **Two approaches identified:**
-
- #### Option A: Direct Multimodal LLM (Preferred)
-
- **Approach:** Use HuggingFace multimodal models that support vision + text
-
- **Candidate models:**
-
- 1. **Qwen/Qwen2-VL-72B-Instruct** (Recommended)
-    - 72B parameters, vision-language model
-    - Supports: images, video, text
-    - API: HuggingFace Inference API (paid tier)
-    - Format: Base64 image + text prompt
-
- 2. **meta-llama/Llama-3.2-90B-Vision-Instruct**
-    - 90B parameters, multimodal
-    - Supports: images + text
-    - API: HuggingFace Inference API
-
- 3. **microsoft/Phi-3.5-vision-instruct**
-    - Smaller model (3.8B), efficient
-    - Supports: images + text
-    - Good for testing/debugging
-
  **Implementation:**
-
- - Use `InferenceClient.chat_completion()` with image content
- - Send base64-encoded images in messages array
- - Similar to Claude vision integration pattern

  **Pros:**
-
- - ✅ Native vision understanding
- - ✅ Single API call (no preprocessing)
- - ✅ Better accuracy for visual reasoning
- - ✅ Consistent with current architecture

  **Cons:**

- - Requires HuggingFace paid tier (but user confirmed they have this)
- - ❌ Need to verify which models work with Inference API
-
- #### Option B: Image-to-Text Preprocessing
-
- **Approach:** Convert images to text descriptions using a separate tool, then feed to a text-only LLM
-
- **Tools available:**
-
- 1. **BLIP-2** (Salesforce/blip2-opt-2.7b)
-    - Image captioning model
-    - Converts image → text description
-
- 2. **LLaVA** (llava-hf/llava-1.5-7b-hf)
-    - Vision-language assistant
-    - Image → detailed text
-
- 3. **OpenCV + OCR** (pytesseract)
-    - Extract text from images
-    - Good for documents/screenshots

  **Implementation:**

- - Load image → Run BLIP-2/LLaVA → Get text description
- - Pass text description to HuggingFace text-only LLM
- - Two-step process: vision → text → reasoning
-
- **Pros:**
-
- - ✅ Works with any text-only LLM
- - ✅ Cheaper (can use smaller vision models)
- - ✅ Fallback option if multimodal API unavailable
-
- **Cons:**
-
- - ❌ Two API calls (slower)
- - ❌ Information loss in image → text conversion
- - ❌ Poor for complex visual reasoning (chess positions, video analysis)
- - ❌ Extra dependency management
-
- ## Recommended Approach
-
- **Use Option A: Direct Multimodal LLM (Qwen2-VL-72B-Instruct)**
-
- **Reasoning:**
-
- 1. User has HuggingFace paid tier access (confirmed)
- 2. GAIA questions require complex visual reasoning (chess positions, video analysis)
- 3. Simpler architecture - consistent with existing pattern
- 4. Better accuracy for benchmark performance
- 5. Focus on HF testing first, Groq later
-
- **Fallback:** Keep Option B as backup if multimodal API doesn't work
-
- ## Implementation Steps
-
- ### Phase 0: API Validation (CRITICAL - DO THIS FIRST)
-
- **Goal:** Validate HuggingFace Inference API supports vision BEFORE implementation
-
- **Decision Gate 1:** Only proceed to Phase 1 if at least one model works
-
- #### Step 0.1: Test HF Inference API with Vision Models
-
- - [ ] Test **Phi-3.5-vision-instruct** (3.8B) - Smallest, fastest iteration
- - [ ] Test **Llama-3.2-11B-Vision-Instruct** - Medium model
- - [ ] Test **Qwen2-VL-72B-Instruct** - Largest, only if needed
- - [ ] Simple test: Load apple image, ask "What is this?"
- - [ ] Verify API accepts vision input (base64, URL, or file path)
- - [ ] Document response format and error patterns
-
- #### Step 0.2: Test Image Format Support
-
- - [ ] Base64 encoding in messages
- - [ ] Direct URL support
- - [ ] Local file path support
- - [ ] Document which format(s) work
-
- #### Step 0.3: Document API Behavior
-
- - [ ] Response structure (JSON schema)
- - [ ] Error patterns (quota, rate limit, invalid input)
- - [ ] Rate limits and quotas
- - [ ] Model selection recommendation
-
- #### Step 0.4: Decision Gate - GO/NO-GO
-
- - [ ] **GO:** At least 1 model works → Proceed to Phase 1
- - [ ] **NO-GO:** 0 models work → Pivot to backup options:
-   - **Option C:** HF Spaces deployment (custom endpoint)
-   - **Option D:** Local transformers library (no API)
-   - **Option E:** Hybrid (HF text + Gemini/Claude vision only)
-
- **Phase 0 Status:** ✅ COMPLETED - Multiple working models found
-
- **Validated Models (Ranked by Speed):**
-
- | Rank | Model | Provider | Speed | Notes |
- |------|-------|----------|-------|-------|
- | 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand, fastest |
- | 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less-known brand |
- | 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
- | 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
-
- **Format:** Base64 encoding only (file:// URLs don't work)
- **Test image:** 2.1MB workspace photo (realistic large image)

  ---

- ### Phase 1: HuggingFace Vision Implementation
-
- **Goal:** Implement `analyze_image_hf()` using the validated API pattern
-
- **Validated from Phase 0:**
-
- - Model: `google/gemma-3-27b-it:scaleway` (RECOMMENDED - fastest, Google brand)
- - Format: Base64 encoding in messages array
- - Timeout: ~6 seconds for 2.1MB image
-
- #### Step 1.1: Implement `analyze_image_hf()` in vision.py
-
- - [ ] Add function signature matching existing pattern
- - [ ] Use **google/gemma-3-27b-it:scaleway** (validated, fastest)
- - [ ] Format: Base64 encode images in messages array
- - [ ] Add retry logic with exponential backoff (3 attempts)
- - [ ] Handle API errors with clear error messages
- - [ ] Set 120s timeout for large images
- - [ ] **NO fallback logic** - fail loudly for debugging
-
- #### Step 1.2: Fix Vision Tool Routing (NO FALLBACKS)
-
- - [ ] Modify `analyze_image()` to check `os.getenv("LLM_PROVIDER")`
- - [ ] Add routing logic (each provider fails independently):
-
- ```python
- if provider == "huggingface":
-     return analyze_image_hf(image_path, question)  # Fail if error
- elif provider == "gemini":
-     return analyze_image_gemini(image_path, question)  # Fail if error
- elif provider == "claude":
-     return analyze_image_claude(image_path, question)  # Fail if error
- # NO fallback chains during testing - defeats isolation purpose
- ```
-
- - [ ] Log exact failure reason for debugging
- - [ ] Add placeholder for `groq` (future Phase 4)
-
- #### Step 1.3: Update Configuration
-
- - [ ] Add `HF_VISION_MODEL=CohereLabs/aya-vision-32b` to .env (validated from Phase 0)
- - [ ] Update `src/config/settings.py` with the vision model setting
- - [ ] Document alternatives (Qwen/Qwen3-VL-8B-Instruct for small images only)

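The retry behavior named in Step 1.1 could be sketched like this (an illustrative helper, not the project's actual code; delays would be 1s, 2s, 4s with the default base):

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    """Retry with exponential backoff; no provider fallback, fail loudly."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # surface the real error after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```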
  ---

- ### Phase 2: Smoke Tests (Before GAIA Evaluation)
-
- **Goal:** Validate basic vision works before complex GAIA questions
-
- **Decision Gate 2:** Only proceed to Phase 3 if ≥3/4 smoke tests pass
-
- #### Step 2.1: Simple Image Description Test
-
- - [ ] Test image: Photo of apple
- - [ ] Question: "Describe this image"
- - [ ] Expected: Basic object recognition works
- - [ ] Export: `output/smoke_test_description.json`
-
- #### Step 2.2: OCR Test
-
- - [ ] Test image: Image with text "Hello World"
- - [ ] Question: "What text do you see?"
- - [ ] Expected: Text extraction works
- - [ ] Export: `output/smoke_test_ocr.json`
-
- #### Step 2.3: Counting Test
-
- - [ ] Test image: Image with 3 distinct objects
- - [ ] Question: "How many objects are visible?"
- - [ ] Expected: Visual reasoning works
- - [ ] Export: `output/smoke_test_counting.json`
-
- #### Step 2.4: Single GAIA Question Test
-
- - [ ] Select easiest GAIA vision question
- - [ ] Run with HuggingFace provider
- - [ ] Verify end-to-end integration works
- - [ ] Export: `output/smoke_test_gaia_single.json`
-
- #### Step 2.5: Decision Gate - GO/NO-GO
-
- - [ ] **GO:** ≥3/4 smoke tests pass → Proceed to Phase 3
- - [ ] **NO-GO:** <3/4 pass → Debug before GAIA evaluation

  ---

- ### Phase 3: GAIA Evaluation (Only if Smoke Tests Pass)
-
- **Goal:** Test HuggingFace vision on the full GAIA benchmark
-
- #### Step 3.1: Run Full GAIA Evaluation (HuggingFace Only)
-
- - [ ] Set `LLM_PROVIDER=huggingface` in UI
- - [ ] Run all 20 questions
- - [ ] Export: `output/gaia_results_hf_TIMESTAMP.json` (HF only, no mixing)
- - [ ] Log which questions use the vision tool vs other tools
-
- #### Step 3.2: Analyze Results
-
- - [ ] Calculate accuracy: X/20 correct
- - [ ] Break down by question type:
-   - Vision questions: X/8 correct
-   - Non-vision questions: X/12 correct
- - [ ] Identify failure patterns (vision errors, wrong answers, tool selection errors)
- - [ ] Compare to 0% baseline
-
- #### Step 3.3: Build Capability Matrix
-
- - [ ] Document per-provider results:
-
- | Provider | Vision Questions | Accuracy | Notes |
- |----------|------------------|----------|-------|
- | HuggingFace (Phi-3.5) | 8/8 attempted | X% | [observations] |
- | Gemini (baseline) | 8/8 attempted | Y% | [comparison] |
-
- #### Step 3.4: Decision Gate - Optimization Decision
-
- - [ ] **If accuracy ≥20%:** Good enough, proceed to Phase 4 (media processing)
- - [ ] **If accuracy <20%:** Analyze failures, try a larger HF model (Llama-3.2 or Qwen2-VL)
- - [ ] **If accuracy <5%:** Re-evaluate approach, consider backup options

314
 
315
- ### Phase 4: Media Processing Gaps (After Vision Works)
316
-
317
- **Goal:** Add YouTube and audio support
318
 
319
- #### Step 4.1: YouTube Video Support
320
 
321
- - [ ] Add YouTube transcript extraction tool
322
- - [ ] Use `youtube-transcript-api` library
323
- - [ ] Extract dialogue/captions as text
324
- - [ ] Pass transcript to LLM for question answering
325
- - [ ] Test on GAIA YouTube questions (bird species, Stargate quote)
326
- - [ ] Export: `output/gaia_results_hf_with_youtube.json`
327
 
328
- #### Step 4.2: Audio File Support
 
 
329
 
330
- - [ ] Add audio transcription tool
331
- - [ ] Use OpenAI Whisper or HuggingFace audio models
332
- - [ ] Transcribe audio text
333
- - [ ] Pass transcript to LLM
334
- - [ ] Test on GAIA audio question (Strawberry pie.mp3)
335
- - [ ] Export: `output/gaia_results_hf_with_audio.json`
336
-
337
- ---
338
 
339
- ### Phase 5: Groq Vision Integration (Future)
340
-
341
- **Goal:** Add free tier fallback option
342
-
343
- #### Step 5.1: Add Groq Vision Support
344
-
345
- - [ ] Implement `analyze_image_groq()` using Llama-3.2-90B-Vision
346
- - [ ] Add to vision tool routing (independent, no fallback)
347
- - [ ] Test with Groq free tier (30 req/min)
348
- - [ ] Export: `output/gaia_results_groq_TIMESTAMP.json`
349
- - [ ] Compare accuracy: HF vs Groq
350
 
351
  ---

- ### Phase 6: Final Verification
-
- **Goal:** Document final results and verify all tests pass
-
- #### Step 6.1: Final GAIA Evaluation (All Media Types)
-
- - [ ] Run all 20 questions with HuggingFace
- - [ ] Verify: images, videos, audio all work
- - [ ] Export: `output/gaia_results_final_TIMESTAMP.json`
- - [ ] Document final accuracy vs 0% baseline
-
- #### Step 6.2: Regression Testing
-
- - [ ] Run all 99 tests
- - [ ] Verify no regressions introduced
- - [ ] Fix any broken tests
-
- #### Step 6.3: Documentation
-
- - [ ] Update CHANGELOG.md with final results
- - [ ] Update README.md with HF vision support
- - [ ] Document model selection strategy
-
- ## Files to Modify
-
- ### Phase 0-1: Core Vision Integration
-
- 1. **src/tools/vision.py** (~150 lines added/modified)
-    - Add `analyze_image_hf()` function (Phase 1)
-    - Modify `analyze_image()` routing logic - NO FALLBACKS (Phase 1)
-    - Add retry logic with exponential backoff
-    - Clear error messages for debugging
-
- 2. **.env** (~3 lines added)
-    - Add `HF_VISION_MODEL=microsoft/Phi-3.5-vision-instruct` (start small)
-    - Document alternatives: Llama-3.2-11B-Vision, Qwen2-VL-72B
-
- 3. **src/config/settings.py** (~5 lines)
-    - Add `hf_vision_model` setting
-    - Load from environment variable
-
- ### Phase 2-3: Testing Infrastructure
-
- 1. **test/test_vision_smoke.py** (NEW - ~100 lines)
-    - Smoke test suite: description, OCR, counting, single GAIA
-    - Export individual test results
-
- 2. **app.py** (optional - ~10 lines)
-    - Update export filenames to include provider: `gaia_results_hf_TIMESTAMP.json`
-    - Separate results per provider for capability matrix
-
- ### Phase 4: Media Processing
-
- 1. **src/tools/youtube.py** (NEW - ~80 lines)
-    - YouTube transcript extraction
-    - Use `youtube-transcript-api`
-
- 2. **src/tools/audio.py** (NEW - ~80 lines)
-    - Audio transcription (Whisper or HF audio models)
-    - Convert audio → text
-
- 3. **`src/tools/__init__.py`** (~10 lines)
-    - Register new tools: youtube_transcript, audio_transcribe
-
- 4. **requirements.txt** (~3 lines)
-    - Add `youtube-transcript-api`
-    - Add `openai-whisper` or HF audio model library
-
- ### Phase 6: Documentation
-
- 1. **README.md** (~30 lines modified)
-    - Document HF vision support
-    - List model options and selection strategy
-    - Update architecture diagram with media processing tools

  ## Success Criteria

- ### Phase 0: API Validation
-
- - [ ] At least 1 HF vision model works with Inference API
- - [ ] Image format documented (base64/URL/file)
- - [ ] Response format documented
-
- ### Phase 1: Implementation
-
- - [ ] `analyze_image_hf()` function implemented
- - [ ] Vision tool routing respects `LLM_PROVIDER` (NO FALLBACKS)
- - [ ] Clear error messages when a provider fails
-
- ### Phase 2: Smoke Tests
-
- - [ ] ≥3/4 smoke tests pass
- - [ ] Basic vision capabilities validated
-
- ### Phase 3: GAIA Evaluation
-
- - [ ] UI LLM selection propagates to vision tool
- - [ ] HuggingFace-only results exported: `output/gaia_results_hf_TIMESTAMP.json`
- - [ ] Accuracy measured and compared to 0% baseline
- - [ ] Capability matrix built (per-provider comparison)
-
- ### Phase 4-6: Full Coverage
-
- - [ ] YouTube video questions work (transcript extraction)
- - [ ] Audio questions work (transcription)
- - [ ] All 99 tests still passing
- - [ ] Final accuracy ≥20% (minimum acceptable)
-
- ## Backup Strategy Options
-
- If Phase 0 reveals the HF Inference API doesn't support vision:
-
- ### Option C: HuggingFace Spaces Deployment
-
- - Deploy custom vision model to HF Spaces
- - Use Inference Endpoints (paid tier)
- - More control, higher cost
-
- ### Option D: Local Transformers Library
-
- - Use `transformers` library directly (no API)
- - Load model locally: `AutoModelForVision2Seq`
- - Slower, requires GPU, but guaranteed to work
-
- ### Option E: Hybrid Architecture
-
- - Keep HuggingFace for text-only LLM
- - Use Gemini/Claude for vision only
- - Compromise: HF testing focus, but vision delegates to working providers
-
- ## Decision Gates Summary
-
- **Gate 1 (Phase 0):** Does HF API support vision?
-
- - **GO:** ≥1 model works → Phase 1
- - **NO-GO:** 0 models work → Pivot to Option C/D/E
-
- **Gate 2 (Phase 2):** Do smoke tests pass?
-
- - **GO:** ≥3/4 pass → Phase 3
- - **NO-GO:** <3/4 pass → Debug before GAIA
-
- **Gate 3 (Phase 3):** Is accuracy acceptable?
-
- - **GO:** ≥20% → Phase 4 (media processing)
- - **ITERATE:** <20% → Try a larger model or analyze failures
- - **PIVOT:** <5% → Re-evaluate approach
-
- ## Phase 0 Research Questions (Answer These First)
-
- 1. **Does HF Inference API support vision models?**
-    - Test Phi-3.5-vision-instruct with simple image
-    - Test Llama-3.2-11B-Vision-Instruct
-    - Test Qwen2-VL-72B-Instruct
-
- 2. **What's the image input format?**
-    - Base64 encoding in messages?
-    - Direct URL support?
-    - File path support?
-
- 3. **What's the response structure?**
-    - JSON schema format
-    - Error patterns
-    - Rate limits and quotas

  ## Next Actions

- **Phase 0 starts with:**
-
- 1. ==Research HF Inference API documentation for vision support==
- 2. Test simple vision API call with Phi-3.5-vision-instruct
- 3. Document working pattern or confirm API doesn't support vision
- 4. Decision gate: GO to Phase 1 or pivot to backup options
-
- ---
-
- ## Phase 7: GAIA File Attachment Support
-
- **Goal:** Enable the agent to download and process file attachments from GAIA questions
-
- **Problem:**
- - Current code ignores the `file_name` field in GAIA questions
- - Files not downloaded from the `GET /files/{task_id}` endpoint
- - Vision/file parsing tools fail with the placeholder `<provided_image_path>`
- - ~40% of questions (8/20) fail due to missing file handling
-
- ### Root Cause
-
- **GAIA Question Structure:**
- ```json
- {
-   "task_id": "abc123",
-   "question": "What's in this image?",
-   "file_name": "chess.png",    // NULL if no file
-   "file_path": "/files/abc123" // NULL if no file
- }
- ```
-
- **Current Code (app.py:249-290):**
- ```python
- def process_single_question(agent, item, index, total):
-     task_id = item.get("task_id")
-     question_text = item.get("question")
-     # ❌ MISSING: Check file_name
-     # ❌ MISSING: Download file
-     # ❌ MISSING: Pass file_path to agent
-
-     submitted_answer = agent(question_text)  # No file handling
- ```
-
- **Result:** LLM generates `vision(image_path="<provided_image_path>")` → File not found error
-
- ### Solution Architecture
-
- **Step 1: Add File Download Function**
-
- ```python
- def download_task_file(task_id: str, save_dir: str = "input/") -> Optional[str]:
-     """Download file attached to a GAIA question.
-
-     Args:
-         task_id: Question's task_id
-         save_dir: Directory to save file
-
-     Returns:
-         File path if downloaded, None if no file
-     """
-     api_url = "https://agents-course-unit4-scoring.hf.space"
-     file_url = f"{api_url}/files/{task_id}"
-
-     response = requests.get(file_url, timeout=30)
-     if response.status_code == 404:
-         return None  # No file attached to this task
-     response.raise_for_status()
-
-     # Get extension from Content-Type header
-     content_type = response.headers.get('Content-Type', '')
-     extension_map = {
-         'image/png': '.png',
-         'image/jpeg': '.jpg',
-         'application/pdf': '.pdf',
-         'text/csv': '.csv',
-         'application/json': '.json',
-         'application/vnd.ms-excel': '.xls',
-         'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': '.xlsx',
-     }
-     extension = extension_map.get(content_type, '.file')
-
-     # Save file
-     Path(save_dir).mkdir(exist_ok=True)
-     file_path = f"{save_dir}{task_id}{extension}"
-     with open(file_path, 'wb') as f:
-         f.write(response.content)
-
-     return file_path
- ```
-
- **Step 2: Modify Question Processing**
-
- ```python
- def process_single_question(agent, item, index, total):
-     task_id = item.get("task_id")
-     question_text = item.get("question")
-     file_name = item.get("file_name")  # ✅ NEW
-
-     # Download file if it exists
-     file_path = None
-     if file_name:
-         file_path = download_task_file(task_id)
-
-     # Pass file info to agent
-     submitted_answer = agent(question_text, file_path=file_path)  # ✅ NEW
- ```
-
- **Step 3: Update LLM Context**
-
- When file_path is provided, include it in the planning prompt:
- ```python
- if file_path:
-     question_context = f"Question: {question_text}\nAttached file: {file_path}"
- else:
-     question_context = question_text
- ```
-
- ### Implementation Steps
-
- #### Step 7.1: Add File Download Function
-
- - [ ] Create `download_task_file()` in app.py
- - [ ] Handle Content-Type to extension mapping
- - [ ] Handle 404 gracefully (no file for this task)
- - [ ] Create input/ directory if it doesn't exist
-
- #### Step 7.2: Modify Question Processing Loop
-
- - [ ] Check `item.get("file_name")` in process_single_question
- - [ ] Call download_task_file() if file_name exists
- - [ ] Pass file_path to agent invocation
-
- #### Step 7.3: Update Agent to Handle file_path
-
- - [ ] Modify agent to accept optional file_path parameter
- - [ ] Include file info in planning prompt
- - [ ] Update tool selection to use real file paths
-
- #### Step 7.4: Test File Handling
-
- - [ ] Test with image question (chess position)
- - [ ] Test with document question (Excel file)
- - [ ] Verify no more `<provided_image_path>` errors
-
- ### Files to Modify
-
- 1. **app.py** (~80 lines added/modified)
-    - Add download_task_file() function
-    - Modify process_single_question() to handle files
-    - Add input/ directory creation
-
- 2. **src/agent/graph.py** (~20 lines)
-    - Update agent state to include file_path
-    - Pass file info to planning prompt
-
- 3. **.gitignore** (~2 lines)
-    - Add input/ to ignore downloaded files
-
- ### Success Criteria
-
- - [ ] Image questions: Vision tool receives real file path
- - [ ] Document questions: parse_file tool receives real file path
- - [ ] No more `<provided_image_path>` errors
- - [ ] Accuracy improves from 10% toward 30%+
-
- ### Expected Impact
-
- | Before | After |
- |--------|-------|
- | 40% (8/20) fail with file errors | 0% file errors |
- | Vision questions: All fail | Vision questions: Can work |
- | Document questions: All fail | Document questions: Can work |
- | Max accuracy: ~60% | Max accuracy: ~100% potential |

1
+ # Implementation Plan - System Error Fixes for 30% Target
2
 
3
+ **Date:** 2026-01-13
4
+ **Status:** Active
5
+ **Current Score:** 10% (2/20 correct)
6
+ **Target:** 30% (6/20 correct)
7
 
8
  ## Objective
9
 
10
+ Fix remaining 6 system errors to unlock questions, then address LLM quality issues to reach 30% target (6/20 correct).
11
 
12
+ ## Current Status Analysis
13
 
14
+ ### ✅ Working (2/20 correct - 10%)
 
 
15
 
16
+ | # | Task | Status | Issue |
17
+ |---|------|--------|-------|
18
+ | 9 | Polish Ray actor | ✅ Correct | - |
19
+ | 15 | Vietnamese specimens | ✅ Correct | - |
20
 
21
+ ### ⚠️ System Errors (6/20 - Technical issues blocking)
22
 
23
+ | # | Task | Error | Type | Priority |
24
+ |---|------|-------|------|----------|
25
+ | **3** | YouTube video (bird species) | Vision tool can't handle URLs | Technical | **HIGH** |
26
+ | **5** | YouTube video (Teal'c) | Vision tool can't handle URLs | Technical | **HIGH** |
27
+ | **6** | CSV table (commutativity) | LLM tries to load `table_data.csv` | LLM Quality | MED |
28
+ | **10** | MP3 audio (pie recipe) | Unsupported file type | Technical | **MED** |
29
+ | **12** | Python code execution | Unsupported file type | Technical | **LOW** |
30
+ | **13** | MP3 audio (calculus) | Unsupported file type | Technical | **MED** |
31
 
32
+ ### ❌ LLM Quality Issues (12/20 - AI can't solve)
33
 
34
+ | # | Task | Answer | Expected | Type |
35
+ |---|------|--------|----------|------|
36
+ | 1 | Calculator | "Unable to answer" | Right | Reasoning |
37
+ | 2 | Wikipedia dinosaur | "Scott Hartman" | FunkMonk | Knowledge |
38
+ | 4 | Mercedes Sosa albums | "Unable to answer" | 3 | Knowledge |
39
+ | 7 | Chess position | "Unable to answer" | Rd5 | Vision+Reasoning |
40
+ | 8 | Grocery list (botany) | Wrong (includes fruits) | 5 items | Knowledge |
41
+ | 11 | Equine veterinarian | "Unable to answer" | Louvrier | Knowledge |
42
+ | 14 | NASA award | "Unable to answer" | 80GSFC21M0002 | Knowledge |
43
+ | 16 | Yankee at-bats | "Unable to answer" | 519 | Knowledge |
44
+ | 17 | Pitcher numbers | "Unable to answer" | Yoshida, Uehara | Knowledge |
45
+ | 18 | Olympics athletes | "Unable to answer" | CUB | Knowledge |
46
+ | 19 | Malko Competition | "Unable to answer" | Claus | Knowledge |
47
+ | 20 | Excel sales | "12096.00" | "89706.00" | Calculation |
48
 
49
+ ## Strategy
 
 
 
50
 
51
+ **Priority 1: Fix System Errors** (unlock 6 questions)
52
+ - YouTube videos (2 questions) - HIGH impact
53
+ - MP3 audio (2 questions) - Medium impact
54
+ - Python execution (1 question) - Low impact
55
+ - CSV table - LLM issue, not technical
56
 
57
+ **Priority 2: Improve LLM Quality** (address "Unable to answer" cases)
58
+ - Better prompting
59
+ - Tool selection improvements
60
+ - Reasoning enhancements
61
 
62
+ ## Implementation Plan
63
 
64
+ ### Phase 1: YouTube Video Support (HIGH Priority)
65
 
66
+ **Goal:** Fix questions #3 and #5 (YouTube videos)
67
 
68
+ **Root Cause:** Vision tool tries to process YouTube URLs directly, but:
69
+ - YouTube videos need to be downloaded first
70
+ - Vision tool expects image files, not video URLs
71
+ - Need to extract frames or use transcript
72
 
73
+ **Solution Options:**
74
 
75
+ #### Option A: YouTube Transcript (Recommended)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
  **Implementation:**
78
+ ```python
+ # NEW: src/tools/youtube.py
+ from youtube_transcript_api import YouTubeTranscriptApi
+
+ def get_youtube_transcript(video_url: str) -> str:
+     """Extract transcript from YouTube video."""
+     try:
+         # extract_video_id() is a small URL-parsing helper in the same module
+         video_id = extract_video_id(video_url)
+         transcript = YouTubeTranscriptApi.get_transcript(video_id)
+         # Join the timed caption entries into one plain-text block
+         return " ".join(entry["text"] for entry in transcript)
+     except Exception as e:
+         return f"ERROR: Could not extract transcript: {e}"
+ ```
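The `extract_video_id` helper used above is not spelled out anywhere yet. A minimal stdlib sketch could look like the following (the two URL shapes handled are assumptions about what the course questions contain; real URLs have more variants):

```python
# Hypothetical helper for src/tools/youtube.py: pull the 11-character video ID
# out of the two most common YouTube URL shapes. Pure stdlib, no network access.
from urllib.parse import urlparse, parse_qs

def extract_video_id(video_url: str) -> str:
    parsed = urlparse(video_url)
    if parsed.hostname == "youtu.be":          # short links: youtu.be/<id>
        return parsed.path.lstrip("/")
    if parsed.path == "/watch":                # watch links: ?v=<id>
        return parse_qs(parsed.query)["v"][0]
    raise ValueError(f"Unrecognized YouTube URL: {video_url}")

print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))  # dQw4w9WgXcQ
print(extract_video_id("https://youtu.be/dQw4w9WgXcQ"))                 # dQw4w9WgXcQ
```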
91
 
92
  **Pros:**
93
+ - ✅ Works with current LLM (text-based)
94
+ - ✅ Simple API (youtube-transcript-api library)
95
+ - ✅ Fast, no video download needed
96
+ - ✅ Solves both #3 and #5
 
97
 
98
  **Cons:**
99
+ - ❌ Won't work for visual-only questions (but our questions are about content)
100
+ - ❌ Might not capture visual details
101
 
102
+ **Decision:** Use transcript approach since questions ask about content (bird species, dialogue)
 
 
 
 
 
 
 
103
 
104
+ #### Option B: Video Frame Extraction
 
 
 
 
 
 
 
 
 
 
 
 
105
 
106
  **Implementation:**
107
+ - Download video (yt-dlp)
108
+ - Extract key frames (OpenCV)
109
+ - Pass frames to vision tool
110
 
111
+ **Pros:** Visual analysis
112
+ **Cons:** Slow, complex, overkill for content questions
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
113
 
114
+ #### Step 1.1: Install youtube-transcript-api
115
 
116
+ ```bash
117
+ uv add youtube-transcript-api
118
+ ```
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
+ #### Step 1.2: Create YouTube tool
 
 
 
121
 
122
+ ```python
+ # src/tools/youtube.py
+ def youtube_transcript(video_url: str) -> str:
+     """Extract transcript from YouTube video."""
+ ```
127
 
128
+ #### Step 1.3: Register tool
 
 
 
 
129
 
130
+ ```python
+ # src/tools/__init__.py
+ TOOLS = [
+     ...
+     {"name": "youtube_transcript", "func": youtube_transcript,
+      "description": "Extract transcript from YouTube video URL. Use when question mentions YouTube video content like dialogue, speech, or visual descriptions."},
+ ]
+ ```
138
 
139
+ #### Step 1.4: Test
140
 
141
+ ```bash
+ # Test on question #3
+ # Target Task ID: a1e91b78-d3d8-4675-bb8d-62741b4b68a6
+ ```
 
 
145
 
146
+ **Expected impact:** +2 questions (10% → 20% if both work)
 
147
 
148
  ---
149
 
150
+ ### Phase 2: MP3 Audio Support (MEDIUM Priority)
151
 
152
+ **Goal:** Fix questions #10 and #13 (MP3 audio files)
153
 
154
+ **Root Cause:** parse_file doesn't support .mp3
155
 
156
+ **Solution:** Add audio transcription tool
 
 
157
 
158
+ **Implementation:**
159
+ ```python
+ # NEW: src/tools/audio.py
+ import whisper
+
+ def transcribe_audio(file_path: str) -> str:
+     """Transcribe audio file to text using OpenAI Whisper."""
+     model = whisper.load_model("base")
+     result = model.transcribe(file_path)
+     return result["text"]
+ ```
 
 
 
 
 
 
 
 
 
 
 
 
 
169
 
170
+ **Alternative:** HuggingFace audio models (free)
171
+ - `openai/whisper-base`
172
+ - Use via Inference API
173
 
174
+ **Step 2.1:** Choose implementation (Whisper vs HF)
175
+ **Step 2.2:** Implement audio tool
176
+ **Step 2.3:** Add to TOOLS registry
177
+ **Step 2.4:** Test on #10 and #13
178
 
179
+ **Expected impact:** +2 questions (20% → 30% if both work, reaching the course target)
 
 
180
 
181
  ---
182
 
183
+ ### Phase 3: Python Code Execution (LOW Priority)
 
 
 
 
 
 
184
 
185
+ **Goal:** Fix question #12 (Python code output)
 
 
 
186
 
187
+ **Root Cause:** parse_file doesn't support .py execution
188
 
189
+ **Solution:** Add code execution tool (sandboxed)
 
 
 
190
 
191
+ **Security Concern:** ⚠️ **DANGEROUS** - executing arbitrary Python code
192
 
193
+ **Options:**
194
+ 1. **Restricted execution** - Only allow specific operations
195
+ 2. **Docker container** - Isolate execution
196
+ 3. **Skip for now** - Defer due to security concerns
197
 
198
+ **Decision:** Mark as **DEFERRED** due to security complexity
199
 
200
+ **Expected impact:** +1 question (if implemented)
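If this phase is revisited, Option 1 could start from a subprocess runner with a hard timeout. This is a sketch only, not a real sandbox: the child still runs with the host's privileges, so untrusted input would still need a container (Option 2).

```python
# Sketch only (phase is deferred): run an attached .py file in a child process
# with a hard timeout and capture its stdout. Minimal isolation.
import subprocess
import sys

def run_python_file(file_path: str, timeout: int = 10) -> str:
    try:
        result = subprocess.run(
            [sys.executable, file_path],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "ERROR: execution timed out"
    if result.returncode != 0:
        return f"ERROR: {result.stderr.strip()}"
    return result.stdout.strip()
```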
 
 
 
 
 
 
 
 
201
 
202
  ---
203
 
204
+ ### Phase 4: CSV Table Issue (LLM Quality)
 
 
 
 
 
 
 
 
 
205
 
206
+ **Goal:** Fix question #6 (table commutativity)
207
 
208
+ **Root Cause:** LLM tries to load `table_data.csv` when data is IN the question
 
 
 
 
 
209
 
210
+ **Solution:** This is NOT a technical gap - the LLM needs better prompts or tool selection
211
 
212
+ **Approaches:**
213
+ 1. Improve system prompt to recognize data in questions
214
+ 2. Add hint in question preprocessing
215
+ 3. Special handling for markdown tables in questions
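For approach 3, a rough detector for an inline markdown table could gate the special handling. The pattern below is an assumption about how the question text is formatted (pipe-delimited header row followed by a `|---|` separator row):

```python
# Heuristic sketch: a markdown table shows up as a pipe-delimited header row
# immediately followed by a |---|---| separator row.
import re

_SEPARATOR = re.compile(r"^\s*\|?\s*:?-{3,}.*\|")

def question_contains_table(question: str) -> bool:
    lines = question.splitlines()
    return any(
        "|" in lines[i] and bool(_SEPARATOR.match(lines[i + 1]))
        for i in range(len(lines) - 1)
    )
```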
216
 
217
+ **Current workaround:** System correctly identifies as "no_evidence" and doesn't crash
 
 
 
218
 
219
+ **Status:** Defer to LLM quality improvements (Phase 5)
 
 
 
 
220
 
221
  ---
222
 
223
+ ### Phase 5: LLM Quality Improvements
 
 
224
 
225
+ **Goal:** Convert "Unable to answer" → correct answers
226
 
227
+ **Target questions (by category):**
 
 
 
 
 
228
 
229
+ **Knowledge/Research (9 questions):** #2, #4, #8, #11, #14, #16, #17, #18, #19
230
+ **Reasoning/Calculation (2 questions):** #1, #20
231
+ **Vision+Reasoning (1 question):** #7
232
 
233
+ **Approaches:**
234
+ 1. **Better prompts** - Emphasize exact answer format
235
+ 2. **Tool selection hints** - Guide LLM to use appropriate tools
236
+ 3. **Few-shot examples** - Show LLM expected answer format
237
+ 4. **Chain-of-thought** - Encourage step-by-step reasoning
 
 
 
238
 
239
+ **Implementation:**
240
+ - Update `synthesize_answer()` prompt
241
+ - Add answer format examples to system prompt
242
+ - Improve tool descriptions for better selection
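As an illustration of combining the prompt and few-shot ideas, the synthesis prompt could carry answer-format shots. The wording below is invented for illustration, not the project's actual prompt; the Q/A pairs echo expected answers from the table above:

```python
# Hypothetical few-shot block for synthesize_answer(); the Q/A pairs show the
# exact-match format the course scorer expects (no units, no explanation).
ANSWER_FORMAT_EXAMPLES = """\
Respond with ONLY the final answer: no units, no explanation, no punctuation.

Q: How many studio albums were released in that period?
A: 3

Q: What is the conductor's first name?
A: Claus
"""

def build_synthesis_prompt(question: str, evidence: str) -> str:
    return f"{ANSWER_FORMAT_EXAMPLES}\nEvidence:\n{evidence}\n\nQ: {question}\nA:"
```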
 
 
 
 
 
 
 
243
 
244
  ---
245
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
246
  ## Success Criteria
247
 
248
+ ### Phase 1: YouTube Support
+ - [ ] YouTube transcript tool implemented
+ - [ ] Question #3 answered correctly (bird species = "3")
+ - [ ] Question #5 answered correctly (Teal'c quote = "Extremely")
+ - [ ] **Score: 10% → 20% (4/20)**
+
+ ### Phase 2: MP3 Support
+ - [ ] Audio transcription tool implemented
+ - [ ] Question #10 answered correctly (pie ingredients)
+ - [ ] Question #13 answered correctly (page numbers)
+ - [ ] **Score: 20% → 30% (6/20)** ✅ TARGET REACHED
+
+ ### Phase 3: Python Execution
+ - [ ] Code execution tool implemented (sandboxed)
+ - [ ] Question #12 answered correctly (output = "0")
+ - [ ] **Score: 30% → 35% (7/20)**
+
+ ### Phase 4: CSV Table
+ - [ ] LLM recognizes data in question
+ - [ ] Question #6 answered correctly ("b, e")
+ - [ ] **Score: 35% → 40% (8/20)**
+
+ ### Phase 5: LLM Quality
+ - [ ] "Unable to answer" reduced by 50%
+ - [ ] At least 3 more knowledge questions correct
+ - [ ] **Score: 40% → 55%+ (11/20)**
274
 
275
+ ## Files to Modify
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
276
 
277
+ ### Phase 1: YouTube
278
+ 1. **requirements.txt** - Add `youtube-transcript-api`
279
+ 2. **src/tools/youtube.py** (NEW) - YouTube transcript extraction
280
+ 3. **src/tools/__init__.py** - Register youtube_transcript tool
281
 
282
+ ### Phase 2: MP3 Audio
283
+ 1. **requirements.txt** - Add `openai-whisper` or HF audio
284
+ 2. **src/tools/audio.py** (NEW) - Audio transcription
285
+ 3. **src/tools/__init__.py** - Register transcribe_audio tool
286
 
287
+ ### Phase 3-5: LLM Quality
288
+ 1. **src/agent/graph.py** - Update prompts
289
+ 2. **src/tools/__init__.py** - Improve tool descriptions
290
 
291
+ ## Removed (Not Relevant)
292
 
293
+ - ~~Phase 0: Vision API validation~~ (already using Gemma 3)
294
+ - ~~Phase 1: HuggingFace vision~~ (not current priority)
295
+ - ~~Phase 2: Smoke tests~~ (already working)
296
+ - ~~Phase 3: GAIA evaluation~~ (running successfully)
297
+ - ~~Phase 5: Groq vision~~ (fallback archived)
298
+ - ~~Phase 6: Final verification~~ (premature)
299
+ - ~~Phase 7: File attachment~~ (already implemented)
300
 
301
+ ## Decision Gates
 
 
302
 
303
+ **Gate 1 (YouTube):** Does transcript solve both video questions?
304
+ - **YES:** 20% score, proceed to Phase 2
305
+ - **NO:** Try frame extraction approach
306
 
307
+ **Gate 2 (MP3):** Does transcription solve both audio questions?
308
+ - **YES:** 30% score (course target met), proceed to Phase 3
309
+ - **NO:** Try different audio model
310
 
311
+ **Gate 3 (Target):** Have we reached 30% (6/20)?
312
+ - **YES:** SUCCESS - course target met
313
+ - **NO:** Continue to Phase 4-5
 
314
 
315
  ## Next Actions
316
 
317
+ **Start with Phase 1 (YouTube):**
318
 
319
+ 1. [ ] Install youtube-transcript-api
320
+ 2. [ ] Create src/tools/youtube.py
321
+ 3. [ ] Add youtube_transcript to TOOLS
322
+ 4. [ ] Test on question #3: `a1e91b78-d3d8-4675-bb8d-62741b4b68a6`
323
+ 5. [ ] Run full evaluation
324
+ 6. [ ] Verify 20% score (4/20 correct)
325
 
326
+ **After YouTube:** Proceed to MP3 support (Phase 2)
327
 
328
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
329
 
330
+ ## Backup Options
 
 
 
331
 
332
+ If YouTube transcript doesn't work:
333
+ - **Plan B:** Extract video frames, analyze with vision tool
334
+ - **Plan C:** Skip video questions, focus on other fixes
335
 
336
+ If MP3 transcription doesn't work:
337
+ - **Plan B:** Use HuggingFace audio models
338
+ - **Plan C:** Skip audio questions, focus on LLM quality
 
 
 
README.md CHANGED
@@ -347,7 +347,7 @@ ENABLE_LLM_FALLBACK=false # Disable fallback for debugging single provider
347
 
348
  **Test Coverage:** 99 passing tests (~2min 40sec runtime)
349
 
350
- > **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](docs/gaia_submission_guide.md) for distinction between Course and Official GAIA leaderboards.
351
 
352
  ## Workflow
353
 
 
347
 
348
  **Test Coverage:** 99 passing tests (~2min 40sec runtime)
349
 
350
+ > **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for distinction between Course and Official GAIA leaderboards.
351
 
352
  ## Workflow
353
 
_template_original/README.md ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Template Final Assignment
3
+ emoji: 🕵🏻‍♂️
4
+ colorFrom: indigo
5
+ colorTo: indigo
6
+ sdk: gradio
7
+ sdk_version: 5.25.2
8
+ app_file: app.py
9
+ pinned: false
10
+ hf_oauth: true
11
+ # optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
12
+ hf_oauth_expiration_minutes: 480
13
+ ---
14
+
15
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
_template_original/app.py ADDED
@@ -0,0 +1,196 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import gradio as gr
3
+ import requests
4
+ import inspect
5
+ import pandas as pd
6
+
7
+ # (Keep Constants as is)
8
+ # --- Constants ---
9
+ DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
10
+
11
+ # --- Basic Agent Definition ---
12
+ # ----- THIS IS WERE YOU CAN BUILD WHAT YOU WANT ------
13
+ class BasicAgent:
14
+ def __init__(self):
15
+ print("BasicAgent initialized.")
16
+ def __call__(self, question: str) -> str:
17
+ print(f"Agent received question (first 50 chars): {question[:50]}...")
18
+ fixed_answer = "This is a default answer."
19
+ print(f"Agent returning fixed answer: {fixed_answer}")
20
+ return fixed_answer
21
+
22
+ def run_and_submit_all( profile: gr.OAuthProfile | None):
23
+ """
24
+ Fetches all questions, runs the BasicAgent on them, submits all answers,
25
+ and displays the results.
26
+ """
27
+ # --- Determine HF Space Runtime URL and Repo URL ---
28
+ space_id = os.getenv("SPACE_ID") # Get the SPACE_ID for sending link to the code
29
+
30
+ if profile:
31
+ username= f"{profile.username}"
32
+ print(f"User logged in: {username}")
33
+ else:
34
+ print("User not logged in.")
35
+ return "Please Login to Hugging Face with the button.", None
36
+
37
+ api_url = DEFAULT_API_URL
38
+ questions_url = f"{api_url}/questions"
39
+ submit_url = f"{api_url}/submit"
40
+
41
+ # 1. Instantiate Agent ( modify this part to create your agent)
42
+ try:
43
+ agent = BasicAgent()
44
+ except Exception as e:
45
+ print(f"Error instantiating agent: {e}")
46
+ return f"Error initializing agent: {e}", None
47
+ # In the case of an app running as a hugging Face space, this link points toward your codebase ( usefull for others so please keep it public)
48
+ agent_code = f"https://huggingface.co/spaces/{space_id}/tree/main"
49
+ print(agent_code)
50
+
51
+ # 2. Fetch Questions
52
+ print(f"Fetching questions from: {questions_url}")
53
+ try:
54
+ response = requests.get(questions_url, timeout=15)
55
+ response.raise_for_status()
56
+ questions_data = response.json()
57
+ if not questions_data:
58
+ print("Fetched questions list is empty.")
59
+ return "Fetched questions list is empty or invalid format.", None
60
+ print(f"Fetched {len(questions_data)} questions.")
61
+ except requests.exceptions.RequestException as e:
62
+ print(f"Error fetching questions: {e}")
63
+ return f"Error fetching questions: {e}", None
64
+ except requests.exceptions.JSONDecodeError as e:
65
+ print(f"Error decoding JSON response from questions endpoint: {e}")
66
+ print(f"Response text: {response.text[:500]}")
67
+ return f"Error decoding server response for questions: {e}", None
68
+ except Exception as e:
69
+ print(f"An unexpected error occurred fetching questions: {e}")
70
+ return f"An unexpected error occurred fetching questions: {e}", None
71
+
72
+ # 3. Run your Agent
73
+ results_log = []
74
+ answers_payload = []
75
+ print(f"Running agent on {len(questions_data)} questions...")
76
+ for item in questions_data:
77
+ task_id = item.get("task_id")
78
+ question_text = item.get("question")
79
+ if not task_id or question_text is None:
80
+ print(f"Skipping item with missing task_id or question: {item}")
81
+ continue
82
+ try:
83
+ submitted_answer = agent(question_text)
84
+ answers_payload.append({"task_id": task_id, "submitted_answer": submitted_answer})
85
+ results_log.append({"Task ID": task_id, "Question": question_text, "Submitted Answer": submitted_answer})
86
+ except Exception as e:
87
+ print(f"Error running agent on task {task_id}: {e}")
88
+ results_log.append({"Task ID": task_id, "Question": question_text, "Submitted Answer": f"AGENT ERROR: {e}"})
89
+
90
+ if not answers_payload:
91
+ print("Agent did not produce any answers to submit.")
92
+ return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)
93
+
94
+ # 4. Prepare Submission
95
+ submission_data = {"username": username.strip(), "agent_code": agent_code, "answers": answers_payload}
96
+ status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
97
+ print(status_update)
98
+
99
+ # 5. Submit
100
+ print(f"Submitting {len(answers_payload)} answers to: {submit_url}")
101
+ try:
102
+ response = requests.post(submit_url, json=submission_data, timeout=60)
103
+ response.raise_for_status()
104
+ result_data = response.json()
105
+ final_status = (
106
+ f"Submission Successful!\n"
107
+ f"User: {result_data.get('username')}\n"
108
+ f"Overall Score: {result_data.get('score', 'N/A')}% "
109
+ f"({result_data.get('correct_count', '?')}/{result_data.get('total_attempted', '?')} correct)\n"
110
+ f"Message: {result_data.get('message', 'No message received.')}"
111
+ )
112
+ print("Submission successful.")
113
+ results_df = pd.DataFrame(results_log)
114
+ return final_status, results_df
115
+ except requests.exceptions.HTTPError as e:
116
+ error_detail = f"Server responded with status {e.response.status_code}."
117
+ try:
118
+ error_json = e.response.json()
119
+ error_detail += f" Detail: {error_json.get('detail', e.response.text)}"
120
+ except requests.exceptions.JSONDecodeError:
121
+ error_detail += f" Response: {e.response.text[:500]}"
122
+ status_message = f"Submission Failed: {error_detail}"
123
+ print(status_message)
124
+ results_df = pd.DataFrame(results_log)
125
+ return status_message, results_df
126
+ except requests.exceptions.Timeout:
127
+ status_message = "Submission Failed: The request timed out."
128
+ print(status_message)
129
+ results_df = pd.DataFrame(results_log)
130
+ return status_message, results_df
131
+ except requests.exceptions.RequestException as e:
132
+ status_message = f"Submission Failed: Network error - {e}"
133
+ print(status_message)
134
+ results_df = pd.DataFrame(results_log)
135
+ return status_message, results_df
136
+ except Exception as e:
137
+ status_message = f"An unexpected error occurred during submission: {e}"
138
+ print(status_message)
139
+ results_df = pd.DataFrame(results_log)
140
+ return status_message, results_df
141
+
142
+
143
+ # --- Build Gradio Interface using Blocks ---
144
+ with gr.Blocks() as demo:
145
+ gr.Markdown("# Basic Agent Evaluation Runner")
146
+ gr.Markdown(
147
+ """
148
+ **Instructions:**
149
+
150
+ 1. Please clone this space, then modify the code to define your agent's logic, the tools, the necessary packages, etc ...
151
+ 2. Log in to your Hugging Face account using the button below. This uses your HF username for submission.
152
+ 3. Click 'Run Evaluation & Submit All Answers' to fetch questions, run your agent, submit answers, and see the score.
153
+
154
+ ---
155
+ **Disclaimers:**
156
+ Once clicking on the "submit button, it can take quite some time ( this is the time for the agent to go through all the questions).
157
+ This space provides a basic setup and is intentionally sub-optimal to encourage you to develop your own, more robust solution. For instance for the delay process of the submit button, a solution could be to cache the answers and submit in a seperate action or even to answer the questions in async.
158
+ """
159
+ )
160
+
161
+ gr.LoginButton()
162
+
163
+ run_button = gr.Button("Run Evaluation & Submit All Answers")
164
+
165
+ status_output = gr.Textbox(label="Run Status / Submission Result", lines=5, interactive=False)
166
+ # Removed max_rows=10 from DataFrame constructor
167
+ results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
168
+
169
+ run_button.click(
170
+ fn=run_and_submit_all,
171
+ outputs=[status_output, results_table]
172
+ )
173
+
174
+ if __name__ == "__main__":
175
+ print("\n" + "-"*30 + " App Starting " + "-"*30)
176
+ # Check for SPACE_HOST and SPACE_ID at startup for information
177
+ space_host_startup = os.getenv("SPACE_HOST")
178
+ space_id_startup = os.getenv("SPACE_ID") # Get SPACE_ID at startup
179
+
180
+ if space_host_startup:
181
+ print(f"✅ SPACE_HOST found: {space_host_startup}")
182
+ print(f" Runtime URL should be: https://{space_host_startup}.hf.space")
183
+ else:
184
+ print("ℹ️ SPACE_HOST environment variable not found (running locally?).")
185
+
186
+ if space_id_startup: # Print repo URLs if SPACE_ID is found
187
+ print(f"✅ SPACE_ID found: {space_id_startup}")
188
+ print(f" Repo URL: https://huggingface.co/spaces/{space_id_startup}")
189
+ print(f" Repo Tree URL: https://huggingface.co/spaces/{space_id_startup}/tree/main")
190
+ else:
191
+ print("ℹ️ SPACE_ID environment variable not found (running locally?). Repo URL cannot be determined.")
192
+
193
+ print("-"*(60 + len(" App Starting ")) + "\n")
194
+
195
+ print("Launching Gradio Interface for Basic Agent Evaluation...")
196
+ demo.launch(debug=True, share=False)
_template_original/requirements.txt ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ gradio
2
+ requests
brainstorming_phase1_youtube.md ADDED
@@ -0,0 +1,345 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Phase 1 Brainstorming - YouTube Transcript Support
2
+
3
+ **Date:** 2026-01-13
4
+ **Status:** Discussion Phase
5
+ **Goal:** Fix questions #3 and #5 (YouTube videos) → 20% score (4/20)
6
+
7
+ ---
8
+
9
+ ## Question Analysis
10
+
11
+ | Question | Task ID | Description | Expected Answer | Type |
12
+ | -------- | -------------------------------------- | ------------------------------- | --------------- | ------------- |
13
+ | #3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | "3" | Content-based |
14
+ | #5 | (Teal'c quote) | YouTube video - character quote | "Extremely" | Dialogue |
15
+
16
+ **Conclusion:** Both are content-based questions → transcript approach should work ✅
17
+
18
+ ---
19
+
20
+ ## Library Options
21
+
22
+ ### Option A: youtube-transcript-api ⭐ Recommended
23
+
24
+ - **Pros:** Simple API, actively maintained, no video download needed, fast
25
+ - **Cons:** May fail on videos without captions, regional restrictions
26
+ - **Use case:** Start here for simplicity
27
+
28
+ ### Option B: yt-dlp + transcript extraction
29
+
30
+ - **Pros:** More robust, can fall back to auto-generated captions
31
+ - **Cons:** Heavier dependency, slower
32
+ - **Use case:** Backup if Option A has high failure rate
33
+
34
+ ### Option C: Direct YouTube API
35
+
36
+ - **Pros:** Most control
37
+ - **Cons:** Requires API key, more complex
38
+ - **Use case:** Probably overkill for this use case
39
+
40
+ ---
41
+
42
+ ## Frame Extraction: Corrected Analysis
43
+
44
+ **Key insight:** Frame extraction itself is FAST. The "slow" parts are download + vision API processing.
45
+
46
+ ### Actual Timing Breakdown
47
+
48
+ | Step | Time (10-min video) | Notes |
49
+ | -------------------- | ------------------- | -------------------------------------- |
50
+ | **Download** | 30s - 3 min | Network I/O, one-time cost |
51
+ | **Frame extraction** | **5 - 20 sec** | ffmpeg is I/O bound, very efficient ⚡ |
52
+ | **Vision API calls** | 20s - 5 min | Sequential: 600 frames × 2-5s each |
53
+
54
+ **Reality check:** You can extract 600 frames from a local 10-min video in under 15 seconds with ffmpeg. The "slow" part is vision model API calls, not the extraction.
55
+
56
+ **Bottom line:** Frame extraction is cheap compute. Vision processing is expensive compute.
57
+
58
+ ### Comparison
59
+
60
+ | Approach | What's Fast | What's Slow | Total Time |
61
+ | -------------------- | ------------------ | ------------------------------------------- | ---------------- |
62
+ | **Transcript** | API call (1-3s) | - | **1-3 seconds** |
63
+ | **Frame Extraction** | Extraction (5-20s) | Download (30s-3min) + Vision API (20s-5min) | **1-10 minutes** |
64
+
65
+ ### Do Tools Matter?
66
+
67
+ | Tool | Speed (extraction only) | Verdict |
68
+ | ------- | ----------------------- | --------------- |
69
+ | ffmpeg | ⚡⚡⚡ Fastest (5-10s) | Best choice |
70
+ | OpenCV | ⚡⚡ Fast (10-20s) | Standard choice |
71
+ | moviepy | ⚡ Medium (20-40s) | Python overhead |
72
+
73
+ **For extraction alone:** Tools matter, but all are fast enough.
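For reference, the ffmpeg path boils down to a single command. A small helper that builds the argv is sketched below (file names are placeholders, and ffmpeg itself must be installed separately; this only constructs the command):

```python
# Sketch: build the ffmpeg argv for fixed-rate frame extraction. fps=1 yields
# ~600 frames from a 10-minute video, typically in well under 20 seconds.
def ffmpeg_frame_cmd(video_path: str, out_dir: str, fps: int = 1) -> list[str]:
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",
        f"{out_dir}/frame_%04d.png",
    ]

# Usage (requires ffmpeg on PATH and an existing frames/ directory):
# subprocess.run(ffmpeg_frame_cmd("video.mp4", "frames"), check=True)
```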
74
+
75
+ ### When Is Frame Extraction Worth It?
76
+
77
+ **Only when:**
78
+
79
+ - Question is purely visual (no audio/transcript available)
80
+ - Visual information is NOT in video thumbnail/title/description
81
+ - You have no other choice
82
+
83
+ **Examples where necessary:**
84
+
85
+ - "What color shirt is the person wearing at 2:35?"
86
+ - "Count the number of cars visible in the video"
87
+ - "Describe the visual style of the opening scene"
88
+
89
+ **For GAIA #3 and #5:**
90
+
91
+ - Both are content-based (species mentioned, dialogue)
92
+ - Transcript is still fastest (1-3s vs 1-10 min total)
93
+ - Frame extraction as fallback is viable (extraction is fast, but vision processing is slow)
94
+
95
+ **Decision:** Transcript-first approach is correct. Frame extraction is viable fallback if transcript unavailable, but total time still 1-10 min due to download + vision API.
96
+
97
+ ---
98
+
99
+ ## Fallback Strategy
100
+
101
+ **Scenario:** Video has no transcript available
102
+
103
+ **Options:**
104
+
105
+ 1. **Return error** → LLM treats as system_error, skips question ✅ Simple
106
+ 2. **Download + extract frames** → Use vision tool (heavy, slow)
107
+ 3. **Return metadata** (title, description) → LLM infers from context
108
+ 4. **Chain approach:** Transcript → Metadata → Frames
109
+
110
+ **Decision:** Start with audio-to-text fallback (Whisper on ZeroGPU) for higher success rate.
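The chain reads naturally as one function. In the control-flow sketch below the three helpers are stubs standing in for the real implementations discussed in this document (youtube-transcript-api, yt-dlp audio-only download, Whisper on ZeroGPU):

```python
# Control-flow sketch of the transcript-first fallback; helper bodies are stubs.
def get_youtube_transcript(url: str) -> str:
    return "ERROR: no captions available"   # pretend the fast path failed

def download_audio(url: str) -> str:
    return "/tmp/audio.m4a"                 # yt-dlp audio-only download goes here

def transcribe_audio(path: str) -> str:
    return "transcript via whisper"         # Whisper on ZeroGPU goes here

def youtube_to_text(video_url: str) -> str:
    text = get_youtube_transcript(video_url)    # 1-3 s when captions exist
    if text.startswith("ERROR"):
        text = transcribe_audio(download_audio(video_url))  # slower fallback
    return text

print(youtube_to_text("https://youtu.be/example"))  # transcript via whisper
```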
111
+
112
+ ---
113
+
114
+ ## Audio-to-Text Fallback: When No Transcript Available
115
+
116
+ ### The Hierarchy
117
+
118
+ ```
119
+ YouTube URL
120
+
121
+ ├─ Has transcript? ✅ → Use youtube-transcript-api (instant, 1-3 sec)
122
+
123
+ └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
124
+ ```
125
+
126
+ ### Whisper Cost Analysis
127
+
128
+ | Option | Cost | Speed | Verdict |
129
+ | --------------- | ---------- | -------------- | ------------------ |
130
+ | OpenAI API | $0.006/min | ⚡⚡⚡ Fastest | If budget OK |
131
+ | **Open Source** | **FREE** | ⚡⚡ Fast | ⭐ **Recommended** |
132
+ | HuggingFace | FREE | ⚡⚡ Fast | Good alternative |
133
+
134
+ **Decision:** Open-source Whisper (free, no API limits, works offline)
135
+
136
+ ---
137
+
138
+ ### HF Hardware: ZeroGPU ✅
139
+
140
+ | Resource | Available | Whisper Requirements | Verdict |
141
+ | ---------- | ----------- | ------------------------- | --------------------------------- |
142
+ | **CPU** | 4 vCPUs | 1+ cores | ✅ Plenty |
143
+ | **Memory** | 16 GB RAM | 1-10 GB (model-dependent) | ✅ Comfortable |
144
+ | **Disk** | 20 GB | ~150 MB - 1.5 GB | ✅ More than enough |
145
+ | **GPU** | **ZeroGPU** | Optional (faster) | ✅ **Available via subscription** |
146
+
147
+ **ZeroGPU Benefits:**
148
+
149
+ - ✅ Dynamic GPU allocation (5-10x faster than CPU)
150
+ - ✅ Can use larger models (`small`, `medium`) for better accuracy
151
+ - ✅ Still free (subscription benefit)
152
+
153
+ ### Performance: CPU vs ZeroGPU
154
+
155
+ | Model | On CPU | On ZeroGPU | Speedup |
156
+ | -------- | --------- | ------------- | ------- |
157
+ | `base` | 30-60 sec | **5-10 sec** | 5-10x |
158
+ | `small` | 1-2 min | **10-20 sec** | 5-10x |
159
+ | `medium` | 3-5 min | **20-40 sec** | 5-10x |
160
+
161
+ **For 5-minute YouTube video on ZeroGPU:**
162
+
163
+ - `base` model: ~5-10 seconds ⚡⚡⚡
164
+ - `small` model: ~10-20 seconds ⚡⚡
165
+
166
+ ### Recommended Model for ZeroGPU
167
+
168
+ | Model | Size | Accuracy | Speed (ZeroGPU) | Recommendation |
169
+ | -------- | ------ | --------- | --------------- | ---------------------- |
170
+ | `tiny` | 39 MB | Lower | ~5 sec | Fastest, less accurate |
171
+ | `base` | 74 MB | Good | ~10 sec | Good balance |
172
+ | `small` | 244 MB | Better | ~20 sec | ⭐ **Recommended** |
173
+ | `medium` | 769 MB | Very good | ~40 sec | If accuracy critical |
174
+
175
+ **Choice:** `small` model - best accuracy/speed balance on ZeroGPU
176
+
177
+ ### Implementation: Audio-to-Text Fallback
178
+
179
+ ```python
180
+ import whisper
181
+
182
+ _MODEL = None # Cache model globally
183
+
184
+ def transcribe_audio(file_path: str) -> str:
185
+ """Transcribe audio file using Whisper (ZeroGPU)."""
186
+ global _MODEL
187
+ try:
188
+ if _MODEL is None:
189
+ # ZeroGPU auto-detects GPU, no manual device specification
190
+ _MODEL = whisper.load_model("small")
191
+
192
+ result = _MODEL.transcribe(file_path)
193
+ return result["text"]
194
+ except Exception as e:
195
+ return f"ERROR: Transcription failed: {e}"
196
+ ```
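
On a ZeroGPU Space the function above would additionally need the `spaces.GPU` decorator so a GPU is allocated for the call. A hedged sketch: the try/except fallback keeps the module importable off-Spaces, and the body is a stand-in stub, not the real transcription:

```python
try:
    import spaces  # provides the ZeroGPU allocation decorator on HF Spaces
    gpu = spaces.GPU
except ImportError:
    # Local/CI fallback: a no-op decorator supporting both @gpu and @gpu(...)
    def gpu(fn=None, **kwargs):
        if fn is None:
            return lambda f: f
        return fn

@gpu
def transcribe_audio_gpu(file_path: str) -> str:
    # Stubbed for illustration; in the Space this body would be the
    # whisper.load_model("small") + transcribe() logic shown above
    return f"TRANSCRIPT:{file_path}"
```

Calling `transcribe_audio_gpu("clip.mp3")` works in both environments; only on ZeroGPU does the decorator actually request hardware.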

---

### Unified Architecture: Phase 1 + Phase 2

```
┌─────────────────────────────────────────────────────────┐
│                  Audio Transcription                    │
│              (transcribe_audio function)                │
│                     Uses Whisper                        │
│                     on ZeroGPU                          │
└─────────────────────────────────────────────────────────┘
                        ▲
    ┌───────────────────┴───────────────────┐
    │                                       │
 Phase 1                                 Phase 2
 YouTube URLs                            MP3 Files
    │                                       │
    │ 1. Try youtube-transcript-api         │
    │ 2. Fallback: download audio only      │
    │ 3. Call transcribe_audio()            │
    │                                       │
    └───────────────────┬───────────────────┘
                        │
                 Clean transcript
                        │
                        ▼
                  LLM analyzes
```

**Benefits:**

- Single audio processing codebase
- `transcribe_audio()` works for both phases
- Tested on HF ZeroGPU hardware
- Higher success rate than a skip-only approach

---

## Tool Design - LLM Integration

**Current problem:** The vision tool tries to process a YouTube URL → fails

**Proposed tool description:**

```
"Extract transcript from YouTube video URL. Use when question asks about
YouTube video content like: dialogue, speech, bird species identification,
character quotes, or any content discussed in the video. Input: YouTube URL.
Returns: Full transcript text or error message if transcript unavailable."
```

**Alternative: Special URL handling in `parse_file()`**

- Detect YouTube URLs
- Return a tool suggestion: "This is a YouTube URL. Consider using youtube_transcript tool."
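
Wired into the Stage 2 registry (metadata fields follow the description/parameters/category shape described there; the exact dict layout and the `media` category are assumptions, and the function body is a placeholder):

```python
def youtube_transcript(url: str) -> str:
    """Placeholder; the real implementation will live in src/tools/youtube.py."""
    return f"TRANSCRIPT({url})"

TOOLS = {
    "youtube_transcript": {
        "function": youtube_transcript,
        "description": (
            "Extract transcript from YouTube video URL. Use when question asks "
            "about YouTube video content like: dialogue, speech, bird species "
            "identification, character quotes. Returns full transcript text or "
            "an error message if the transcript is unavailable."
        ),
        "parameters": {"url": "YouTube video URL"},
        "category": "media",
    },
}
```

In practice this entry would be merged into the existing TOOLS dict in `src/tools/__init__.py` rather than redefining it.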

---

## Implementation Considerations

### A. Video ID Extraction

Handle various YouTube URL formats:

- `youtube.com/watch?v=VIDEO_ID`
- `youtu.be/VIDEO_ID`
- `youtube.com/shorts/VIDEO_ID`
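
A sketch of the ID extraction covering those three formats (the regexes are illustrative, not exhaustive — embed and playlist URLs would need extra patterns):

```python
import re
from typing import Optional

_YT_PATTERNS = [
    r"youtube\.com/watch\?(?:.*&)?v=([\w-]{11})",  # youtube.com/watch?v=VIDEO_ID
    r"youtu\.be/([\w-]{11})",                      # youtu.be/VIDEO_ID
    r"youtube\.com/shorts/([\w-]{11})",            # youtube.com/shorts/VIDEO_ID
]

def extract_video_id(url: str) -> Optional[str]:
    """Return the 11-character video ID, or None if the URL doesn't match."""
    for pattern in _YT_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None
```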

### B. Language Handling

- GAIA questions are in English → transcripts are likely English
- Question: Should we auto-translate, or let the LLM handle it?

### C. Transcript Format

- Raw JSON with timestamps vs. clean text
- The LLM prefers clean text without timestamps
- Question: Preserve timestamps for context?
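
youtube-transcript-api returns timestamped segments; flattening them into the clean text the LLM prefers is a one-liner. The `{"text", "start", "duration"}` segment shape matches the library's classic dict output (newer versions wrap it in objects, so treat this as an assumption):

```python
def clean_transcript(segments: list) -> str:
    """Join segment texts, dropping timestamps and collapsing whitespace."""
    parts = (" ".join(seg["text"].split()) for seg in segments)
    return " ".join(p for p in parts if p)
```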

### D. Error Types

- No transcript available
- Video private/deleted
- Rate limiting
- Regional restriction
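
These failure modes can be collapsed into the structured `ERROR:` string the agent already treats as system_error. A sketch — exception names like `TranscriptsDisabled`, `NoTranscriptFound`, and `VideoUnavailable` exist in youtube-transcript-api, but matching on class names and this particular mapping are assumptions:

```python
_ERROR_REASONS = {
    "TranscriptsDisabled": "no transcript available",
    "NoTranscriptFound": "no transcript available",
    "VideoUnavailable": "video private or deleted",
    "TooManyRequests": "rate limited",
}

def classify_transcript_error(exc: Exception) -> str:
    """Map a raised exception to the structured error string the agent expects."""
    reason = _ERROR_REASONS.get(type(exc).__name__, "unknown transcript error")
    return f"ERROR: {reason}: {exc}"
```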

---

## Testing Strategy

**Before full evaluation:**

1. **Unit test** - Test on actual GAIA YouTube URLs
2. **Manual test** - Run a single question (#3) to verify the LLM uses the tool correctly
3. **Integration test** - Verify the transcript → answer pipeline

**Question:** Do we have access to the actual YouTube URLs for pre-testing?

---

## Edge Cases

| Scenario                          | Handling                          |
| --------------------------------- | --------------------------------- |
| Multiple transcript languages     | Pick English or first available   |
| Auto-generated transcript         | Accept (less accurate but usable) |
| YouTube Shorts format             | Extract VIDEO_ID from shorts URL  |
| Segmented transcript (by speaker) | Clean to plain text               |
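
The language edge case from the table reduces to a small selection rule (a sketch over plain language codes; the real API exposes richer transcript objects):

```python
from typing import Optional

def pick_language(available: list, preferred: str = "en") -> Optional[str]:
    """Pick English (including variants like en-US), else the first available."""
    for code in available:
        if code == preferred or code.startswith(preferred + "-"):
            return code
    return available[0] if available else None
```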

---

## Recommendations

1. **Start simple:** youtube-transcript-api with clear error messages
2. **Fail gracefully:** If no transcript, return a structured error → system_error=yes
3. **Tool description:** Emphasize "YouTube video content" for LLM tool selection
4. **Manual test first:** Verify on question #3 before the full evaluation
5. **Success metric:** Both questions correct → 40% score ✅ TARGET REACHED

---

## Open Questions

- [ ] Implement a fallback to frame extraction if the transcript fails?
- [ ] Add special YouTube URL detection in `parse_file()`?
- [ ] Do we have access to the actual YouTube URLs for pre-testing?
- [ ] Simple first vs. comprehensive solution?

---

## Files to Create

- `src/tools/youtube.py` - YouTube transcript extraction
- Update `src/tools/__init__.py` - Register the youtube_transcript tool
- Update `requirements.txt` - Add youtube-transcript-api

---

## Next Steps (Discussion → Implementation)

1. [ ] Confirm the approach based on the video processing research
2. [ ] Install youtube-transcript-api
3. [ ] Create youtube.py with error handling
4. [ ] Add the tool to the TOOLS registry
5. [ ] Manual test on question #3
6. [ ] Full evaluation
7. [ ] Verify 40% score (8/20 correct)
dev/dev_260102_13_stage2_tool_development.md CHANGED
@@ -140,40 +140,47 @@ Successfully implemented 4 production-ready tools with comprehensive error handl
 
 **Deliverables:**
 
- 1. **Web Search Tool** ([src/tools/web_search.py](../src/tools/web_search.py))
  - Tavily API integration (primary, free tier)
  - Exa API integration (fallback, paid)
- - Automatic fallback if primary fails
- - 10 passing tests in [test/test_web_search.py](../test/test_web_search.py)
 
- 2. **File Parser Tool** ([src/tools/file_parser.py](../src/tools/file_parser.py))
  - PDF parsing (PyPDF2)
  - Excel parsing (openpyxl)
  - Word parsing (python-docx)
- - Text/CSV parsing (built-in open)
  - Generic `parse_file()` dispatcher
- - 19 passing tests in [test/test_file_parser.py](../test/test_file_parser.py)
 
- 3. **Calculator Tool** ([src/tools/calculator.py](../src/tools/calculator.py))
  - Safe AST-based expression evaluation
- - Whitelisted operations only (no code execution)
  - Mathematical functions (sin, cos, sqrt, factorial, etc.)
- - Security hardened (timeout, complexity limits)
- - 41 passing tests in [test/test_calculator.py](../test/test_calculator.py)
 
- 4. **Vision Tool** ([src/tools/vision.py](../src/tools/vision.py))
- - Multimodal image analysis using LLMs
  - Gemini 2.0 Flash (primary, free)
- - Claude Sonnet 4.5 (fallback, paid)
  - Image loading and base64 encoding
- - 15 passing tests in [test/test_vision.py](../test/test_vision.py)
 
- 5. **Tool Registry** ([src/tools/__init__.py](../src/tools/__init__.py))
  - Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`
  - TOOLS dict with metadata (description, parameters, category)
  - Ready for Stage 3 dynamic tool selection
 
- 6. **StateGraph Integration** ([src/agent/graph.py](../src/agent/graph.py))
  - Updated `execute_node` to load tool registry
  - Stage 2: Reports tool availability
  - Stage 3: Will add dynamic tool selection and execution
 
 **Deliverables:**
 
+ 1. **Web Search Tool** ([src/tools/web_search.py](../../agentbee/src/tools/web_search.py))
+
  - Tavily API integration (primary, free tier)
  - Exa API integration (fallback, paid)
+ - Automatic fallback if primary fails
+ - 10 passing tests in [test/test_web_search.py](../../agentbee/test/test_web_search.py)
+
+ 2. **File Parser Tool** ([src/tools/file_parser.py](../../agentbee/src/tools/file_parser.py))
 
  - PDF parsing (PyPDF2)
  - Excel parsing (openpyxl)
  - Word parsing (python-docx)
+ - Text/CSV parsing (built-in open)
  - Generic `parse_file()` dispatcher
+ - 19 passing tests in [test/test_file_parser.py](../../agentbee/test/test_file_parser.py)
+
+ 3. **Calculator Tool** ([src/tools/calculator.py](../../agentbee/src/tools/calculator.py))
 
  - Safe AST-based expression evaluation
+ - Whitelisted operations only (no code execution)
  - Mathematical functions (sin, cos, sqrt, factorial, etc.)
+ - Security hardened (timeout, complexity limits)
+ - 41 passing tests in [test/test_calculator.py](../../agentbee/test/test_calculator.py)
 
+ 4. **Vision Tool** ([src/tools/vision.py](../../agentbee/src/tools/vision.py))
+
+ - Multimodal image analysis using LLMs
  - Gemini 2.0 Flash (primary, free)
+ - Claude Sonnet 4.5 (fallback, paid)
  - Image loading and base64 encoding
+ - 15 passing tests in [test/test_vision.py](../../agentbee/test/test_vision.py)
+
+ 5. **Tool Registry** ([src/tools/__init__.py](../../agentbee/src/tools/__init__.py))
 
  - Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`
  - TOOLS dict with metadata (description, parameters, category)
  - Ready for Stage 3 dynamic tool selection
 
+ 6. **StateGraph Integration** ([src/agent/graph.py](../../agentbee/src/agent/graph.py))
  - Updated `execute_node` to load tool registry
  - Stage 2: Reports tool availability
  - Stage 3: Will add dynamic tool selection and execution
dev/dev_260102_14_stage3_core_logic.md CHANGED
@@ -64,24 +64,28 @@ Successfully implemented Stage 3 with multi-provider LLM support. Agent now perf
 
 **Deliverables:**
 
- 1. **LLM Client Module** ([src/agent/llm_client.py](../src/agent/llm_client.py))
  - Gemini implementation: 3 functions (planning, tool selection, answer synthesis)
  - Claude implementation: 3 functions (same)
  - Unified API with automatic fallback
  - 624 lines of code
 
- 2. **Updated Agent Graph** ([src/agent/graph.py](../src/agent/graph.py))
  - plan_node: Calls `plan_question()` for LLM-based planning
  - execute_node: Calls `select_tools_with_function_calling()` + executes tools + collects evidence
  - answer_node: Calls `synthesize_answer()` for factoid generation
- - Updated AgentState with new fields
 
- 3. **LLM Integration Tests** ([test/test_llm_integration.py](../test/test_llm_integration.py))
  - 8 tests covering all 3 LLM functions
- - Tests use mocked LLM responses (provider-agnostic)
  - Full workflow test: planning → tool selection → answer synthesis
 
- 4. **E2E Test Script** ([test/test_stage3_e2e.py](../test/test_stage3_e2e.py))
  - Manual test script for real API testing
  - Requires ANTHROPIC_API_KEY or GOOGLE_API_KEY
  - Tests simple math and factual questions
 
 **Deliverables:**
 
+ 1. **LLM Client Module** ([src/agent/llm_client.py](../../agentbee/src/agent/llm_client.py))
+
  - Gemini implementation: 3 functions (planning, tool selection, answer synthesis)
  - Claude implementation: 3 functions (same)
  - Unified API with automatic fallback
  - 624 lines of code
+
+ 2. **Updated Agent Graph** ([src/agent/graph.py](../../agentbee/src/agent/graph.py))
 
  - plan_node: Calls `plan_question()` for LLM-based planning
  - execute_node: Calls `select_tools_with_function_calling()` + executes tools + collects evidence
  - answer_node: Calls `synthesize_answer()` for factoid generation
+ - Updated AgentState with new fields
+
+ 3. **LLM Integration Tests** ([test/test_llm_integration.py](../../agentbee/test/test_llm_integration.py))
 
  - 8 tests covering all 3 LLM functions
+ - Tests use mocked LLM responses (provider-agnostic)
  - Full workflow test: planning → tool selection → answer synthesis
 
+ 4. **E2E Test Script** ([test/test_stage3_e2e.py](../../agentbee/test/test_stage3_e2e.py))
  - Manual test script for real API testing
  - Requires ANTHROPIC_API_KEY or GOOGLE_API_KEY
  - Tests simple math and factual questions
test/README.md CHANGED
@@ -2,13 +2,14 @@
 
 **Test Files:**
 
- - [test_agent_basic.py](test_agent_basic.py) - Unit tests for Stage 1 foundation
  - Agent initialization
  - Settings loading
  - Basic question processing
  - StateGraph structure validation
 
- - [test_stage1.py](test_stage1.py) - Stage 1 integration verification
  - Configuration validation
  - Agent initialization
  - End-to-end question processing
 
 **Test Files:**
 
+ - [test_agent_basic.py](../../agentbee/test/test_agent_basic.py) - Unit tests for Stage 1 foundation
+
  - Agent initialization
  - Settings loading
  - Basic question processing
  - StateGraph structure validation
 
+ - [test_stage1.py](../../agentbee/test/test_stage1.py) - Stage 1 integration verification
  - Configuration validation
  - Agent initialization
  - End-to-end question processing