mangubee commited on
Commit
f1b095a
·
1 Parent(s): 3dcf523

Enhance YouTube video processing with transcript and frame analysis modes

Browse files

- Updated logging standard to use Markdown format for session logs.
- Modified `run_and_submit_all` to include a new `video_mode` parameter for selecting YouTube processing mode (Transcript or Frames).
- Removed obsolete brainstorming document for YouTube transcript support.
- Added OpenCV and other dependencies for frame extraction in `pyproject.toml` and `requirements.txt`.
- Refactored `llm_client.py` to log session details in Markdown format.
- Implemented `youtube.py` to support both transcript extraction and frame analysis, with appropriate logging and error handling.
- Updated tool descriptions to reflect new functionality for analyzing video frames.
- Added backward compatibility for the `youtube_transcript` function to respect the `YOUTUBE_MODE` environment variable.

CHANGELOG.md CHANGED
@@ -1,1311 +1,581 @@
1
  # Session Changelog
2
 
3
- ## [2026-01-13] [Infrastructure] [COMPLETED] 3-Tier Folder Naming Convention
4
-
5
- **Problem:** Previous rename used `_` prefix for both runtime folders AND user-only folders, creating ambiguity.
6
-
7
- **Solution:** Implemented 3-tier naming convention to clearly distinguish folder purposes.
8
-
9
- **3-Tier Convention:**
10
- 1. **User-only** (`user_*` prefix) - Manual use, not app runtime:
11
- - `user_input/` - User testing files, not app input
12
- - `user_output/` - User downloads, not app output
13
- - `user_dev/` - Dev records (manual documentation)
14
- - `user_archive/` - Archived code/reference materials
15
-
16
- 2. **Runtime/Internal** (`_` prefix) - App creates, temporary:
17
- - `_cache/` - Runtime cache, served via app download
18
- - `_log/` - Runtime logs, debugging
19
-
20
- 3. **Application** (no prefix) - Permanent code:
21
- - `src/`, `test/`, `docs/`, `ref/` - Application folders
22
-
23
- **Folders Renamed:**
24
- - `_input/` → `user_input/` (user testing files)
25
- - `_output/` → `user_output/` (user downloads)
26
- - `dev/` → `user_dev/` (dev records)
27
- - `archive/` → `user_archive/` (archived materials)
28
-
29
- **Folders Unchanged (correct tier):**
30
- - `_cache/`, `_log/` - Runtime ✓
31
- - `src/`, `test/`, `docs/`, `ref/` - Application ✓
32
-
33
- **Updated Files:**
34
- - **test/test_phase0_hf_vision_api.py** - `Path("_output")` → `Path("user_output")`
35
- - **.gitignore** - Updated folder references and comments
36
 
37
- **Git Status:**
38
- - Old folders removed from git tracking
39
- - New folders excluded by .gitignore
40
- - Existing files become untracked
41
 
42
- **Result:** Clear 3-tier structure: user_*, _*, and no prefix
43
 
44
- ---
45
 
46
- ## [2026-01-13] [Infrastructure] [COMPLETED] Runtime Folder Naming Convention - Underscore Prefix
 
47
 
48
- **Problem:** Folders `log/`, `output/`, and `input/` didn't clearly indicate they were runtime-only storage, making it unclear which folders are internal vs permanent.
 
49
 
50
- **Solution:** Renamed all runtime-only folders to use `_` prefix, following Python convention for internal/private.
51
 
52
- **Folders Renamed:**
53
- - `log/` → `_log/` (runtime logs, debugging)
54
- - `output/` → `_output/` (runtime results, user downloads)
55
- - `input/` → `_input/` (user testing files, not app input)
56
-
57
- **Rationale:**
58
- - `_` prefix signals "internal, temporary, not part of public API"
59
- - Consistent with Python convention (`_private`, `__dunder__`)
60
- - Distinguishes runtime storage from permanent project folders
61
- - `_cache/` already followed this convention ✓
62
-
63
- **Updated Files:**
64
- - **src/agent/llm_client.py** - `Path("log")` → `Path("_log")`
65
- - **src/tools/youtube.py** - `Path("log")` → `Path("_log")`
66
- - **test/test_phase0_hf_vision_api.py** - `Path("output")` → `Path("_output")`
67
- - **.gitignore** - Updated folder references
68
-
69
- **Git Status:**
70
- - Old folders removed from git tracking
71
- - New folders excluded by .gitignore
72
- - Existing files in those folders become untracked
73
-
74
- **Result:** Clear separation between runtime storage (`_` prefix) and permanent project folders (no prefix)
75
 
76
- ---
77
 
78
- ## [2026-01-13] [Infrastructure] [COMPLETED] Session Log Consolidation - Single File Per Run
79
 
80
- **Problem:** Each question created a separate log file (`llm_context_TIMESTAMP.txt`), polluting the log/ folder with 20+ files per evaluation.
 
 
 
81
 
82
- **Solution:** Implemented session-level log file - all questions append to single file per run.
 
 
 
83
 
84
- **Implementation:**
85
 
86
- 1. **Added session log management** (`llm_client.py`)
87
 
88
- - Module-level `_SESSION_LOG_FILE` variable
89
- - `get_session_log_file()` - Creates/reuses session log file
90
- - `reset_session_log()` - For testing/new runs
 
91
 
92
- 2. **Changed log file naming**
93
 
94
- - Old: `log/llm_context_YYYYMMDD_HHMMSS.txt` (per question)
95
- - New: `log/llm_session_YYYYMMDD_HHMMSS.txt` (per evaluation run)
96
 
97
- 3. **Updated log format**
98
- - Added session header with start time
99
- - Each question wrapped in `QUESTION START` / `QUESTION END` markers
100
- - All questions append to same file
 
101
 
102
  **Modified Files:**
103
 
104
- - **src/agent/llm_client.py** (~50 lines modified)
105
- - Added session log management functions
106
- - Updated `synthesize_answer_hf()` to use session log
107
- - Added imports: `datetime`, `Path`
108
-
109
- **Result:** Single log file per evaluation instead of 20+ files
110
 
111
- ---
112
 
113
- ## [2026-01-13] [Stage 1: YouTube Support] [MILESTONE] 30% Target Achieved!
114
 
115
- **Score:** 30% (6/20 correct) - **First time hitting course target! 🎉**
116
-
117
- **Phase 1 Impact - YouTube + Audio Support:**
118
-
119
- - **Before:** 10% (2/20 correct)
120
- - **After:** 30% (6/20 correct)
121
- - **Improvement:** +20% (+4 questions fixed)
122
-
123
- **Questions Fixed by Phase 1:**
124
-
125
- 1. a1e91b78: YouTube bird species (3) ✓ - youtube_transcript + Whisper
126
- 2. 9d191bce: YouTube Teal'c quote (Extremely) ✓ - youtube_transcript + Whisper
127
- 3. 99c9cc74: Strawberry pie MP3 (ingredients) ✓ - transcribe_audio (Whisper)
128
- 4. 1f975693: Calculus MP3 (page numbers) ✓ - transcribe_audio (Whisper)
129
-
130
- **Remaining Issues:**
131
-
132
- - 3 system errors (vision NoneType, .py execution, calculator)
133
- - 10 "Unable to answer" (search evidence extraction issues)
134
-
135
- **Next Priority:**
136
-
137
- - Fix system errors (vision tool, Python execution)
138
- - Improve search answer extraction
139
- - Consider Phase 2.5 improvements
140
-
141
- ---
142
-
143
- ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Chain of Thought for LLM Synthesis Debugging
144
-
145
- **Problem:** LLM returns "Unable to answer" with no reasoning. Can't debug why synthesis fails despite having complete transcript evidence.
146
-
147
- **Solution:** Implemented Chain of Thought (CoT) format - LLM now provides reasoning before final answer.
148
-
149
- **Response Format:**
150
-
151
- ```
152
- REASONING: [Step-by-step thought process]
153
- - What information is in the evidence?
154
- - What is the question asking for?
155
- - How do you extract the answer?
156
- - Any ambiguities or uncertainties?
157
-
158
- FINAL ANSWER: [Factoid answer]
159
- ```
160
 
161
  **Implementation:**
162
 
163
- 1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
164
-
165
- - Request two-part response: REASONING + FINAL ANSWER
166
- - Clear examples showing expected format
167
- - Instructions for handling insufficient evidence
168
-
169
- 2. **Increased max_tokens** from 256 → 1024
170
 
171
- - Accommodate longer reasoning text
172
- - Allow space for both reasoning and answer
173
 
174
- 3. **Added parsing logic** to extract FINAL ANSWER
175
-
176
- - Split response on "FINAL ANSWER:" delimiter
177
- - Return only answer to agent (short for UI)
178
- - Save full response (with reasoning) to log file
179
-
180
- 4. **Enhanced log file format** (log/llm_context_TIMESTAMP.txt)
181
- - Full LLM response with reasoning
182
- - Extracted final answer
183
- - Clear separation markers
184
-
185
- **Modified Files:**
186
-
187
- - **src/agent/llm_client.py** (~50 lines modified)
188
- - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
189
- - Updated `synthesize_answer_groq()` - Same changes
190
- - Updated `synthesize_answer_claude()` - Same changes
191
-
192
- **Result:** Can now inspect LLM's thought process in log files to debug synthesis failures
193
-
194
- ---
195
-
196
- ## [2026-01-13] [Infrastructure] [COMPLETED] Logging Standard - Console + File Separation
197
-
198
- **Problem:** Logs were too verbose (14k-16k tokens), making debugging difficult and expensive.
199
-
200
- **Solution:** Separated console output (status workflow) from detailed logs (file-based).
201
-
202
- **Console Output (Compressed):**
203
-
204
- - Status updates: `[plan] ✓ 660 chars`, `[execute] 1 tool(s) selected`, `[answer] ✓ 3`
205
- - Progress indicators: `[1/1] Processing a1e91b78`, `[1/20]` for batch
206
- - Success/failure: `✓` for success, `✗` for failure
207
- - File exports: `Context saved to: log/llm_context_*.txt`
208
-
209
- **Log Files (log/ folder):**
210
-
211
- - `llm_context_TIMESTAMP.txt` - Full LLM prompts, evidence, answers
212
- - `{video_id}_transcript.txt` - Raw transcripts from YouTube/Whisper
213
- - Purpose: Post-run analysis, context preservation, debugging
214
-
215
- **Modified Files:**
216
-
217
- - **app.py** (~4 lines) - Suppress httpx, urllib3, huggingface_hub, gradio logs to WARNING
218
- - **src/agent/graph.py** (~50 lines → ~15 lines) - Compressed node logs, removed separators
219
- - **src/agent/llm_client.py** (~20 lines) - Save LLM context to log/ folder
220
- - **src/tools/youtube.py** (2 lines) - Save transcripts to log/ folder
221
- - **CLAUDE.md** (+30 lines) - Document logging standard
222
- - **.gitignore** (+3 lines) - Exclude log/ folder
223
-
224
- **Global Rule Update (~/.claude/CLAUDE.md):**
225
-
226
- - Added `log/` to standard project structure (archive/, input/, output/, log/, test/, dev/)
227
- - Removed "logs/" from prohibited folders list
228
- - Updated folder purposes table with log/ entry
229
-
230
- **Result:** 16k tokens → ~6.7k tokens (58% reduction)
231
-
232
- **Standard Structure:**
233
 
234
  ```
235
- ##_ProjectName/
236
- ├── archive/ # Previous solutions, references
237
- ├── input/ # Raw datasets, config files
238
- ├── output/ # Execution results (gitignored)
239
- ├── log/ # Runtime logs, LLM context (gitignored)
240
- ├── test/ # Test files, data, configs
241
- ├── dev/ # Dev records, problem solved
242
  ```
243
 
244
- ---
245
-
246
- ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation
247
-
248
- **Discovery:** HuggingFace Provider Suffix Behavior - Auto-Routing is Bad Practice
249
-
250
- **Finding:** Models WITHOUT `:provider` suffix work via HF auto-routing, but this is unreliable.
251
-
252
- **Test Result:**
253
-
254
- ```python
255
- # Without provider - WORKS but uses HF default routing
256
- HF_MODEL = "Qwen/Qwen2.5-72B-Instruct" # ✅ Works, but...
257
- # Response: "Test successful."
258
 
259
- # With explicit provider - RECOMMENDED
260
- HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway" # ✅ Reliable
261
  ```
262
-
263
- **Why Auto-Routing is Bad Practice:**
264
-
265
- | Issue | Impact |
266
- | ----------------------------- | --------------------------------------------------------------- |
267
- | **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together) |
268
- | **Inconsistent latency** | 2s one run, 20s next run (different provider selected) |
269
- | **No cost control** | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
270
- | **Debugging nightmare** | Can't reproduce issues when provider is unknown |
271
- | **Silent failures** | Provider might be down, HF retries with different one |
272
-
273
- **Best Practice: ALWAYS specify provider**
274
-
275
- ```python
276
- # BAD - Unreliable
277
- HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
278
-
279
- # GOOD - Explicit, predictable
280
- HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"
281
- HF_MODEL = "Qwen/Qwen2.5-72B-Instruct:cerebras"
282
- HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
283
  ```
284
 
285
- **Available Providers for Text Models:**
286
-
287
- - `:scaleway` - Fast, reliable (recommended for Llama)
288
- - `:cerebras` - Very fast (recommended for Qwen)
289
- - `:novita` - Fast, reputable
290
- - `:together` - Reliable
291
- - `:sambanova` - Fast but expensive
292
-
293
- **Action Taken:** Updated code to always use explicit `:provider` suffix
294
-
295
- ---
296
-
297
- ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration
298
-
299
- **Model Changes:**
300
-
301
- 1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
302
- 2. Llama 3.3 70B (Scaleway) → Failed synthesis
303
- 3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing
304
-
305
- **openai/gpt-oss-120b:**
306
-
307
- - OpenAI's 120B parameter open source model
308
- - Strong reasoning capability
309
- - Optimized for function calling and tool use
310
-
311
- ---
312
-
313
- ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation (Original)
314
-
315
- **Problem:** Qwen 2.5 72B fails synthesis despite having complete transcript evidence (738 chars).
316
-
317
- **Root Cause Analysis:**
318
-
319
- - Transcript contains all 3 species: "giant petrel", "emperor", "adelie" (Whisper error: "deli")
320
- - Qwen 2.5 cannot resolve transcription errors ("deli" → "adelie penguin")
321
- - Qwen 2.5 weak at entity extraction + counting from noisy text
322
- - Returns "Unable to answer" instead of reasoning through ambiguity
323
-
324
- **Transcript Quality Assessment:**
325
-
326
- - **NOT clear enough for current LLM** - requires:
327
- 1. Error tolerance ("deli" → "adelie")
328
- 2. World knowledge (Antarctic bird species)
329
- 3. Entity extraction from narrative text
330
- 4. Temporal reasoning ("simultaneously" = same scene)
331
-
332
- **Answer from transcript:** 3 species (giant petrel, emperor penguin, adelie penguin)
333
-
334
- **Solution:** Upgrade to Llama 3.3 70B Instruct (Scaleway provider)
335
-
336
- - Better reasoning and instruction following
337
- - Stronger entity extraction from noisy context
338
- - Better at handling transcription ambiguities
339
-
340
- **Modified Files:**
341
-
342
- - **src/agent/llm_client.py** (line 37) - Model: Qwen 2.5 → Llama 3.3 70B
343
-
344
- ---
345
-
346
- ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Transcript Caching for Debugging
347
-
348
- **Problem:** Transcription works (738 chars from Whisper) but LLM returns "Unable to answer". Need to inspect raw transcript to debug synthesis failure.
349
-
350
- **Solution:** Added `save_transcript_to_cache()` function to save transcripts to `_cache/{video_id}_transcript.txt` for both API and Whisper paths.
351
-
352
- **Modified Files:**
353
-
354
- - **src/tools/youtube.py** (+30 lines)
355
- - Added `save_transcript_to_cache()` function (lines 55-79)
356
- - Calls after successful API transcript retrieval (line 164)
357
- - Calls after successful Whisper transcription (line 317)
358
- - File format includes metadata: video_id, source, length, timestamp
359
-
360
- **File Format:**
361
 
362
  ```
363
- # YouTube Transcript
364
- # Video ID: L1vXCYZAYYM
365
- # Source: whisper
366
- # Length: 738 characters
367
- # Generated: 2026-01-13T02:27:...
368
-
369
- <transcript text>
370
  ```
371
 
372
- **Next Steps:**
373
-
374
- - Test on question #3 (bird species) - inspect cached transcript
375
- - Debug LLM synthesis failure if transcript contains correct answer
376
-
377
- ---
378
-
379
- ## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
380
-
381
- **Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
382
-
383
- **Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
384
 
385
  **Modified Files:**
386
 
387
- - **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
388
- - **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
389
- - **src/tools/**init**.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
390
- - **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
391
- - **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
392
-
393
- **Key Technical Decisions:**
394
-
395
- - **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
396
- - **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
397
- - **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
398
- - **Whisper model:** `small` (244MB) - best accuracy/speed balance on ZeroGPU (10-20s for 5-min video)
399
- - **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
400
-
401
- **Expected Impact:**
402
-
403
- - Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
404
- - Score: 10% → 40% (2/20 → 4/20 correct)
405
- - **Target achieved:** Exceeds 30% requirement (6/20)
406
-
407
- ---
408
-
409
- ## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
410
-
411
- **Purpose:** Understand which parts of template are FIXED (course API contract) vs CAN MODIFY (our improvements).
412
-
413
- **Critical Finding:** Course API has a FIXED test setup - questions are NOT random.
414
-
415
- ### Fixed (Course API Contract - DO NOT CHANGE)
416
-
417
- | Aspect | Value | Cannot Change |
418
- | ----------------------- | -------------------------------------- | ------------- |
419
- | **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
420
- | **Questions Route** | `GET /questions` | ❌ |
421
- | **Submit Route** | `POST /submit` | ❌ |
422
- | **Number of Questions** | **20** (always 20) | ❌ |
423
- | **Question Source** | GAIA validation set, level 1 | ❌ |
424
- | **Randomness** | **NO - Fixed set** | ❌ |
425
- | **Difficulty** | All level 1 (easiest) | ❌ |
426
- | **Filter Criteria** | By tools/steps complexity | ❌ |
427
- | **Scoring** | EXACT MATCH | ❌ |
428
- | **Target Score** | 30% = 6/20 correct | ❌ |
429
-
430
- ### The 20 Questions (ALWAYS the Same)
431
-
432
- | # | Full Task ID | Description | Tools Required |
433
- | --- | -------------------------------------- | ------------------------------ | ---------------- |
434
- | 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
435
- | 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
436
- | 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
437
- | 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
438
- | 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
439
- | 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
440
- | 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
441
- | 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
442
- | 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
443
- | 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
444
- | 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
445
- | 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
446
- | 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
447
- | 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
448
- | 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
449
- | 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
450
- | 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
451
- | 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
452
- | 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
453
- | 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |
454
-
455
- **NOT random** - same 20 questions every submission!
456
-
457
- ### Template Contract (MUST Preserve)
458
-
459
- ```python
460
- # REQUIRED - Do NOT change
461
- questions_url = f"{api_url}/questions" # Fixed route
462
- submit_url = f"{api_url}/submit" # Fixed route
463
-
464
- submission_data = {
465
- "username": username,
466
- "agent_code": agent_code,
467
- "answers": answers_payload # Fixed format
468
- }
469
- ```
470
-
471
- ### Our Additions (SAFE to Modify)
472
-
473
- | Feature | Purpose | Required? |
474
- | ------------------ | ---------------------- | ----------- |
475
- | Question Limit | Debug: run first N | ✅ Optional |
476
- | Target Task IDs | Debug: run specific | ✅ Optional |
477
- | ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
478
- | System Error Field | UX: error tracking | ✅ Optional |
479
- | File Download (HF) | Feature: support files | ✅ Optional |
480
-
481
- ### Key Learnings
482
-
483
- 1. **Question set is FIXED** - not random, always same 20
484
- 2. **API routes are FIXED** - cannot change endpoints
485
- 3. **Submission format is FIXED** - must match exactly
486
- 4. **Our additions are OPTIONAL** - debug/features we added
487
- 5. **Original template is 8777 bytes** - ours is 32722 bytes (4x larger)
488
-
489
- **Reference:** `user_io/reference/project_template_original/app.py` for original structure
490
-
491
- ---
492
 
493
- ## [2026-01-12] [Infrastructure] [COMPLETED] Original Template Reference Added
494
 
495
- **Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.
496
 
497
- **Process:**
498
 
499
- 1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
500
- 2. Removed git-specific files (`.git/` folder, `.gitattributes`)
501
- 3. Copied to project as `user_io/reference/project_template_original/` (static reference, no git)
502
- 4. Cleaned up temporary clone from Downloads
503
 
504
- **Why Static Reference:**
505
-
506
- - No `.git/` folder → won't interfere with project's git
507
- - No `.gitattributes` → clean file comparison
508
- - Pure reference material for diff/comparison
509
- - Can see exactly what changed from original
510
-
511
- **Template Original Contents:**
512
-
513
- - `app.py` (8777 bytes - original)
514
- - `README.md` (400 bytes - original)
515
- - `requirements.txt` (15 bytes - original)
516
-
517
- **Comparison Commands:**
518
-
519
- ```bash
520
- # Compare file sizes
521
- ls -lh user_io/reference/project_template_original/app.py app.py
522
-
523
- # See differences
524
- diff user_io/reference/project_template_original/app.py app.py
525
-
526
- # Count lines added
527
- wc -l app.py user_io/reference/project_template_original/app.py
528
  ```
529
 
530
- **Created Files:**
531
-
532
- - **\user_io/reference/project_template_original/** (NEW) - Static reference to original template (3 files)
533
-
534
- ---
535
-
536
- ## [2026-01-12] [Infrastructure] [COMPLETED] HuggingFace Space Renamed
537
-
538
- **Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.
539
-
540
- **Actions Taken:**
541
-
542
- 1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
543
- 2. Updated local git remote to point to new URL
544
- 3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)
545
- 4. Pulled from remote (sync after rename - already up to date)
546
- 5. Pushed commits to renamed Space: `c86df49..41ac444`
547
-
548
- **Key Learnings:**
549
-
550
- - Local folder name ≠ git repo identity (can rename locally without affecting remote)
551
- - Git remote URL determines push destination (updated to `agentbee`)
552
- - HuggingFace Space name is independent of local folder name
553
- - All work preserved through rename process
554
-
555
- **Current State:**
556
-
557
- - Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
558
- - Remote: `mangubee/agentbee` (renamed on HuggingFace)
559
- - Sync: ✅ All changes pushed
560
- - Git: All commits synced
561
- - Template: `user_io/reference/project_template_original/` added for comparison
562
-
563
- ---
564
-
565
- ## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification
566
-
567
- **Problem:** Confusion about which leaderboard we're submitting to. Mistakenly thought we needed to submit to official GAIA, but we're actually implementing the course assignment API.
568
-
569
- **Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.
570
-
571
- **Solution:** Created `docs/gaia_submission_guide.md` documenting:
572
-
573
- - **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
574
- - **Official GAIA Leaderboard** (future): 450+ questions, different submission format
575
- - API routes, submission formats, scoring differences
576
- - Development workflow for both
577
-
578
- **Key Clarifications:**
579
- | Aspect | Course | Official GAIA |
580
- |--------|--------|--------------|
581
- | API | `agents-course-unit4-scoring.hf.space` | `gaia-benchmark/leaderboard` Space |
582
- | Questions | 20 (level 1) | 450+ (all levels) |
583
- | Target | 30% (6/20) | Competitive placement |
584
- | Debug features | Target Task IDs, Question Limit | Must submit ALL |
585
- | Submission | JSON POST | File upload |
586
-
587
- **Created Files:**
588
-
589
- - **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards
590
 
591
  **Modified Files:**
592
 
593
- - **README.md** - Added note linking to submission guide
594
-
595
- ---
596
-
597
- ## [2026-01-12] [Feature] [COMPLETED] Target Specific Task IDs
598
-
599
- **Problem:** No way to run specific questions for debugging. Had to run full evaluation or use "first N" limit, which is inefficient for targeted fixes.
600
 
601
- **Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.
602
 
603
- **Implementation:**
604
 
605
- - Added `eval_task_ids` textbox in UI (line 763-770)
606
- - Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
607
- - Filtering logic: Parses comma-separated IDs, filters `questions_data`
608
- - Shows missing IDs warning if task_id not found in dataset
609
- - Overrides question_limit when provided
610
 
611
- **Usage:**
612
-
613
- ```
614
- Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
615
- ```
616
 
617
  **Modified Files:**
618
 
619
- - **app.py** (~30 lines added)
620
- - UI: `eval_task_ids` textbox
621
- - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic
622
- - `run_button.click()`: Pass task_ids to function
623
-
624
- ---
625
 
626
- ## [2026-01-12] [Bug Fix] [COMPLETED] Calculator Threading Issue
627
-
628
- **Problem:** Calculator tool fails with `ValueError: signal only works in main thread of the main interpreter` when running in Gradio's ThreadPoolExecutor context.
629
-
630
- **Root Cause:** `signal.alarm()` only works in the main thread. Our agent uses `ThreadPoolExecutor` for concurrent processing (max_workers=5).
631
-
632
- **Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).
633
-
634
- **Modified Files:**
635
 
636
- - **src/tools/calculator.py** (~15 lines modified)
637
- - `timeout()` context manager: Try/except for signal.alarm() failure
638
- - Logs warning when timeout protection disabled
639
- - Gracefully handles Windows (AttributeError for SIGALRM)
640
 
641
- ---
642
 
643
- ## [2026-01-12] [Feature] [COMPLETED] System Error Field
644
-
645
- **Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.
646
-
647
- **Solution:** Changed to boolean `system_error: yes/no` field:
648
-
649
- - `system_error: yes` - Technical/system error from our code (don't submit)
650
- - `system_error: no` - AI response (submit answer, even if wrong)
651
- - Added `error_log` field with full error details for system errors
652
 
653
  **Implementation:**
654
 
655
- - `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
656
- - Results table: "System Error" column (yes/no), "Error Log" column (when yes)
657
- - JSON export: `system_error` field, `error_log` field (when system error)
658
- - Submission logic: Only submit when `system_error == "no"`
659
-
660
- **Modified Files:**
661
-
662
- - **app.py** (~30 lines modified)
663
- - `a_determine_status()`: Returns tuple instead of string
664
- - `process_single_question()`: Uses new format, adds `error_log`
665
- - Results table: "System Error" + "Error Log" columns
666
- - `export_results_to_json()`: Include `system_error` and `error_log`
667
-
668
- ---
669
-
670
- ## [2026-01-12] [Refactoring] [COMPLETED] Fallback UI Removal
671
-
672
- **Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
673
-
674
- **Solution:** Removed all fallback-related UI elements:
675
 
676
- - Removed `enable_fallback_checkbox` from Test Question tab
677
- - Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
678
- - Removed `enable_fallback` parameter from `test_single_question()` function
679
- - Removed `enable_fallback` parameter from `run_and_submit_all()` function
680
- - Removed `ENABLE_LLM_FALLBACK` environment variable setting
681
- - Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
682
 
683
- **Modified Files:**
684
-
685
- - **app.py** (~20 lines removed)
686
- - Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
687
- - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
688
- - Updated `test_button.click()` inputs to remove checkbox reference
689
- - Updated `run_button.click()` inputs to remove checkbox reference
690
-
691
- ---
692
-
693
- ## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
694
-
695
- **Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
696
 
697
- - 4 providers to test for each feature
698
- - Complex debugging with multiple code paths
699
- - Longer, less clear error messages
700
- - Adding complexity without clear benefit
701
 
702
- **Solution:** Archive fallback mechanism, use single provider only
703
 
704
- - Removed fallback provider loop (Gemini → HF → Groq → Claude)
705
- - Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
706
- - If provider fails, error is raised immediately
707
- - Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
708
-
709
- **Benefits:**
710
-
711
- - ✅ Reduced code complexity
712
- - ✅ Faster debugging (one code path)
713
- - ✅ Clearer error messages
714
- - ✅ No double work on features
715
-
716
- **Modified Files:**
717
 
718
- - **src/agent/llm_client.py** (~25 lines removed)
719
- - Simplified `_call_with_fallback()`: Removed fallback logic
720
- - **dev/dev_260112_02_fallback_archived.md** (NEW)
721
- - Archived fallback code documentation
722
- - Migration guide for restoration if needed
723
 
724
- ---
 
725
 
726
- ## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Search Results Not Being Extracted
727
 
728
- **Problem:** Score dropped from 5% 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
729
 
730
- **Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
731
 
732
- ```python
733
- {"results": [...], "source": "tavily", "query": "...", "count": N}
734
  ```
735
-
736
- **Solution:** Handle both dict formats in evidence extraction:
737
-
738
- ```python
739
- if isinstance(result, dict):
740
- if "answer" in result:
741
- evidence.append(result["answer"]) # Vision tools
742
- elif "results" in result:
743
- # Format search results as readable text
744
- results_list = result.get("results", [])
745
- formatted = []
746
- for r in results_list[:3]:
747
- formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
748
- evidence.append("\n\n".join(formatted)) # Search tools
749
  ```
750
 
751
- **Modified Files:**
752
-
753
- - **src/agent/graph.py** (~40 lines modified)
754
- - Updated evidence extraction in primary path
755
- - Updated evidence extraction in fallback path
756
-
757
- **Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
758
-
759
- **Summary of Fixes (Session 2026-01-12):**
760
-
761
- 1. ✅ File download from HF dataset (5/5 files)
762
- 2. ✅ Absolute paths from script location
763
- 3. ✅ Evidence formatting for vision tools (dict → answer)
764
- 4. ✅ Evidence formatting for search tools (dict → formatted text)
765
-
766
- ---
767
-
768
- ## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Dict Results Not Being Extracted
769
-
770
- **Problem:** Chess vision question returned "Unable to answer" even though vision tool correctly extracted the chess position.
771
-
772
- **Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
773
-
774
- **Solution:** Extract 'answer' field from dict results before adding to evidence:
775
-
776
- ```python
777
- # Before
778
- evidence.append(f"[{tool_name}] {result}") # Dict → string representation
779
-
780
- # After
781
- if isinstance(result, dict) and "answer" in result:
782
- evidence.append(result["answer"]) # Extract answer field
783
- elif isinstance(result, str):
784
- evidence.append(result)
785
- ```
786
-
787
- **Modified Files:**
788
-
789
- - **src/agent/graph.py** (~15 lines modified)
790
- - Updated `execute_node()`: Extract 'answer' from dict results
791
- - Fixed both primary and fallback execution paths
792
-
793
- **Test Result:** Simple search questions now work. Chess question still fails due to vision tool extracting wrong turn indicator (w instead of b).
794
-
795
- **Known Issue:** Vision tool extracts "w - - 0 1" (White's turn) but question asks for Black's move. Ground truth is "Rd5" (Black move), so FEN extraction may have error.
796
-
797
- ---
798
-
799
- ## [2026-01-12] [File Download Fix] [COMPLETED] Absolute Path Fix - Vision Tool Now Works
800
 
801
- **Problem:** Chess vision question returned "Unable to answer" even though file was downloaded successfully.
 
 
802
 
803
- **Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
804
 
805
- **Solution:** Return absolute paths from `download_task_file()`
 
 
806
 
807
- - Changed: `target_path = os.path.join(save_dir, file_name)`
808
- - To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
809
- - Now tools can find files regardless of working directory
810
 
811
  **Modified Files:**
812
 
813
- - **app.py** (~3 lines modified)
814
- - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
815
-
816
- **Test Result:** Vision tool now works with absolute path - correctly analyzes chess position
817
-
818
- ---
819
 
820
- ## [2026-01-12] [File Download Fix] [COMPLETED] GAIA File API Dead End - Switch to HF Dataset
821
 
822
- **Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
823
 
824
  **Investigation:**
825
 
826
- - Checked API spec: Endpoint exists with proper documentation
827
- - Tested download: HTTP 404 "No file path associated with task_id"
828
- - Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
829
- - Confirmed via Swagger UI: Same 404 error
 
830
 
831
- **Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
832
 
833
- **Solution:** Switch from evaluation API to GAIA dataset download
834
 
835
- - Use `huggingface_hub.hf_hub_download()` to fetch files
836
- - Download to `_cache/gaia_files/` (runtime cache)
837
- - File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
838
- - Added cache checking (reuse downloaded files)
839
 
840
- **Files with attachments (5/20 questions):**
 
 
841
 
842
- - `cca530fc`: Chess position image (.png)
843
- - `99c9cc74`: Pie recipe audio (.mp3)
844
- - `f918266a`: Python code (.py)
845
- - `1f975693`: Calculus audio (.mp3)
846
- - `7bd855d8`: Menu sales Excel (.xlsx)
847
-
848
- **Modified Files:**
849
 
850
- - **app.py** (~70 lines modified)
851
- - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
852
- - Changed signature: `download_task_file(task_id, file_name, save_dir)`
853
- - Added `huggingface_hub` import with cache checking
854
- - Default directory: `_cache/gaia_files/` (runtime cache, not git)
855
- - Flat file structure: `_cache/gaia_files/{file_name}`
856
- - **app.py** (~5 lines modified)
857
- - Updated `process_single_question()`: Pass `file_name` to download function
858
 
859
- **Known Limitations:**
860
 
861
- - Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
862
- - `.mp3` audio files still unsupported
863
- - `.py` code execution still unsupported
864
 
865
- **Next Steps:**
866
 
867
- 1. Test new download implementation
868
- 2. Expand tool support for .mp3 (audio transcription)
869
- 3. Expand tool support for .py (code execution)
870
 
871
- ---
872
 
873
- ## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision LLM Validated - Ready for GAIA
 
 
 
874
 
875
- **Problem:** Need to validate HF vision works before complex GAIA evaluation.
876
 
877
- **Solution:** Single smoke test with simple red square image.
 
878
 
879
- **Result:** PASSED
 
880
 
881
- - Model: `google/gemma-3-27b-it:scaleway`
882
- - Answer: "The image is a solid, uniform field of red color..."
883
- - Provider routing: Working correctly
884
- - Settings integration: Fixed
885
 
886
- **Modified Files:**
 
 
 
887
 
888
- - **src/config/settings.py** (~5 lines added)
889
- - Added `HF_TOKEN` and `HF_VISION_MODEL` config
890
- - Added `hf_token` and `hf_vision_model` to Settings class
891
- - Updated `validate_api_keys()` to include huggingface
892
- - **test/test_smoke_hf_vision.py** (NEW - ~50 lines)
893
- - Simple smoke test script
894
- - Tests basic image description
895
 
896
- **Bug Fixes:**
 
897
 
898
- - Removed unsupported `timeout` parameter from `chat_completion()`
899
 
900
- **Next Steps:** Phase 3 - GAIA evaluation with HF vision
 
901
 
902
- ---
903
 
904
- ## [2026-01-11] [Phase 1: Implementation] [COMPLETED] HF Vision Integration - Routing Fixed
 
 
905
 
906
- **Problem:** Vision tool hardcoded to Gemini Claude, ignoring UI LLM selection.
907
 
908
- **Solution:**
909
 
910
- - Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
911
- - Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
912
- - Each provider fails independently (NO fallback chains during testing)
913
 
914
- **Modified Files:**
915
 
916
- - **src/tools/vision.py** (~120 lines added/modified)
917
- - Added `analyze_image_hf()` function with retry logic
918
- - Updated `analyze_image()` routing with provider selection
919
- - Added HF_VISION_MODEL and HF_TIMEOUT config
920
- - **.env.example** (~4 lines added)
921
- - Documented HF_TOKEN and HF_VISION_MODEL settings
922
 
923
- **Validated Models (Phase 0 Extended Testing):**
 
 
924
 
925
- | Rank | Model | Provider | Speed | Notes |
926
- | ---- | -------------------------------- | -------- | ----- | ------------------------------ |
927
- | 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
928
- | 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
929
- | 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
930
- | 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
931
 
932
- **Failed Models (not vision-capable):**
 
 
933
 
934
- - `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
935
- - `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
936
- - `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")
937
- - `moonshotai/Kimi-K2-Instruct-0905:novita` - 400 Bad request
938
 
939
- **Next Steps:** Smoke tests (Phase 2) to validate integration
 
 
 
940
 
941
- ---
942
 
943
- ## [2026-01-11] [Phase 0 Extended] [COMPLETED] Additional Vision Models Tested - Google Gemma 3 Selected
944
 
945
- **Problem:** Needed to find more reputable vision models (aya-vision-32b brand unknown to user).
946
 
947
- **Solution:** Tested user-requested models with provider routing.
948
 
949
- **Test Results:**
950
 
951
- **Working Models:**
 
 
 
952
 
953
- - `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
954
- - `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
955
- - `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand
956
 
957
- **Failed Models:**
 
 
958
 
959
- - `zai-org/GLM-4.7:cerebras` - Text-only model (422: "image_url not supported")
960
- - `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
961
- - `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
962
- - `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request
963
 
964
- **Output Files:**
965
 
966
- - `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
967
- - `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
968
- - `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test
969
- - `output/phase0_vision_validation_20260111_164945.json` - Gemma-3-27B test
970
 
971
- **Decision:** Use `google/gemma-3-27b-it:scaleway` for production (fastest, most reputable brand)
972
 
973
- ---
974
 
975
- ## [2026-01-07] [Phase 0: API Validation] [COMPLETED] HF Inference Vision Support - GO Decision
 
 
976
 
977
- **Problem:** Needed to validate HF Inference API supports vision models before implementation.
978
 
979
- ---
980
 
981
- ### Knowledge Updates
982
 
983
- **Solution - Phase 0 Validation Results:**
984
 
985
- **✅ GO Decision - Proceed to Phase 1**
 
 
 
 
986
 
987
- **Final Working Model (Recommended):**
988
 
989
- - **CohereLabs/aya-vision-32b** (32B params, Cohere provider) - **PRODUCTION READY**
990
- - Handles small images (1KB base64): ~1-3 seconds
991
- - Handles large images (2.8MB base64): ~10 seconds, no timeout
992
- - Excellent quality: Detailed scene understanding, object identification, spatial relationships
993
- - Sample response on workspace image: "The image depicts a serene workspace setup on a wooden desk...white ceramic mug filled with dark liquid...silver laptop...rolled-up paper secured with rubber band..."
994
 
995
- **Partially Working Models (Timeout Issues with Large Images):**
996
 
997
- 1. **Qwen/Qwen3-VL-8B-Instruct** (8B params, Novita provider) - ⚠️ Conditionally working
998
- - Small images (1KB): ✅ Works
999
- - Large images (2.8MB): ❌ 504 Gateway Timeout (>120 seconds)
1000
- - Only works with models that have `?inference_provider=` in URL
1001
- 2. **baidu/ERNIE-4.5-VL-424B-A47B-Base-PT** (424B params, Novita provider) - ⚠️ Conditionally working
1002
- - Small images (1KB): ✅ Works
1003
- - Large images (2.8MB): ❌ 504 Gateway Timeout (>120 seconds)
1004
 
1005
- **Failed Models:**
1006
 
1007
- 1. `deepseek-ai/DeepSeek-OCR` - Not exposed via Inference API (requires local GPU)
1008
- - Attempted both chat_completion and image_to_text endpoints
1009
- - Error: "Task 'image-to-text' not supported for provider 'novita'"
1010
- - Solution: Must use transformers library locally (not serverless API)
1011
- 2. `CohereLabs/command-a-vision-07-2025` - 429 rate limit (try later)
1012
- 3. `zai-org/GLM-4.1V-9B-Thinking` - Provider doesn't support model
1013
- 4. `microsoft/Phi-3.5-vision-instruct` - Not enabled for serverless
1014
- 5. `meta-llama/Llama-3.2-11B-Vision-Instruct` - Not enabled for serverless
1015
- 6. `Qwen/Qwen2-VL-72B-Instruct` - Not enabled for serverless
1016
 
1017
- **Working Format:** Base64 encoding only
1018
 
1019
- - Base64: Works for all providers
1020
- - ❌ File path (file:// URL): Failed with 400 Bad Request
1021
- - ❌ Direct image parameter: API doesn't support
1022
 
1023
- **Critical Discovery - Large Image Handling:**
1024
 
1025
- | Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
1026
- | -------------------- | ----------------- | ------------------- | ---------------------------- |
1027
- | aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
1028
- | Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
1029
- | ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
1030
 
1031
- **API Behavior:**
 
 
 
 
 
1032
 
1033
- - Response format: Standard chat completion with content field
1034
- - Rate limits: 429 possible (command-a-vision hit this)
1035
- - Errors: Clear error messages in JSON format
1036
- - Latency: 1-3 seconds for small images, 10 seconds for large images (aya only)
1037
- - Timeout: 120 seconds default (Novita times out on large images)
1038
 
1039
- **Key Learning - Inference Provider Pattern:**
 
 
1040
 
1041
- - Models with `?inference_provider=PROVIDER` in URL ARE accessible via serverless API
1042
- - Example: `huggingface.co/Qwen/Qwen3-VL-8B-Instruct?inference_provider=novita` ✅
1043
- - Models without provider parameter (DeepSeek-OCR) require local deployment
1044
 
1045
- **Recommendation for Phase 1:**
 
1046
 
1047
- - Primary: `CohereLabs/aya-vision-32b` (handles all image sizes, Cohere provider reliable)
1048
- - Format: Base64 encode images in messages array
1049
- - Consider image preprocessing (resize/compress) for non-Cohere providers
1050
- - Set 120+ second timeouts for large images
1051
 
1052
- **HF Pro Account Context:**
1053
 
1054
- - Free accounts: $0.10/month credits, NO pay-as-you-go
1055
- - Pro accounts ($9/mo): $2.00/month credits, CAN use pay-as-you-go after credits
1056
- - Provider charges pass-through directly (no HF markup)
1057
- - Pro required for production workloads with uninterrupted access
1058
 
1059
  **Next Steps:**
1060
 
1061
- - Phase 1: Implement `analyze_image_hf()` using aya-vision-32b
1062
- - Phase 1: Fix vision tool routing to respect `LLM_PROVIDER`
1063
- - Phase 1: Add image preprocessing for large files (resize if >1MB)
1064
 
1065
- **Test Images:**
1066
 
1067
- - `test/fixtures/test_image_red_square.jpg` - Simple test image (825 bytes)
1068
- - `test/fixtures/test_image_real.png` - Complex workspace photo (2.1MB file, 2.8MB base64)
1069
 
1070
- ---
1071
 
1072
- ### Code Changes
1073
 
1074
- **Modified Files:**
 
 
1075
 
1076
- - **test/test_phase0_hf_vision_api.py** (NEW - ~400 lines)
1077
- - Phase 0 validation script
1078
- - Tests multiple vision models
1079
- - Tests multiple image formats
1080
- - Exports results to JSON
1081
- - OCR model testing support (image_to_text endpoint)
1082
 
1083
- **Output Files:**
 
 
1084
 
1085
- - **output/phase0_vision_validation_20260107_174401.json** - Initial test (red square image)
1086
- - **output/phase0_vision_validation_20260107_174146.json** - First attempt (no models worked)
1087
- - **output/phase0_vision_validation_20260107_182113.json** - DeepSeek-OCR test
1088
- - **output/phase0_vision_validation_20260107_182155.json** - Qwen3-VL discovery
1089
- - **output/phase0_vision_validation_20260107_184839.json** - Real image test (workspace photo)
1090
-
1091
- **Next Steps:**
1092
-
1093
- - Phase 1: Implement `analyze_image_hf()` using aya-vision-32b
1094
- - Phase 1: Fix vision tool routing to respect `LLM_PROVIDER`
1095
- - Phase 1: Add image preprocessing for large files (resize if >1MB)
1096
-
1097
- **Test Images:**
1098
-
1099
- - `test/fixtures/test_image_red_square.jpg` - Simple test image (825 bytes)
1100
- - `test/fixtures/test_image_real.png` - Complex workspace photo (2.1MB file, 2.8MB base64)
1101
 
1102
- ---
 
 
 
 
1103
 
1104
- ## [2026-01-06] [Plan Revision] [COMPLETED] HuggingFace Vision Integration Plan - Corrected Architecture
1105
 
1106
- **Problem:** Initial plan had critical gaps that would waste implementation time:
1107
 
1108
- - Missing Phase 0 API validation (could implement non-functional approach)
1109
- - Included fallback logic during testing (defeats isolation purpose)
1110
- - Wrong model selection order (large → small, should be small → large)
1111
- - No smoke tests before GAIA (would debug complex questions with broken integration)
1112
- - Premature cost optimization
1113
 
1114
- **Solution - Plan Corrections Applied:**
1115
 
1116
- 1. **Added Phase 0: API Validation (CRITICAL)**
1117
 
1118
- - Test HF Inference API with vision models BEFORE implementation
1119
- - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
1120
- - Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
1121
- - Time saved: Prevents 2-3 hours implementing non-functional code
1122
 
1123
- 2. **Removed Fallback Logic from Testing**
1124
 
1125
- - Each provider fails independently with clear error message
1126
- - NO fallback chains (HF → Gemini → Claude) during testing
1127
- - Philosophy: Build capability knowledge, don't hide problems
1128
- - Log exact failure reasons for debugging
1129
 
1130
- 3. **Added Smoke Tests (Phase 2)**
1131
 
1132
- - 4 tests before GAIA: description, OCR, counting, single GAIA question
1133
- - Decision gate: ≥3/4 must pass before full evaluation
1134
- - Prevents debugging chess positions when basic integration broken
1135
 
1136
- 4. **Added Decision Gates**
1137
 
1138
- - Gate 1 (Phase 0): API validation → GO/NO-GO
1139
- - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
1140
- - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate
1141
 
1142
- 5. **Added Backup Strategy Documentation**
 
 
 
1143
 
1144
- - Option C: HF Spaces deployment (custom endpoint)
1145
- - Option D: Local transformers library (no API)
1146
- - Option E: Hybrid (HF text + Gemini/Claude vision)
1147
 
1148
- 6. **Separate Results Per Provider**
1149
- - Export format: `gaia_results_hf_TIMESTAMP.json` (HF only)
1150
- - Build capability matrix: which provider for which tasks
1151
- - No combined/fallback results during testing
1152
 
1153
- **Modified Files:**
1154
 
1155
- - **PLAN.md** (~200 lines restructured)
1156
- - Phase 0: API Validation (NEW)
1157
- - Phase 1: Implementation (revised - no fallbacks)
1158
- - Phase 2: Smoke Tests (NEW)
1159
- - Phase 3: GAIA Evaluation (revised)
1160
- - Phase 4: Media Processing (YouTube, audio)
1161
- - Phase 5: Groq Integration (future)
1162
- - Phase 6: Final Verification
1163
- - Added: Backup Strategy Options section
1164
- - Added: Decision Gates Summary section
1165
- - Updated: Files to Modify (10 files total)
1166
- - Updated: Success Criteria (per-phase)
1167
-
1168
- **Key Changes Summary:**
1169
-
1170
- | Before | After |
1171
- | ----------------------------- | ----------------------------------- |
1172
- | Jump to implementation | Phase 0: Validate API first |
1173
- | Fallback chains | No fallbacks, fail independently |
1174
- | Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
1175
- | Direct to GAIA | Smoke tests → GAIA |
1176
- | No backup plan | 3 backup options documented |
1177
- | Single success criteria | Per-phase criteria + decision gates |
1178
-
1179
- **Benefits:**
1180
 
1181
- - ✅ Prevents wasted implementation time on non-functional approach
1182
- - ✅ Clear debugging with isolated provider failures
1183
- - ✅ Faster iteration with small models
1184
- - ✅ Risk mitigation with decision gates
1185
- - ✅ Backup options if HF API doesn't support vision
1186
 
1187
- **Next Steps:** Proceed to Phase 0 (API validation) when implementation starts
 
 
 
1188
 
1189
- ---
1190
 
1191
- ## [2026-01-06] [Stage 5 Investigation] [COMPLETED] Vision Tool Ignores UI LLM Selection - Root Cause of 0% Accuracy
1192
 
1193
- **Problem:** Stage 5 claimed 25% accuracy (5/20 correct) but actual results show 0% accuracy (0/20 correct). User selected HuggingFace in UI but vision questions still failing.
1194
 
1195
- **Investigation Findings:**
1196
 
1197
- **Ground Truth Analysis (output/gaia_results_20260105_203102.json):**
1198
 
1199
- - Actual score: 0% (0/20 correct) - complete failure
1200
- - Stage 5 dev record claimed: 25% (5/20 correct) - false success claim
1201
- - Regression from baseline 10% → 0%
1202
 
1203
- **Failure Pattern Breakdown:**
 
 
 
1204
 
1205
- 1. **Vision tool failures:** 40% of questions (8/20)
1206
- - Error: "Vision analysis failed - Gemini and Claude both failed"
1207
- - Questions: Chess position, YouTube videos, audio file parsing
1208
- 2. **Calculator threading error:** 5% of questions (1/20)
1209
- - Error: "ValueError: signal only works in main thread of the main interpreter"
1210
- - Root cause: `signal.alarm()` doesn't work in Gradio async context
1211
- 3. **Wrong answers:** 55% of questions (11/20)
1212
- - Tools work, but answer synthesis produces incorrect factoids
1213
- - Example: Mercedes Sosa albums - submitted "4", correct "3"
1214
 
1215
- **Root Cause - Vision Tool Bug:**
1216
 
1217
- **Critical bug in `src/tools/vision.py:303-339`:**
1218
 
1219
- - Vision tool HARDCODED to always try Gemini → Claude fallback
1220
- - Never checks `os.getenv("LLM_PROVIDER")` setting
1221
- - Ignores UI LLM selection completely
1222
- - Other tools (planning, tool selection, synthesis) correctly respect UI selection
1223
 
1224
- **Code Evidence:**
1225
 
1226
- ```python
1227
- def analyze_image(image_path: str, question: Optional[str] = None) -> Dict:
1228
- # MISSING: No check for os.getenv("LLM_PROVIDER")
 
 
1229
 
1230
- # HARDCODED: Always try Gemini first
1231
- if settings.google_api_key:
1232
- return analyze_image_gemini(image_path, question)
1233
 
1234
- # HARDCODED: Always fallback to Claude
1235
- if settings.anthropic_api_key:
1236
- return analyze_image_claude(image_path, question)
1237
- ```
1238
 
1239
- **Impact:**
1240
 
1241
- - When user selects "HuggingFace" in UI:
1242
- - ✅ Planning uses HuggingFace
1243
- - ✅ Tool selection uses HuggingFace
1244
- - ❌ Vision still calls Gemini/Claude (ignores selection)
1245
- - Result: 40% of questions auto-fail due to Gemini/Claude quota exhaustion
1246
 
1247
- **Additional Issue:**
1248
 
1249
- - HuggingFace Inference API free tier doesn't support multimodal vision analysis
1250
- - Even if bug fixed, HF can't handle vision questions
 
 
 
1251
 
1252
- **Modified Files:**
1253
 
1254
- - **NONE** (investigation only - no code changes yet)
1255
 
1256
- **Next Steps Identified:**
1257
 
1258
- 1. Fix vision tool to respect `LLM_PROVIDER` setting
1259
- 2. Add proper error handling when HF selected for vision questions
1260
- 3. Fix calculator threading issue (`signal.alarm()` in async context)
1261
- 4. Improve answer synthesis prompts
1262
- 5. Add verification protocol: MUST verify claims with actual JSON output
1263
 
1264
- **Current Baseline:** 0% (need to fix regressions before optimizing)
1265
- **Target:** 30% minimum (6/20 questions)
1266
 
1267
- ---
 
 
 
1268
 
1269
- ## [2026-01-05] [Runtime Cache Folder] [COMPLETED] Eliminate exports/ Redundancy
1270
 
1271
- **Problem:**
 
1272
 
1273
- - Environment-dependent paths: `~/Downloads` (local) vs `./exports` (HF Spaces)
1274
- - `exports/` folder name confusing - looked like user-facing folder
1275
- - Files visible in HF UI when committed to git
1276
- - User couldn't locate where files were saved
1277
 
1278
- **Solution:**
1279
-
1280
- - Single `_cache/` folder for all environments (local, HF Spaces)
1281
- - Name clearly indicates internal runtime storage (not user-accessible via file browser)
1282
- - Files served via app download button, not HF Spaces UI
1283
- - Added to .gitignore to keep runtime files out of git
1284
 
1285
- **Modified Files:**
1286
 
1287
- - **app.py** (~10 lines modified)
1288
 
1289
- - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
1290
- - Changed: `exports/` → `_cache/`
1291
- - Updated docstring: "All environments: Saves to ./\_cache/gaia_results_TIMESTAMP.json"
1292
- - Updated comment: "Save to \_cache/ folder (internal runtime storage, not accessible via HF UI)"
1293
 
1294
- - **.gitignore** (~3 lines added)
1295
- - Added `_cache/` to ignore list
1296
- - Added comment explaining runtime cache behavior
 
 
 
1297
 
1298
- **Benefits:**
1299
 
1300
- - ✅ Single location for all environments (no environment detection)
1301
- - Clear naming indicates internal storage (not user-facing)
1302
- - ✅ Files accessible via download button
1303
- - ✅ Not visible in HF Spaces file browser
1304
- - ✅ Not committed to git
1305
 
1306
- **File Lifecycle on HF Spaces:**
1307
 
1308
- - Files persist on server between runs (accumulate in `_cache/`)
1309
- - Wiped clean on redeploy (container rebuild)
1310
- - Standard container behavior: runtime storage is temporary
1311
- - No manual cleanup needed (redeploy handles it)
 
1
  # Session Changelog
2
 
3
+ ## [2026-01-14] [Enhancement] [COMPLETED] Unified Log Format - Markdown Standard
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
 
5
+ **Problem:** Inconsistent log formats across different components, wasteful `====` separators.
 
 
 
6
 
7
+ **Solution:** Standardize all logs to Markdown format with clean structure.
8
 
9
+ **Unified Log Standard:**
10
 
11
+ ```markdown
12
+ # Title
13
 
14
+ **Key:** value
15
+ **Key:** value
16
 
17
+ ## Section
18
 
19
+ Content
20
+ ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
21
 
22
+ **Files Updated:**
23
 
24
+ 1. **LLM Session Logs** (`llm_session_*.md`):
25
 
26
+ - Header: `# LLM Synthesis Session Log`
27
+ - Questions: `## Question [timestamp]`
28
+ - Sections: `### Evidence & Prompt`, `### LLM Response`
29
+ - Code blocks: triple backticks
30
 
31
+ 2. **YouTube Transcript Logs** (`{video_id}_transcript.md`):
32
+ - Header: `# YouTube Transcript`
33
+ - Metadata: `**Video ID:**`, `**Source:**`, `**Length:**`
34
+ - Content: `## Transcript`
35
 
36
+ **Note:** No horizontal rules (`---`) - already banned in global CLAUDE.md, breaks collapsible sections
37
 
38
+ **Token Savings:**
39
 
40
+ | Style | Tokens per separator | 20 questions |
41
+ | ----------------- | -------------------- | ------------ |
42
+ | `====` x 80 chars | ~40 tokens | ~800 tokens |
43
+ | `##` heading | ~2 tokens | ~40 tokens |
44
 
45
+ **Savings:** ~760 tokens per session (95% reduction)
46
 
47
+ **Benefits:**
 
48
 
49
+ - Collapsible headings in all Markdown editors
50
+ - Consistent structure across all log files
51
+ - Token-efficient for LLM processing
52
+ - Readable in both rendered and plain text
53
+ - ✅ `.md` extension for proper syntax highlighting
54
 
55
  **Modified Files:**
56
 
57
+ - `src/agent/llm_client.py` (LLM session logs)
58
+ - `src/tools/youtube.py` (transcript logs)
59
+ - `CLAUDE.md` (added unified log format standard)
 
 
 
60
 
61
+ ## [2026-01-14] [Cleanup] [COMPLETED] Session Log Optimization - Reduce Static Content Redundancy
62
 
63
+ **Problem:** System prompt (~30 lines) was written for every question (20x = 600 lines of redundant text).
64
 
65
+ **Solution:** Write system prompt once on first question, skip for subsequent questions.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
66
 
67
  **Implementation:**
68
 
69
+ - Added `_SYSTEM_PROMPT_WRITTEN` flag to track if system prompt was logged
70
+ - First question includes full SYSTEM PROMPT section
71
+ - Subsequent questions only show dynamic content (question, evidence, response)
 
 
 
 
72
 
73
+ **Log format comparison:**
 
74
 
75
+ Before (every question):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
  ```
78
+ QUESTION START
79
+ SYSTEM PROMPT: [30 lines repeated]
80
+ USER PROMPT: [dynamic]
81
+ LLM RESPONSE: [dynamic]
 
 
 
82
  ```
83
 
84
+ After (first question):
 
 
 
 
 
 
 
 
 
 
 
 
 
85
 
 
 
86
  ```
87
+ SYSTEM PROMPT (static - used for all questions): [30 lines]
88
+ QUESTION [...]
89
+ EVIDENCE & PROMPT: [dynamic]
90
+ LLM RESPONSE: [dynamic]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
91
  ```
92
 
93
+ After (subsequent questions):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94
 
95
  ```
96
+ QUESTION [...]
97
+ EVIDENCE & PROMPT: [dynamic]
98
+ LLM RESPONSE: [dynamic]
 
 
 
 
99
  ```
100
 
101
+ **Result:** ~570 lines less redundancy per 20-question evaluation.
 
 
 
 
 
 
 
 
 
 
 
102
 
103
  **Modified Files:**
104
 
105
+ - `src/agent/llm_client.py` (~30 lines modified - added flag, conditional logging)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
106
 
107
+ ## [2026-01-14] [Bugfix] [COMPLETED] Session Log Synchronization - Atomic Per-Question Logging
108
 
109
+ **Problem:** When processing multiple questions, LLM responses were written out of order relative to their questions, causing mismatched prompts/responses in session logs.
110
 
111
+ **Root Cause:** `synthesize_answer_hf()` wrote QUESTION START immediately, but appended LLM RESPONSE later after API call completed. With concurrent processing, responses finished in different order.
112
 
113
+ **Solution:** Buffer complete question block in memory, write atomically when response arrives:
 
 
 
114
 
115
+ ```python
116
+ # Before (broken):
117
+ write_question_start() # immediate
118
+ api_response = call_llm()
119
+ write_llm_response() # later, out of order
120
+
121
+ # After (fixed):
122
+ question_header = buffer_question_start()
123
+ api_response = call_llm()
124
+ complete_block = question_header + response + end
125
+ write_atomic(complete_block) # all at once
 
 
 
 
 
 
 
 
 
 
 
 
 
126
  ```
127
 
128
+ **Result:** Each question block is self-contained, no mismatched prompts/responses.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
129
 
130
  **Modified Files:**
131
 
132
+ - `src/agent/llm_client.py` (~40 lines modified - synthesize_answer_hf function)
 
 
 
 
 
 
133
 
134
+ ## [2026-01-13] [Cleanup] [COMPLETED] LLM Session Log Format - Removed Duplicate Evidence
135
 
136
+ **Problem:** Evidence appeared twice in session log - once in USER PROMPT section, again in EVIDENCE ITEMS section.
137
 
138
+ **Solution:** Removed standalone EVIDENCE ITEMS section, kept evidence in USER PROMPT only.
 
 
 
 
139
 
140
+ **Rationale:** USER PROMPT shows what's actually sent to the LLM (system + user messages together).
 
 
 
 
141
 
142
  **Modified Files:**
143
 
144
+ - `src/agent/llm_client.py` - Removed duplicate logging section (lines 1189-1194 deleted)
 
 
 
 
 
145
 
146
+ **Result:** Cleaner logs, no duplication
 
 
 
 
 
 
 
 
147
 
148
+ ## [2026-01-13] [Feature] [COMPLETED] YouTube Frame Processing Mode - Visual Video Analysis
 
 
 
149
 
150
+ **Problem:** Transcript mode captures audio but misses visual information (objects, scenes, actions).
151
 
152
+ **Solution:** Implemented frame extraction and vision-based video analysis mode.
 
 
 
 
 
 
 
 
153
 
154
  **Implementation:**
155
 
156
+ **1. Frame Extraction (`src/tools/youtube.py`):**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
157
 
158
+ - `download_video()` - Downloads video using yt-dlp
159
+ - `extract_frames()` - Extracts N frames at regular intervals using OpenCV
160
+ - `analyze_frames()` - Analyzes frames with vision models
161
+ - `process_video_frames()` - Complete frame processing pipeline
162
+ - `youtube_analyze()` - Unified API with mode parameter
 
163
 
164
+ **2. CONFIG Settings:**
 
 
 
 
 
 
 
 
 
 
 
 
165
 
166
+ - `FRAME_COUNT = 6` - Number of frames to extract
167
+ - `FRAME_QUALITY = "worst"` - Download quality (faster)
 
 
168
 
169
+ **3. UI Integration (`app.py`):**
170
 
171
+ - Added radio button: "YouTube Processing Mode"
172
+ - Choices: "Transcript" (default) or "Frames"
173
+ - Sets `YOUTUBE_MODE` environment variable
 
 
 
 
 
 
 
 
 
 
174
 
175
+ **4. Updated Dependencies:**
 
 
 
 
176
 
177
+ - `requirements.txt` - Added `opencv-python>=4.8.0`
178
+ - `pyproject.toml` - Added via `uv add opencv-python`
179
 
180
+ **5. Tool Description Update (`src/tools/__init__.py`):**
181
 
182
+ - Updated `youtube_transcript` description to mention both modes
183
 
184
+ **Architecture:**
185
 
 
 
186
  ```
187
+ youtube_transcript() → reads YOUTUBE_MODE env
188
+ ├─ "transcript" audio/subtitle extraction
189
+ └─ "frames" → video download → extract 6 frames → vision analysis
 
 
 
 
 
 
 
 
 
 
 
190
  ```
191
 
192
+ **Test Result:**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
193
 
194
+ - Successfully processed video with 6 frames analyzed
195
+ - Each frame analyzed with vision model, combined output returned
196
+ - Frame timestamps: 0s, 20s, 40s, 60s, 80s, 100s (spread evenly)
197
 
198
+ **Known Limitation:**
199
 
200
+ - Frame sampling is random (regular intervals)
201
+ - Low probability of capturing transient events (~5.5% for 108s video)
202
+ - Future: Hybrid mode using timestamps to guide frame extraction (documented in `user_io/knowledge/hybrid_video_audio_analysis.md`)
203
 
204
+ **Status:** Implemented and tested, ready for use
 
 
205
 
206
  **Modified Files:**
207
 
208
+ - `src/tools/youtube.py` (~200 lines added - frame extraction + analysis)
209
+ - `app.py` (~5 lines modified - UI toggle)
210
+ - `requirements.txt` (1 line added - opencv-python)
211
+ - `src/tools/__init__.py` (1 line modified - tool description)
 
 
212
 
213
+ ## [2026-01-13] [Investigation] [OPEN] HF Spaces vs Local Performance Discrepancy
214
 
215
+ **Problem:** HF Space deployment shows significantly lower scores (5%) than local execution (20-30%).
216
 
217
  **Investigation:**
218
 
219
+ | Environment | Score | System Errors | NoneType Errors |
220
+ | ---------------- | ------ | ------------- | --------------- |
221
+ | **Local** | 20-30% | 3 (15%) | 1 |
222
+ | **HF ZeroGPU** | 5% | 5 (25%) | 3 |
223
+ | **HF CPU Basic** | 5% | 5 (25%) | 3 |
224
 
225
+ **Verified:** Code is 100% identical (cloned HF Space repo, git history matches at commit `3dcf523`).
226
 
227
+ **Issue:** HF Spaces infrastructure causes LLM to return empty/None responses during synthesis.
228
 
229
+ **Known Limitations (Local 30% Run):**
 
 
 
230
 
231
+ - 3 system errors: reverse text (calculator), chess vision (NoneType), Python .py execution
232
+ - 10 "Unable to answer": search evidence extraction issues
233
+ - 1 wrong answer: Wikipedia dinosaur (Jimfbleak vs FunkMonk)
234
 
235
+ **Resolution:** Competition accepts local results. HF Spaces deployment not required.
 
 
 
 
 
 
236
 
237
+ **Status:** OPEN - Infrastructure Issue, Won't Fix (use local execution)
 
 
 
 
 
 
 
238
 
239
+ ## [2026-01-13] [Infrastructure] [COMPLETED] 3-Tier Folder Naming Convention
240
 
241
+ **Problem:** Previous rename used `_` prefix for both runtime folders AND user-only folders, creating ambiguity.
 
 
242
 
243
+ **Solution:** Implemented 3-tier naming convention to clearly distinguish folder purposes.
244
 
245
+ **3-Tier Convention:**
 
 
246
 
247
+ 1. **User-only** (`user_*` prefix) - Manual use, not app runtime:
248
 
249
+ - `user_input/` - User testing files, not app input
250
+ - `user_output/` - User downloads, not app output
251
+ - `user_dev/` - Dev records (manual documentation)
252
+ - `user_archive/` - Archived code/reference materials
253
 
254
+ 2. **Runtime/Internal** (`_` prefix) - App creates, temporary:
255
 
256
+ - `_cache/` - Runtime cache, served via app download
257
+ - `_log/` - Runtime logs, debugging
258
 
259
+ 3. **Application** (no prefix) - Permanent code:
260
+ - `src/`, `test/`, `docs/`, `ref/` - Application folders
261
 
262
+ **Folders Renamed:**
 
 
 
263
 
264
+ - `_input/` → `user_input/` (user testing files)
265
+ - `_output/` → `user_output/` (user downloads)
266
+ - `dev/` → `user_dev/` (dev records)
267
+ - `archive/` → `user_archive/` (archived materials)
268
 
269
+ **Folders Unchanged (correct tier):**
 
 
 
 
 
 
270
 
271
+ - `_cache/`, `_log/` - Runtime ✓
272
+ - `src/`, `test/`, `docs/`, `ref/` - Application ✓
273
 
274
+ **Updated Files:**
275
 
276
+ - **test/test_phase0_hf_vision_api.py** - `Path("_output")` `Path("user_output")`
277
+ - **.gitignore** - Updated folder references and comments
278
 
279
+ **Git Status:**
280
 
281
+ - Old folders removed from git tracking
282
+ - New folders excluded by .gitignore
283
+ - Existing files become untracked
284
 
285
+ **Result:** Clear 3-tier structure: user*\*, *\*, and no prefix
286
 
287
+ ## [2026-01-13] [Infrastructure] [COMPLETED] Runtime Folder Naming Convention - Underscore Prefix
288
 
289
+ **Problem:** Folders `log/`, `output/`, and `input/` didn't clearly indicate they were runtime-only storage, making it unclear which folders are internal vs permanent.
 
 
290
 
291
+ **Solution:** Renamed all runtime-only folders to use `_` prefix, following Python convention for internal/private.
292
 
293
+ **Folders Renamed:**
 
 
 
 
 
294
 
295
+ - `log/` → `_log/` (runtime logs, debugging)
296
+ - `output/` → `_output/` (runtime results, user downloads)
297
+ - `input/` → `_input/` (user testing files, not app input)
298
 
299
+ **Rationale:**
 
 
 
 
 
300
 
301
+ - `_` prefix signals "internal, temporary, not part of public API"
302
+ - Consistent with Python convention (`_private`, `__dunder__`)
303
+ - Distinguishes runtime storage from permanent project folders
304
 
305
+ **Updated Files:**
 
 
 
306
 
307
+ - `src/agent/llm_client.py` - `Path("log")` `Path("_log")`
308
+ - `src/tools/youtube.py` - `Path("log")` → `Path("_log")`
309
+ - `test/test_phase0_hf_vision_api.py` - `Path("output")` → `Path("_output")`
310
+ - `.gitignore` - Updated folder references
311
 
312
+ **Result:** Runtime folders now clearly marked with `_` prefix
313
 
314
+ ## [2026-01-13] [Documentation] [COMPLETED] Log Consolidation - Session-Level Logging
315
 
316
+ **Problem:** Each question created separate log file (`llm_context_TIMESTAMP.txt`), polluting the log/ folder with 20+ files per evaluation.
317
 
318
+ **Solution:** Implemented session-level log file where all questions append to single file.
319
 
320
+ **Implementation:**
321
 
322
+ - Added `get_session_log_file()` function in `src/agent/llm_client.py`
323
+ - Creates `log/llm_session_YYYYMMDD_HHMMSS.txt` on first use
324
+ - All questions append to same file with question delimiters
325
+ - Added `reset_session_log()` for testing/new runs
326
 
327
+ **Updated File:**
 
 
328
 
329
+ - `src/agent/llm_client.py` (~40 lines added)
330
+ - Session log management (lines 62-99)
331
+ - Updated `synthesize_answer_hf` to append to session log
332
 
333
+ **Result:** One log file per evaluation instead of 20+
 
 
 
334
 
335
+ ## [2026-01-13] [Infrastructure] [COMPLETED] Project Template Reference Move
336
 
337
+ **Problem:** Project template moved to new location, documentation references outdated.
 
 
 
338
 
339
+ **Solution:** Updated CHANGELOG.md references to new template location.
340
 
341
+ **Changes:**
342
 
343
+ - Moved: `project_template_original/` `ref/project_template_original/`
344
+ - Updated CHANGELOG.md (7 occurrences)
345
+ - Added `ref/` to .gitignore (static copies, not in git)
346
 
347
+ **Result:** Documentation reflects new template location
348
 
349
+ ## [2026-01-12] [Infrastructure] [COMPLETED] Git Ignore Fixes - PDF Commit Block
350
 
351
+ **Problem:** Git push rejected due to binary files in `docs/` folder.
352
 
353
+ **Solution:**
354
 
355
+ 1. Reset commit: `git reset --soft HEAD~1`
356
+ 2. Added `docs/*.pdf` to .gitignore
357
+ 3. Removed PDF files from git: `git rm --cached "docs/*.pdf"`
358
+ 4. Recommitted without PDFs
359
+ 5. Push successful
360
 
361
+ **User feedback:** "can just gitignore all the docs also"
362
 
363
+ **Final Fix:** Changed `docs/*.pdf` to `docs/` to ignore entire docs folder
 
 
 
 
364
 
365
+ **Updated Files:**
366
 
367
+ - `.gitignore` - Added `docs/` folder ignore
 
 
 
 
 
 
368
 
369
+ **Result:** Clean git history, no binary files committed
370
 
371
+ ## [2026-01-13] [Documentation] [COMPLETED] 30% Results Analysis - Phase 1 Success
 
 
 
 
 
 
 
 
372
 
373
+ **Problem:** Need to analyze results to understand what's working and what needs improvement.
374
 
375
+ **Analysis of gaia_results_20260113_174815.json (30% score):**
 
 
376
 
377
+ **Results Breakdown:**
378
 
379
+ - **6 Correct** (30%):
 
 
 
 
380
 
381
+ - `a1e91b78` (YouTube bird count) - Phase 1 fix working ✓
382
+ - `9d191bce` (YouTube Teal'c) - Phase 1 fix working ✓
383
+ - `6f37996b` (CSV table) - Calculator working ✓
384
+ - `1f975693` (Calculus MP3) - Audio transcription working ✓
385
+ - `99c9cc74` (Strawberry pie MP3) - Audio transcription working ✓
386
+ - `7bd855d8` (Excel food sales) - File parsing working ✓
387
 
388
+ - **3 System Errors** (15%):
 
 
 
 
389
 
390
+ - `2d83110e` (Reverse text) - Calculator: SyntaxError
391
+ - `cca530fc` (Chess position) - NoneType error (vision)
392
+ - `f918266a` (Python code) - parse_file: ValueError
393
 
394
+ - **10 "Unable to answer"** (50%):
 
 
395
 
396
+ - Search evidence extraction insufficient
397
+ - Need better LLM prompts or search processing
398
 
399
+ - **1 Wrong Answer** (5%):
400
+ - `4fc2f1ae` (Wikipedia dinosaur) - Found "Jimfbleak" instead of "FunkMonk"
 
 
401
 
402
+ **Phase 1 Impact (YouTube + Audio):**
403
 
404
+ - Fixed 4 questions that would have failed before
405
+ - YouTube transcription with Whisper fallback working
406
+ - Audio transcription working well
 
407
 
408
  **Next Steps:**
409
 
410
+ 1. Fix 3 system errors (text manipulation, vision NoneType, Python execution)
411
+ 2. Improve search evidence extraction (10 questions)
412
+ 3. Investigate wrong answer (Wikipedia search precision)
413
 
414
+ ## [2026-01-13] [Feature] [COMPLETED] Phase 1: YouTube + Audio Transcription Support
415
 
416
+ **Problem:** Questions with YouTube videos and audio files couldn't be answered.
 
417
 
418
+ **Solution:** Implemented two-phase transcription system.
419
 
420
+ **YouTube Transcription (`src/tools/youtube.py`):**
421
 
422
+ - Extracts transcript using `youtube_transcript_api`
423
+ - Falls back to Whisper audio transcription if captions unavailable
424
+ - Saves transcript to `_log/{video_id}_transcript.txt`
425
 
426
+ **Audio Transcription (`src/tools/audio.py`):**
 
 
 
 
 
427
 
428
+ - Uses Groq's Whisper-large-v3 model (ZeroGPU compatible)
429
+ - Supports MP3, WAV, M4A, OGG, FLAC, AAC formats
430
+ - Saves transcript to `_log/` for debugging
431
 
432
+ **Impact:**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
433
 
434
+ - 4 additional questions answered correctly (30% vs ~10% before)
435
+ - `9d191bce` (YouTube Teal'c) - "Extremely" ✓
436
+ - `a1e91b78` (YouTube birds) - "3" ✓
437
+ - `1f975693` (Calculus MP3) - "132, 133, 134, 197, 245" ✓
438
+ - `99c9cc74` (Strawberry pie MP3) - Full ingredient list ✓
439
 
440
+ **Status:** Phase 1 complete, hit 30% target score
441
 
442
+ ## [2026-01-12] [Infrastructure] [COMPLETED] Session Log Implementation
443
 
444
+ **Problem:** Need to track LLM synthesis context for debugging and analysis.
 
 
 
 
445
 
446
+ **Solution:** Created session-level logging system in `src/agent/llm_client.py`.
447
 
448
+ **Implementation:**
449
 
450
+ - Session log: `_log/llm_session_YYYYMMDD_HHMMSS.txt`
451
+ - Per-question log: `_log/{video_id}_transcript.txt` (YouTube only)
452
+ - Captures: questions, evidence items, LLM prompts, answers
453
+ - Structured format with timestamps and delimiters
454
 
455
+ **Result:** Full audit trail for debugging failed questions
456
 
457
+ ## [2026-01-13] [Infrastructure] [COMPLETED] Git Commit & HF Push
 
 
 
458
 
459
+ **Problem:** Need to deploy changes to HuggingFace Spaces.
460
 
461
+ **Solution:** Committed and pushed latest changes.
 
 
462
 
463
+ **Commit:** `3dcf523` - "refactor: update folder structure and adjust output paths"
464
 
465
+ **Changes Deployed:**
 
 
466
 
467
+ - 3-tier folder naming convention
468
+ - Session-level logging
469
+ - Project template reference move
470
+ - Git ignore fixes
471
 
472
+ **Result:** HF Space updated with latest code
 
 
473
 
474
+ ## [2026-01-13] [Testing] [COMPLETED] Phase 0 Vision API Validation
 
 
 
475
 
476
+ **Problem:** Need to validate vision API works before integrating into agent.
477
 
478
+ **Solution:** Created test suite `test/test_phase0_hf_vision_api.py`.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
479
 
480
+ **Test Results:**
 
 
 
 
481
 
482
+ - Tested 4 image sources
483
+ - Validated multimodal LLM responses
484
+ - Confirmed HF Inference API compatibility
485
+ - Identified NoneType edge case (empty responses)
486
 
487
+ **File:** `user_io/result_ServerApp/phase0_vision_validation_*.json`
488
 
489
+ **Result:** Vision API validated, ready for integration
490
 
491
+ ## [2026-01-11] [Feature] [COMPLETED] Multi-Modal Vision Support
492
 
493
+ **Problem:** Agent couldn't process image-based questions (chess positions, charts, etc.).
494
 
495
+ **Solution:** Implemented vision tool using HuggingFace Inference API.
496
 
497
+ **Implementation (`src/tools/vision.py`):**
 
 
498
 
499
+ - `analyze_image()` - Main vision analysis function
500
+ - Supports JPEG, PNG, GIF, BMP, WebP formats
501
+ - Returns detailed descriptions of visual content
502
+ - Fallback to Gemini/Claude if HF fails
503
 
504
+ **Status:** Implemented, some NoneType errors remain
 
 
 
 
 
 
 
 
505
 
506
+ ## [2026-01-10] [Feature] [COMPLETED] File Parser Tool
507
 
508
+ **Problem:** Agent couldn't read uploaded files (PDF, Excel, Word, CSV, etc.).
509
 
510
+ **Solution:** Implemented unified file parser (`src/tools/file_parser.py`).
 
 
 
511
 
512
+ **Supported Formats:**
513
 
514
+ - PDF (`parse_pdf`) - PyPDF2 extraction
515
+ - Excel (`parse_excel`) - Calamine-based parsing
516
+ - Word (`parse_word`) - python-docx extraction
517
+ - Text/CSV (`parse_text`) - UTF-8 text reading
518
+ - Unified `parse_file()` - Auto-detects format
519
 
520
+ **Result:** Agent can now read file attachments
 
 
521
 
522
+ ## [2026-01-09] [Feature] [COMPLETED] Calculator Tool
 
 
 
523
 
524
+ **Problem:** Agent couldn't perform mathematical calculations.
525
 
526
+ **Solution:** Implemented safe expression evaluator (`src/tools/calculator.py`).
 
 
 
 
527
 
528
+ **Features:**
529
 
530
+ - `safe_eval()` - Safe math expression evaluation
531
+ - Supports: arithmetic, algebra, trigonometry, logarithms
532
+ - Constants: pi, e
533
+ - Functions: sqrt, sin, cos, log, abs, etc.
534
+ - Error handling for invalid expressions
535
 
536
+ **Result:** CSV table question answered correctly (`6f37996b`)
537
 
538
+ ## [2026-01-08] [Feature] [COMPLETED] Web Search Tool
539
 
540
+ **Problem:** Agent couldn't access current information beyond training data.
541
 
542
+ **Solution:** Implemented web search using Tavily API (`src/tools/web_search.py`).
 
 
 
 
543
 
544
+ **Features:**
 
545
 
546
+ - `tavily_search()` - Primary search via Tavily
547
+ - `exa_search()` - Fallback via Exa (if available)
548
+ - Unified `search()` - Auto-fallback chain
549
+ - Returns structured results with titles, snippets, URLs
550
 
551
+ **Configuration:**
552
 
553
+ - `TAVILY_API_KEY` required
554
+ - `EXA_API_KEY` optional (fallback)
555
 
556
+ **Result:** Agent can now search web for current information
 
 
 
557
 
558
+ ## [2026-01-07] [Infrastructure] [COMPLETED] Project Initialization
 
 
 
 
 
559
 
560
+ **Problem:** New project setup required.
561
 
562
+ **Solution:** Initialized project structure with standard files.
563
 
564
+ **Created:**
 
 
 
565
 
566
+ - `README.md` - Project documentation
567
+ - `CLAUDE.md` - Project-specific AI instructions
568
+ - `CHANGELOG.md` - Session tracking
569
+ - `.gitignore` - Git exclusions
570
+ - `requirements.txt` - Dependencies
571
+ - `pyproject.toml` - UV package config
572
 
573
+ **Result:** Project scaffold ready for development
574
 
575
+ **Date:** YYYY-MM-DD
576
+ **Dev Record:** [link to dev/dev_YYMMDD_##_concise_title.md]
 
 
 
577
 
578
+ ## What Was Changed
579
 
580
+ - Change 1
581
+ - Change 2
 
 
CLAUDE.md CHANGED
@@ -4,26 +4,57 @@
4
 
5
  ## Logging Standard
6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7
  **Console Output (Status Workflow):**
8
  - **Compressed status updates:** `[node] ✓ result` or `[node] ✗ error`
9
  - **Progress indicators:** `[1/1] Processing task_id`, `[1/20]` for batch
10
  - **Key milestones only:** 3-4 statements vs verbose logs
11
  - **Node labels:** `[plan]`, `[execute]`, `[answer]` with success/failure
12
 
13
- **Log Files (log/ folder):**
14
- - **llm_context_TIMESTAMP.txt** - Full LLM prompts, evidence, answers for debugging
15
- - **{video_id}_transcript.txt** - Raw transcripts from YouTube/Whisper
16
  - **Purpose:** Post-run analysis, context preservation, audit trail
17
- - **Format:** Structured headers with timestamp, question, evidence items, full content
18
 
19
- **Log Format Examples:**
20
  ```
21
  [plan] ✓ 660 chars
22
  [execute] 1 tool(s) selected
23
  [1/1] youtube_transcript ✓
24
  [execute] 1 tools, 1 evidence
25
  [answer] ✓ 3
26
- Context saved to: log/llm_context_20260113_022706.txt
27
  ```
28
 
29
  **Note:** Explicit user request overrides global rule about "no logs/ folder"
 
4
 
5
  ## Logging Standard
6
 
7
+ **Unified Log Format (All log files MUST use Markdown):**
8
+
9
+ - File extension: `.md` (not `.txt`)
10
+ - Headers: `# Title`, `## Section`, `### Subsection`
11
+ - Metadata: `**Key:** value`
12
+ - Code blocks: Triple backticks with language identifier
13
+ - Token-efficient: Use `##` headings instead of `====` separators (95% token savings)
14
+
15
+ **Log File Structure Template:**
16
+ ```markdown
17
+ # Log Title
18
+
19
+ **Session Start:** YYYY-MM-DDTHH:MM:SS
20
+ **Key:** value
21
+
22
+ ## Section [timestamp]
23
+
24
+ **Question:** ...
25
+ **Evidence items:** N
26
+
27
+ ### Subsection
28
+
29
+ ```text
30
+ Content here
31
+ ```
32
+
33
+ **Result:** value
34
+
35
+ ## Next Section
36
+ ```
37
+
38
  **Console Output (Status Workflow):**
39
  - **Compressed status updates:** `[node] ✓ result` or `[node] ✗ error`
40
  - **Progress indicators:** `[1/1] Processing task_id`, `[1/20]` for batch
41
  - **Key milestones only:** 3-4 statements vs verbose logs
42
  - **Node labels:** `[plan]`, `[execute]`, `[answer]` with success/failure
43
 
44
+ **Log Files (_log/ folder):**
45
+ - `llm_session_*.md` - LLM synthesis session with questions, evidence, responses
46
+ - `{video_id}_transcript.md` - Raw transcripts from YouTube/Whisper
47
  - **Purpose:** Post-run analysis, context preservation, audit trail
48
+ - **Benefits:** Collapsible headings in editors, token-efficient, readable in plain text
49
 
50
+ **Console Format Example:**
51
  ```
52
  [plan] ✓ 660 chars
53
  [execute] 1 tool(s) selected
54
  [1/1] youtube_transcript ✓
55
  [execute] 1 tools, 1 evidence
56
  [answer] ✓ 3
57
+ Session saved to: _log/llm_session_20260113_022706.md
58
  ```
59
 
60
  **Note:** Explicit user request overrides global rule about "no logs/ folder"
app.py CHANGED
@@ -421,6 +421,7 @@ def process_single_question(agent, item, index, total):
421
 
422
  def run_and_submit_all(
423
  llm_provider: str,
 
424
  question_limit: int = 0,
425
  task_ids: str = "",
426
  profile: gr.OAuthProfile | None = None,
@@ -431,6 +432,7 @@ def run_and_submit_all(
431
 
432
  Args:
433
  llm_provider: LLM provider to use
 
434
  question_limit: Limit number of questions (0 = process all)
435
  task_ids: Comma-separated task IDs to target (overrides question_limit)
436
  profile: OAuth profile for HF login
@@ -456,6 +458,10 @@ def run_and_submit_all(
456
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
457
  logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")
458
 
 
 
 
 
459
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
460
  try:
461
  logger.info("Initializing GAIAAgent...")
@@ -728,6 +734,12 @@ with gr.Blocks() as demo:
728
  value="HuggingFace",
729
  info="Select which LLM to use for all questions",
730
  )
 
 
 
 
 
 
731
  eval_question_limit = gr.Number(
732
  label="Question Limit (Debug)",
733
  value=0,
@@ -760,6 +772,7 @@ with gr.Blocks() as demo:
760
  fn=run_and_submit_all,
761
  inputs=[
762
  eval_llm_provider_dropdown,
 
763
  eval_question_limit,
764
  eval_task_ids,
765
  ],
 
421
 
422
  def run_and_submit_all(
423
  llm_provider: str,
424
+ video_mode: str = "Transcript",
425
  question_limit: int = 0,
426
  task_ids: str = "",
427
  profile: gr.OAuthProfile | None = None,
 
432
 
433
  Args:
434
  llm_provider: LLM provider to use
435
+ video_mode: YouTube processing mode ("Transcript" or "Frames")
436
  question_limit: Limit number of questions (0 = process all)
437
  task_ids: Comma-separated task IDs to target (overrides question_limit)
438
  profile: OAuth profile for HF login
 
458
  os.environ["LLM_PROVIDER"] = llm_provider.lower()
459
  logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")
460
 
461
+ # Set YouTube video processing mode from UI selection
462
+ os.environ["YOUTUBE_MODE"] = video_mode.lower()
463
+ logger.info(f"UI Config for Full Evaluation: YOUTUBE_MODE={video_mode}")
464
+
465
  # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
466
  try:
467
  logger.info("Initializing GAIAAgent...")
 
734
  value="HuggingFace",
735
  info="Select which LLM to use for all questions",
736
  )
737
+ eval_video_mode = gr.Radio(
738
+ label="YouTube Processing Mode",
739
+ choices=["Transcript", "Frames"],
740
+ value="Transcript",
741
+ info="Transcript: Audio/subtitle extraction (fast) | Frames: Visual analysis with vision models (slower)",
742
+ )
743
  eval_question_limit = gr.Number(
744
  label="Question Limit (Debug)",
745
  value=0,
 
772
  fn=run_and_submit_all,
773
  inputs=[
774
  eval_llm_provider_dropdown,
775
+ eval_video_mode,
776
  eval_question_limit,
777
  eval_task_ids,
778
  ],
brainstorming_phase1_youtube.md DELETED
@@ -1,446 +0,0 @@
1
- # Phase 1 Brainstorming - YouTube Transcript Support
2
-
3
- **Date:** 2026-01-13
4
- **Status:** Discussion Phase
5
- **Goal:** Fix questions #3 and #5 (YouTube videos) → 40% score
6
-
7
- ---
8
-
9
- ## Question Analysis
10
-
11
- | Question | Task ID | Description | Expected Answer | Type |
12
- | -------- | -------------------------------------- | ------------------------------- | --------------- | ------------- |
13
- | #3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | "3" | Content-based |
14
- | #5 | (Teal'c quote) | YouTube video - character quote | "Extremely" | Dialogue |
15
-
16
- **Conclusion:** Both are content-based questions → transcript approach should work ✅
17
-
18
- ---
19
-
20
- ## Library Options
21
-
22
- ### Option A: youtube-transcript-api ⭐ Recommended
23
-
24
- - **Pros:** Simple API, actively maintained, no video download needed, fast
25
- - **Cons:** May fail on videos without captions, regional restrictions
26
- - **Use case:** Start here for simplicity
27
-
28
- ### Option B: yt-dlp + transcript extraction
29
-
30
- - **Pros:** More robust, can fall back to auto-generated captions
31
- - **Cons:** Heavier dependency, slower
32
- - **Use case:** Backup if Option A has high failure rate
33
-
34
- ### Option C: Direct YouTube API
35
-
36
- - **Pros:** Most control
37
- - **Cons:** Requires API key, more complex
38
- - **Use case:** Probably overkill for this use case
39
-
40
- ---
41
-
42
- ## Frame Extraction: Corrected Analysis
43
-
44
- **Key insight:** Frame extraction itself is FAST. The "slow" parts are download + vision API processing.
45
-
46
- ### Actual Timing Breakdown
47
-
48
- | Step | Time (10-min video) | Notes |
49
- | -------------------- | ------------------- | -------------------------------------- |
50
- | **Download** | 30s - 3 min | Network I/O, one-time cost |
51
- | **Frame extraction** | **5 - 20 sec** | ffmpeg is I/O bound, very efficient ⚡ |
52
- | **Vision API calls** | 20s - 5 min | Sequential: 600 frames × 2-5s each |
53
-
54
- **Reality check:** You can extract 600 frames from a local 10-min video in under 15 seconds with ffmpeg. The "slow" part is vision model API calls, not the extraction.
55
-
56
- **Bottom line:** Frame extraction is cheap compute. Vision processing is expensive compute.
57
-
58
- ### Comparison
59
-
60
- | Approach | What's Fast | What's Slow | Total Time |
61
- | -------------------- | ------------------ | ------------------------------------------- | ---------------- |
62
- | **Transcript** | API call (1-3s) | - | **1-3 seconds** |
63
- | **Frame Extraction** | Extraction (5-20s) | Download (30s-3min) + Vision API (20s-5min) | **1-10 minutes** |
64
-
65
- ### Do Tools Matter?
66
-
67
- | Tool | Speed (extraction only) | Verdict |
68
- | ------- | ----------------------- | --------------- |
69
- | ffmpeg | ⚡⚡⚡ Fastest (5-10s) | Best choice |
70
- | OpenCV | ⚡⚡ Fast (10-20s) | Standard choice |
71
- | moviepy | ⚡ Medium (20-40s) | Python overhead |
72
-
73
- **For extraction alone:** Tools matter, but all are fast enough.
74
-
75
- ### When Is Frame Extraction Worth It?
76
-
77
- **Only when:**
78
-
79
- - Question is purely visual (no audio/transcript available)
80
- - Visual information is NOT in video thumbnail/title/description
81
- - You have no other choice
82
-
83
- **Examples where necessary:**
84
-
85
- - "What color shirt is the person wearing at 2:35?"
86
- - "Count the number of cars visible in the video"
87
- - "Describe the visual style of the opening scene"
88
-
89
- **For GAIA #3 and #5:**
90
-
91
- - Both are content-based (species mentioned, dialogue)
92
- - Transcript is still fastest (1-3s vs 1-10 min total)
93
- - Frame extraction as fallback is viable (extraction is fast, but vision processing is slow)
94
-
95
- **Decision:** Transcript-first approach is correct. Frame extraction is viable fallback if transcript unavailable, but total time still 1-10 min due to download + vision API.
96
-
97
- ---
98
-
99
- ## Fallback Strategy
100
-
101
- **Scenario:** Video has no transcript available
102
-
103
- **Options:**
104
-
105
- 1. **Return error** → LLM treats as system_error, skips question ✅ Simple
106
- 2. **Download + extract frames** → Use vision tool (heavy, slow)
107
- 3. **Return metadata** (title, description) → LLM infers from context
108
- 4. **Chain approach:** Transcript → Metadata → Frames
109
-
110
- **Decision:** Start with audio-to-text fallback (Whisper on ZeroGPU) for higher success rate.
111
-
112
- ---
113
-
114
- ## Audio-to-Text Fallback: When No Transcript Available
115
-
116
- ### The Hierarchy
117
-
118
- ```
119
- YouTube URL
120
-
121
- ├─ Has transcript? ✅ → Use youtube-transcript-api (instant, 1-3 sec)
122
-
123
- └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
124
- ```
125
-
126
- ### Whisper Cost Analysis
127
-
128
- | Option | Cost | Speed | Verdict |
129
- | --------------- | ---------- | -------------- | ------------------ |
130
- | OpenAI API | $0.006/min | ⚡⚡⚡ Fastest | If budget OK |
131
- | **Open Source** | **FREE** | ⚡⚡ Fast | ⭐ **Recommended** |
132
- | HuggingFace | FREE | ⚡⚡ Fast | Good alternative |
133
-
134
- **Decision:** Open-source Whisper (free, no API limits, works offline)
135
-
136
- ---
137
-
138
- ### HF Hardware: ZeroGPU ✅
139
-
140
- | Resource | Available | Whisper Requirements | Verdict |
141
- | ---------- | ----------- | ------------------------- | --------------------------------- |
142
- | **CPU** | 4 vCPUs | 1+ cores | ✅ Plenty |
143
- | **Memory** | 16 GB RAM | 1-10 GB (model-dependent) | ✅ Comfortable |
144
- | **Disk** | 20 GB | ~150 MB - 1.5 GB | ✅ More than enough |
145
- | **GPU** | **ZeroGPU** | Optional (faster) | ✅ **Available via subscription** |
146
-
147
- **ZeroGPU Benefits:**
148
-
149
- - ✅ Dynamic GPU allocation (5-10x faster than CPU)
150
- - ✅ Can use larger models (`small`, `medium`) for better accuracy
151
- - ✅ Still free (subscription benefit)
152
-
153
- **ZeroGPU Requirement:**
154
-
155
- ⚠️ **Critical:** ZeroGPU requires `@spaces.GPU` decorator on at least one function.
156
-
157
- **Error without decorator:**
158
-
159
- ```
160
- runtime error: No @spaces.GPU function detected during startup
161
- ```
162
-
163
- **Solution:**
164
-
165
- ```python
166
- from spaces import GPU
167
-
168
- @spaces.GPU # Required for ZeroGPU
169
- def transcribe_audio(file_path: str) -> str:
170
- # Whisper code here
171
- pass
172
- ```
173
-
174
- **How it works:**
175
-
176
- - ZeroGPU scans codebase for `@spaces.GPU` decorator at startup
177
- - If found: Allocates GPU when function is called
178
- - If not found: Kills container immediately (no GPU work planned)
179
-
180
- ### Performance: CPU vs ZeroGPU
181
-
182
- | Model | On CPU | On ZeroGPU | Speedup |
183
- | -------- | --------- | ------------- | ------- |
184
- | `base` | 30-60 sec | **5-10 sec** | 5-10x |
185
- | `small` | 1-2 min | **10-20 sec** | 5-10x |
186
- | `medium` | 3-5 min | **20-40 sec** | 5-10x |
187
-
188
- **For 5-minute YouTube video on ZeroGPU:**
189
-
190
- - `base` model: ~5-10 seconds ⚡⚡⚡
191
- - `small` model: ~10-20 seconds ⚡⚡
192
-
193
- ### Recommended Model for ZeroGPU
194
-
195
- | Model | Size | Accuracy | Speed (ZeroGPU) | Recommendation |
196
- | -------- | ------ | --------- | --------------- | ---------------------- |
197
- | `tiny` | 39 MB | Lower | ~5 sec | Fastest, less accurate |
198
- | `base` | 74 MB | Good | ~10 sec | Good balance |
199
- | `small` | 244 MB | Better | ~20 sec | ⭐ **Recommended** |
200
- | `medium` | 769 MB | Very good | ~40 sec | If accuracy critical |
201
-
202
- **Choice:** `small` model - best accuracy/speed balance on ZeroGPU
203
-
204
- ### Implementation: Audio-to-Text Fallback
205
-
206
- ```python
207
- import whisper
208
- from spaces import GPU # Required for ZeroGPU
209
-
210
- _MODEL = None # Cache model globally
211
-
212
- @spaces.GPU # Required: ZeroGPU detects this decorator at startup
213
- def transcribe_audio(file_path: str) -> str:
214
- """Transcribe audio file using Whisper (ZeroGPU)."""
215
- global _MODEL
216
- try:
217
- if _MODEL is None:
218
- # ZeroGPU auto-detects GPU, no manual device specification
219
- _MODEL = whisper.load_model("small")
220
-
221
- result = _MODEL.transcribe(file_path)
222
- return result["text"]
223
- except Exception as e:
224
- return f"ERROR: Transcription failed: {e}"
225
- ```
226
-
227
- ---
228
-
229
- ### Unified Architecture: Phase 1 + Phase 2
230
-
231
- ```
232
- ┌─────────────────────────────────────────────────────────┐
233
- │ Audio Transcription │
234
- │ (transcribe_audio function) │
235
- │ Uses Whisper │
236
- │ on ZeroGPU │
237
- └─────────────────────────────────────────────────────────┘
238
-
239
-
240
- ┌───────────────────┴───────────────────┐
241
- │ │
242
- Phase 1 Phase 2
243
- YouTube URLs MP3 Files
244
- │ │
245
- │ 1. Try youtube-transcript-api │
246
- │ 2. Fallback: download audio only │
247
- │ 3. Call transcribe_audio() │
248
- │ │
249
- └───────────────────┬───────────────────┘
250
-
251
- Clean transcript
252
-
253
-
254
- LLM analyzes
255
- ```
256
-
257
- **Benefits:**
258
-
259
- - Single audio processing codebase
260
- - `transcribe_audio()` works for both phases
261
- - Tested on HF ZeroGPU hardware
262
- - Higher success rate than skip-only approach
263
-
264
- ---
265
-
266
- ## Tool Design - LLM Integration
267
-
268
- **Current problem:** Vision tool tries to process YouTube URL → fails
269
-
270
- **Proposed tool description:**
271
-
272
- ```
273
- "Extract transcript from YouTube video URL. Use when question asks about
274
- YouTube video content like: dialogue, speech, bird species identification,
275
- character quotes, or any content discussed in the video. Input: YouTube URL.
276
- Returns: Full transcript text or error message if transcript unavailable."
277
- ```
278
-
279
- **Alternative: Special URL handling in `parse_file()`**
280
-
281
- - Detect YouTube URLs
282
- - Return tool suggestion: "This is a YouTube URL. Consider using youtube_transcript tool."
283
-
284
- ---
285
-
286
- ## Implementation Considerations
287
-
288
- ### A. Video ID Extraction
289
-
290
- Handle various YouTube URL formats:
291
-
292
- - `youtube.com/watch?v=VIDEO_ID`
293
- - `youtu.be/VIDEO_ID`
294
- - `youtube.com/shorts/VIDEO_ID`
295
-
296
- ### B. Language Handling
297
-
298
- - GAIA questions are English → likely English transcripts
299
- - Question: Should we auto-translate or let LLM handle?
300
-
301
- ### C. Transcript Format
302
-
303
- - Raw JSON with timestamps vs clean text
304
- - LLM prefers clean text without timestamps
305
- - Question: Preserve timestamps for context?
306
-
307
- ### D. Error Types
308
-
309
- - No transcript available
310
- - Video private/deleted
311
- - Rate limiting
312
- - Regional restriction
313
-
314
- ---
315
-
316
- ## Testing Strategy
317
-
318
- **Before full evaluation:**
319
-
320
- 1. **Unit test** - Test on actual GAIA YouTube URLs
321
- 2. **Manual test** - Run single question (#3) to verify LLM uses tool correctly
322
- 3. **Integration test** - Verify transcript → answer pipeline
323
-
324
- **Question:** Do we have access to actual YouTube URLs for pre-testing?
325
-
326
- ---
327
-
328
- ## Edge Cases
329
-
330
- | Scenario | Handling |
331
- | --------------------------------- | --------------------------------- |
332
- | Multiple transcript languages | Pick English or first available |
333
- | Auto-generated transcript | Accept (less accurate but usable) |
334
- | YouTube Shorts format | Extract VIDEO_ID from shorts URL |
335
- | Segmented transcript (by speaker) | Clean to plain text |
336
-
337
- ---
338
-
339
- ## Recommendations
340
-
341
- 1. **Start simple:** youtube-transcript-api with clear error messages
342
- 2. **Fail gracefully:** If no transcript, return structured error → system_error=yes
343
- 3. **Tool description:** Emphasize "YouTube video content" for LLM selection
344
- 4. **Manual test first:** Verify on question #3 before full evaluation
345
- 5. **Success metric:** Both questions correct → 40% score ✅ TARGET REACHED
346
-
347
- ---
348
-
349
- ## Open Questions
350
-
351
- - [ ] Implement fallback to frame extraction if transcript fails?
352
- - [ ] Add special YouTube URL detection in `parse_file()`?
353
- - [ ] Access to actual YouTube URLs for pre-testing?
354
- - [ ] Simple first vs comprehensive solution?
355
-
356
- ---
357
-
358
- ## Files to Create
359
-
360
- - `src/tools/audio.py` - Whisper transcription with @spaces.GPU (unified Phase 1+2)
361
- - `src/tools/youtube.py` - YouTube transcript extraction with audio fallback
362
- - Update `src/tools/__init__.py` - Register youtube_transcript and transcribe_audio tools
363
- - Update `requirements.txt` - Add youtube-transcript-api, openai-whisper, yt-dlp
364
-
365
- ---
366
-
367
- ## Industry Validation ✅
368
-
369
- **Overall Assessment:** Approach validated and aligns with industry standards.
370
-
371
- ### Core Architecture Validation
372
-
373
- | Component | Our Approach | Industry Standard | Status |
374
- | ---------------- | -------------------------- | ------------------------------------------------- | ------------ |
375
- | Primary method | Transcript-first | youtube-transcript-api → Whisper fallback | ✅ Confirmed |
376
- | Library choice | youtube-transcript-api | Widely used (LangChain, CrewAI, 1K+ GitHub repos) | ✅ Standard |
377
- | Fallback method | Whisper on ZeroGPU | yt-dlp + Whisper (OpenAI API or self-hosted) | ✅ Optimal |
378
- | Frame extraction | Skip for content questions | Only for visual queries | ✅ Validated |
379
-
380
- ### Key Findings
381
-
382
- **Transcript-First Approach:**
383
-
384
- - LangChain's YoutubeLoader uses youtube-transcript-api as primary
385
- - CrewAI demonstrates YouTube transcript → Gemini LLM workflow
386
- - 92% of English tech videos have auto-captions available
387
- - Industry standard: transcript → LLM pattern
388
-
389
- **Frame Extraction Performance:**
390
-
391
- - ffmpeg decodes at 30-100x realtime speed
392
- - 10-min video extracts in 5-20 seconds (CPU) ✅ Confirmed
393
- - Bottleneck is vision API calls, not extraction ✅ Confirmed
394
-
395
- **Vision Processing Costs:**
396
- | Model | Cost per 600 frames (10-min video) |
397
- |-------|-----------------------------------|
398
- | GPT-4o | $1.80-3.60 |
399
- | Claude 3.5 | $2.16 |
400
- | Gemini 2.5 Flash | $23.40 |
401
-
402
- **Whisper Fallback:**
403
-
404
- - Industry standard: yt-dlp for audio → Whisper transcription
405
- - ZeroGPU approach is optimal for HF environment
406
- - Benchmark: Whisper.cpp transcribes 10-min clips in <90 seconds on M2 MacBook (CPU)
407
- - ZeroGPU with H200: 5-20 seconds for `small` model ✅ Estimate correct
408
-
409
- ### Industry Pattern
410
-
411
- **Standard workflow (validated):**
412
-
413
- 1. Try native transcript API (fast, free)
414
- 2. Fallback to audio transcription (Whisper)
415
- 3. Frame extraction only for visual-specific queries
416
- 4. Vision LLM last resort (expensive, slow)
417
-
418
- ### Real-World Implementations
419
-
420
- - **Alibaba:** 87 videos processed, Whisper.cpp averaged <90 seconds per 10-min clip
421
- - **Phantra (GitHub):** YouTube Transcript API → GPT-4o multi-agent system
422
- - **ytscript toolkit:** Transcript extraction → Claude/ChatGPT analysis
423
- - **Multiple RAG systems:** Transcript → embeddings → LLM Q&A
424
-
425
- ### Final Verdict
426
-
427
- ✅ Library choices validated
428
- ✅ Cost analysis accurate
429
- ✅ Performance estimates correct
430
- ✅ Architecture follows best practices
431
- ✅ ZeroGPU setup appropriate
432
-
433
- **No changes needed. Proceed with implementation.**
434
-
435
- ---
436
-
437
- ## Next Steps (Discussion → Implementation)
438
-
439
- 1. [x] Confirm approach based on video processing research ✅
440
- 2. [ ] Install youtube-transcript-api and openai-whisper
441
- 3. [ ] Create audio.py with @spaces.GPU decorator (unified Phase 1+2)
442
- 4. [ ] Create youtube.py with transcript extraction + audio fallback
443
- 5. [ ] Add tools to TOOLS registry
444
- 6. [ ] Manual test on question #3
445
- 7. [ ] Full evaluation
446
- 8. [ ] Verify 40% score (4/20 correct)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
pyproject.toml CHANGED
@@ -38,6 +38,9 @@ dependencies = [
38
  "tenacity>=9.1.2",
39
  "datasets>=4.4.0",
40
  "groq>=1.0.0",
 
 
 
41
  ]
42
 
43
  [tool.uv]
 
38
  "tenacity>=9.1.2",
39
  "datasets>=4.4.0",
40
  "groq>=1.0.0",
41
+ "opencv-python>=4.12.0.88",
42
+ "ipykernel>=7.1.0",
43
+ "pip>=25.3",
44
  ]
45
 
46
  [tool.uv]
requirements.txt CHANGED
@@ -43,7 +43,8 @@ pillow>=10.4.0 # Image files (JPEG, PNG, etc.)
43
  # Audio/Video processing (Phase 1: YouTube support)
44
  youtube-transcript-api>=0.6.0 # YouTube transcript extraction
45
  openai-whisper>=20231117 # Audio transcription ( Whisper)
46
- yt-dlp>=2024.0.0 # Audio extraction from videos
 
47
 
48
  # ============================================================================
49
  # Existing Dependencies (from current app.py)
 
43
  # Audio/Video processing (Phase 1: YouTube support)
44
  youtube-transcript-api>=0.6.0 # YouTube transcript extraction
45
  openai-whisper>=20231117 # Audio transcription ( Whisper)
46
+ yt-dlp>=2024.0.0 # Audio/video extraction from YouTube
47
+ opencv-python>=4.8.0 # Frame extraction from video
48
 
49
  # ============================================================================
50
  # Existing Dependencies (from current app.py)
src/agent/llm_client.py CHANGED
@@ -60,6 +60,7 @@ logger = logging.getLogger(__name__)
60
  # ============================================================================
61
 
62
  _SESSION_LOG_FILE = None
 
63
 
64
 
65
  def get_session_log_file() -> Path:
@@ -78,25 +79,23 @@ def get_session_log_file() -> Path:
78
  log_dir = Path("_log")
79
  log_dir.mkdir(exist_ok=True)
80
 
81
- # Create session filename with timestamp
82
  timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
83
- _SESSION_LOG_FILE = log_dir / f"llm_session_{timestamp}.txt"
84
 
85
- # Write session header
86
  with open(_SESSION_LOG_FILE, "w", encoding="utf-8") as f:
87
- f.write("=" * 80 + "\n")
88
- f.write("LLM SYNTHESIS SESSION LOG\n")
89
- f.write("=" * 80 + "\n")
90
- f.write(f"Session Start: {datetime.datetime.now().isoformat()}\n")
91
- f.write("=" * 80 + "\n\n")
92
 
93
  return _SESSION_LOG_FILE
94
 
95
 
96
  def reset_session_log():
97
  """Reset session log file (for testing or new evaluation run)."""
98
- global _SESSION_LOG_FILE
99
  _SESSION_LOG_FILE = None
 
100
 
101
 
102
  # ============================================================================
@@ -1124,6 +1123,8 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
1124
 
1125
  def synthesize_answer_hf(question: str, evidence: List[str]) -> str:
1126
  """Synthesize factoid answer from evidence using HuggingFace Inference API."""
 
 
1127
  client = create_hf_client()
1128
 
1129
  # Format evidence
@@ -1166,32 +1167,37 @@ FINAL ANSWER: 3
1166
  Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""
1167
 
1168
  # ============================================================================
1169
- # SAVE LLM CONTEXT TO SESSION LOG - Single file per evaluation run
1170
  # ============================================================================
1171
  context_file = get_session_log_file()
 
1172
 
1173
- with open(context_file, "a", encoding="utf-8") as f:
1174
- f.write("\n" + "=" * 80 + "\n")
1175
- f.write("QUESTION START\n")
1176
- f.write("=" * 80 + "\n")
1177
- f.write(f"Timestamp: {datetime.datetime.now().isoformat()}\n")
1178
- f.write(f"Question: {question}\n")
1179
- f.write(f"Evidence items: {len(evidence)}\n")
1180
- f.write("\n" + "=" * 80 + "\n")
1181
- f.write("SYSTEM PROMPT:\n")
1182
- f.write("=" * 80 + "\n")
1183
- f.write(system_prompt)
1184
- f.write("\n" + "=" * 80 + "\n")
1185
- f.write("USER PROMPT:\n")
1186
- f.write("=" * 80 + "\n")
1187
- f.write(user_prompt)
1188
- f.write("\n" + "=" * 80 + "\n")
1189
- f.write("EVIDENCE ITEMS:\n")
1190
- f.write("=" * 80 + "\n")
1191
- for i, ev in enumerate(evidence):
1192
- f.write(f"\n--- Evidence {i+1}/{len(evidence)} ---\n")
1193
- f.write(ev)
1194
- f.write("\n" + "=" * 80 + "\n")
 
 
 
 
1195
 
1196
  messages = [
1197
  {"role": "system", "content": system_prompt},
@@ -1218,17 +1224,23 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
1218
 
1219
  logger.info(f"[synthesize_answer_hf] Answer: {answer}")
1220
 
1221
- # Append LLM response to session log (includes reasoning)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1222
  with open(context_file, "a", encoding="utf-8") as f:
1223
- f.write("\n" + "=" * 80 + "\n")
1224
- f.write("LLM RESPONSE (with reasoning):\n")
1225
- f.write("=" * 80 + "\n")
1226
- f.write(full_response)
1227
- f.write("\n" + "=" * 80 + "\n")
1228
- f.write(f"\nEXTRACTED FINAL ANSWER: {answer}\n")
1229
- f.write("=" * 80 + "\n")
1230
- f.write("QUESTION END\n")
1231
- f.write("=" * 80 + "\n")
1232
 
1233
  return answer
1234
 
 
60
  # ============================================================================
61
 
62
  _SESSION_LOG_FILE = None
63
+ _SYSTEM_PROMPT_WRITTEN = False
64
 
65
 
66
  def get_session_log_file() -> Path:
 
79
  log_dir = Path("_log")
80
  log_dir.mkdir(exist_ok=True)
81
 
82
+ # Create session filename with timestamp (use .md for Markdown)
83
  timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
84
+ _SESSION_LOG_FILE = log_dir / f"llm_session_{timestamp}.md"
85
 
86
+ # Write session header in Markdown
87
  with open(_SESSION_LOG_FILE, "w", encoding="utf-8") as f:
88
+ f.write("# LLM Synthesis Session Log\n\n")
89
+ f.write(f"**Session Start:** {datetime.datetime.now().isoformat()}\n\n")
 
 
 
90
 
91
  return _SESSION_LOG_FILE
92
 
93
 
94
  def reset_session_log():
95
  """Reset session log file (for testing or new evaluation run)."""
96
+ global _SESSION_LOG_FILE, _SYSTEM_PROMPT_WRITTEN
97
  _SESSION_LOG_FILE = None
98
+ _SYSTEM_PROMPT_WRITTEN = False
99
 
100
 
101
  # ============================================================================
 
1123
 
1124
  def synthesize_answer_hf(question: str, evidence: List[str]) -> str:
1125
  """Synthesize factoid answer from evidence using HuggingFace Inference API."""
1126
+ global _SYSTEM_PROMPT_WRITTEN
1127
+
1128
  client = create_hf_client()
1129
 
1130
  # Format evidence
 
1167
  Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""
1168
 
1169
  # ============================================================================
1170
+ # BUFFER QUESTION CONTEXT - Write complete block atomically after response
1171
  # ============================================================================
1172
  context_file = get_session_log_file()
1173
+ question_timestamp = datetime.datetime.now().isoformat()
1174
 
1175
+ # Build question header (include system prompt only on first question)
1176
+ system_prompt_section = ""
1177
+ if not _SYSTEM_PROMPT_WRITTEN:
1178
+ system_prompt_section = f"""
1179
+
1180
+ ## System Prompt (static - used for all questions)
1181
+
1182
+ ```text
1183
+ {system_prompt}
1184
+ ```
1185
+ """
1186
+ _SYSTEM_PROMPT_WRITTEN = True
1187
+
1188
+ question_header = f"""
1189
+ ## Question [{question_timestamp}]
1190
+
1191
+ **Question:** {question}
1192
+ **Evidence items:** {len(evidence)}
1193
+ {system_prompt_section}
1194
+
1195
+ ### Evidence & Prompt
1196
+
1197
+ ```text
1198
+ {user_prompt}
1199
+ ```
1200
+ """
1201
 
1202
  messages = [
1203
  {"role": "system", "content": system_prompt},
 
1224
 
1225
  logger.info(f"[synthesize_answer_hf] Answer: {answer}")
1226
 
1227
+ # ============================================================================
1228
+ # WRITE COMPLETE QUESTION BLOCK ATOMICALLY (header + response + end)
1229
+ # ============================================================================
1230
+ complete_block = f"""{question_header}
1231
+
1232
+ ### LLM Response
1233
+
1234
+ ```text
1235
+ {full_response}
1236
+ ```
1237
+
1238
+ **Extracted Answer:** `{answer}`
1239
+
1240
+ """
1241
+
1242
  with open(context_file, "a", encoding="utf-8") as f:
1243
+ f.write(complete_block)
 
 
 
 
 
 
 
 
1244
 
1245
  return answer
1246
 
src/tools/__init__.py CHANGED
@@ -82,7 +82,7 @@ TOOLS = {
82
  },
83
  "youtube_transcript": {
84
  "function": youtube_transcript,
85
- "description": "Extract transcript from YouTube video URLs (youtube.com, youtu.be, shorts). Use this tool FIRST when question mentions YouTube, video, or contains a YouTube URL. This tool handles video content by extracting the transcript (what is said/discussed in the video). Falls back to Whisper audio transcription if captions are unavailable. This is the ONLY tool that can process YouTube URLs directly.",
86
  "parameters": {
87
  "url": {
88
  "description": "YouTube video URL (youtube.com/watch?v=ID, youtu.be/ID, or shorts/ID format)",
 
82
  },
83
  "youtube_transcript": {
84
  "function": youtube_transcript,
85
+ "description": "Extract transcript from YouTube video URLs (youtube.com, youtu.be, shorts) OR analyze video frames visually. Use this tool FIRST when question mentions YouTube, video, or contains a YouTube URL. This tool handles video content in two modes: (1) Transcript mode extracts what is said/discussed via captions or Whisper fallback, (2) Frame mode extracts and analyzes video frames with vision models. Mode is controlled by YOUTUBE_MODE env variable. This is the ONLY tool that can process YouTube URLs directly.",
86
  "parameters": {
87
  "url": {
88
  "description": "YouTube video URL (youtube.com/watch?v=ID, youtu.be/ID, or shorts/ID format)",
src/tools/youtube.py CHANGED
@@ -1,23 +1,29 @@
1
  """
2
- YouTube Transcript Tool - Extract transcripts from YouTube videos
3
  Author: @mangobee
4
  Date: 2026-01-13
5
 
6
- Provides YouTube video transcript extraction:
7
- - Primary: youtube-transcript-api (instant, 1-3 seconds)
8
- - Fallback: yt-dlp audio extraction + Whisper transcription (30s-2min)
9
- - Handles various YouTube URL formats (watch, youtu.be, shorts)
10
- - Returns clean transcript text for LLM analysis
11
 
12
- Workflow:
13
  YouTube URL
14
  ├─ Has transcript? ✅ → Use youtube-transcript-api (instant)
15
  └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
16
 
 
 
 
 
 
 
17
  Requirements:
18
  - youtube-transcript-api: pip install youtube-transcript-api
19
  - yt-dlp: pip install yt-dlp
20
- - openai-whisper: pip install openai-whisper (via src.tools.audio)
 
 
21
  """
22
 
23
  import logging
@@ -39,6 +45,10 @@ YOUTUBE_PATTERNS = [
39
  AUDIO_FORMAT = "mp3"
40
  AUDIO_QUALITY = "128" # 128 kbps (sufficient for speech)
41
 
 
 
 
 
42
  # Temporary file cleanup
43
  CLEANUP_TEMP_FILES = True
44
 
@@ -54,7 +64,7 @@ logger = logging.getLogger(__name__)
54
 
55
  def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
56
  """
57
- Save transcript to log/ folder for debugging.
58
 
59
  Args:
60
  video_id: YouTube video ID
@@ -65,14 +75,15 @@ def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
65
  log_dir = Path("_log")
66
  log_dir.mkdir(exist_ok=True)
67
 
68
- cache_file = log_dir / f"{video_id}_transcript.txt"
69
  with open(cache_file, "w", encoding="utf-8") as f:
70
- f.write(f"# YouTube Transcript\n")
71
- f.write(f"# Video ID: {video_id}\n")
72
- f.write(f"# Source: {source}\n")
73
- f.write(f"# Length: {len(text)} characters\n")
74
- f.write(f"# Generated: {__import__('datetime').datetime.now().isoformat()}\n")
75
- f.write(f"\n{text}\n")
 
76
 
77
  logger.info(f"Transcript saved: {cache_file}")
78
  except Exception as e:
@@ -343,35 +354,329 @@ def transcribe_from_audio(video_url: str) -> Dict[str, Any]:
343
  }
344
 
345
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
346
  # ============================================================================
347
  # Main API Function
348
  # =============================================================================
349
 
350
- def youtube_transcript(url: str) -> Dict[str, Any]:
351
  """
352
- Extract transcript from YouTube video.
353
 
354
- Primary method: youtube-transcript-api (instant)
355
- Fallback method: Download audio + Whisper transcription (slower)
356
 
357
  Args:
358
  url: YouTube video URL (youtube.com, youtu.be, shorts)
 
359
 
360
  Returns:
361
  Dict with structure: {
362
- "text": str, # Transcript text
363
  "video_id": str, # Video ID
364
- "source": str, # "api" or "whisper"
365
- "success": bool, # True if transcription succeeded
366
  "error": str or None # Error message if failed
 
367
  }
368
 
369
  Raises:
370
- ValueError: If URL is not a valid YouTube URL
371
 
372
  Examples:
373
- >>> youtube_transcript("https://youtube.com/watch?v=dQw4w9WgXcQ")
374
  {"text": "Never gonna give you up...", "video_id": "dQw4w9WgXcQ", "source": "api", "success": True, "error": None}
 
 
 
375
  """
376
  # Validate URL and extract video ID
377
  video_id = extract_video_id(url)
@@ -386,26 +691,71 @@ def youtube_transcript(url: str) -> Dict[str, Any]:
386
  "error": f"Invalid YouTube URL: {url}"
387
  }
388
 
389
- logger.info(f"Processing YouTube video: {video_id}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
390
 
391
- # Try transcript API first (fast)
392
- result = get_youtube_transcript(video_id)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
393
 
394
- if result["success"]:
395
- logger.info(f"Transcript retrieved via API: {len(result['text'])} characters")
396
- # Log transcript to file for debugging
397
- logger.info(f"Transcript content: {result['text'][:200]}...")
398
  return result
399
 
400
- # Fallback to audio transcription (slow but works)
401
- logger.info(f"Transcript API failed, trying audio transcription...")
402
- result = transcribe_from_audio(url)
403
 
404
- if result["success"]:
405
- logger.info(f"Transcript retrieved via Whisper: {len(result['text'])} characters")
406
- # Log full transcript for debugging
407
- logger.info(f"Full transcript: {result['text']}")
408
- else:
409
- logger.error(f"All transcript methods failed for video: {video_id}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
410
 
411
- return result
 
1
  """
2
+ YouTube Video Analysis Tool - Extract transcripts or analyze frames from YouTube videos
3
  Author: @mangobee
4
  Date: 2026-01-13
5
 
6
+ Provides two modes for YouTube video analysis:
7
+ - Transcript Mode: youtube-transcript-api (instant, 1-3 seconds) or Whisper fallback
8
+ - Frame Mode: Extract video frames and analyze with vision models
 
 
9
 
10
+ Transcript Mode Workflow:
11
  YouTube URL
12
  ├─ Has transcript? ✅ → Use youtube-transcript-api (instant)
13
  └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
14
 
15
+ Frame Mode Workflow:
16
+ YouTube URL
17
+ ├─ Download video with yt-dlp
18
+ ├─ Extract N frames at regular intervals
19
+ └─ Analyze frames with vision models (summarize findings)
20
+
21
  Requirements:
22
  - youtube-transcript-api: pip install youtube-transcript-api
23
  - yt-dlp: pip install yt-dlp
24
+ - openai: pip install openai (via src.tools.audio)
25
+ - opencv-python: pip install opencv-python (for frame extraction)
26
+ - PIL: pip install Pillow (for image handling)
27
  """
28
 
29
  import logging
 
45
  AUDIO_FORMAT = "mp3"
46
  AUDIO_QUALITY = "128" # 128 kbps (sufficient for speech)
47
 
48
+ # Frame extraction settings
49
+ FRAME_COUNT = 6 # Number of frames to extract
50
+ FRAME_QUALITY = "worst" # YouTube-dl format quality for frame extraction (worst = faster download)
51
+
52
  # Temporary file cleanup
53
  CLEANUP_TEMP_FILES = True
54
 
 
64
 
65
  def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
66
  """
67
+ Save transcript to _log/ folder for debugging.
68
 
69
  Args:
70
  video_id: YouTube video ID
 
75
  log_dir = Path("_log")
76
  log_dir.mkdir(exist_ok=True)
77
 
78
+ cache_file = log_dir / f"{video_id}_transcript.md"
79
  with open(cache_file, "w", encoding="utf-8") as f:
80
+ f.write(f"# YouTube Transcript\n\n")
81
+ f.write(f"**Video ID:** {video_id}\n")
82
+ f.write(f"**Source:** {source}\n")
83
+ f.write(f"**Length:** {len(text)} characters\n")
84
+ f.write(f"**Generated:** {__import__('datetime').datetime.now().isoformat()}\n\n")
85
+ f.write(f"## Transcript\n\n")
86
+ f.write(f"{text}\n")
87
 
88
  logger.info(f"Transcript saved: {cache_file}")
89
  except Exception as e:
 
354
  }
355
 
356
 
357
+ # ============================================================================
358
+ # Frame Processing (Video Analysis Mode)
359
+ # =============================================================================
360
+
361
+ def download_video(url: str) -> Optional[str]:
362
+ """
363
+ Download video from YouTube using yt-dlp for frame extraction.
364
+
365
+ Args:
366
+ url: Full YouTube URL
367
+
368
+ Returns:
369
+ Path to downloaded video file or None if failed
370
+ """
371
+ try:
372
+ import yt_dlp
373
+
374
+ logger.info(f"Downloading video from: {url}")
375
+
376
+ # Create temp file for video
377
+ temp_dir = tempfile.gettempdir()
378
+ output_path = os.path.join(temp_dir, f"youtube_video_{os.getpid()}")
379
+
380
+ # yt-dlp options: video only, lowest quality (faster for frame extraction)
381
+ ydl_opts = {
382
+ 'format': f'best[ext=mp4]/best',
383
+ 'outtmpl': output_path,
384
+ 'quiet': True,
385
+ 'no_warnings': True,
386
+ }
387
+
388
+ with yt_dlp.YoutubeDL(ydl_opts) as ydl:
389
+ ydl.download([url])
390
+
391
+ # Find the downloaded file (yt-dlp adds extension)
392
+ for file in os.listdir(temp_dir):
393
+ if file.startswith(f"youtube_video_{os.getpid()}"):
394
+ actual_path = os.path.join(temp_dir, file)
395
+ size_mb = os.path.getsize(actual_path) / (1024 * 1024)
396
+ logger.info(f"Video downloaded: {actual_path} ({size_mb:.2f}MB)")
397
+ return actual_path
398
+
399
+ logger.error("Video file not found after download")
400
+ return None
401
+
402
+ except ImportError:
403
+ logger.error("yt-dlp not installed. Run: pip install yt-dlp")
404
+ return None
405
+ except Exception as e:
406
+ logger.error(f"Video download failed: {e}")
407
+ return None
408
+
409
+
410
+ def extract_frames(video_path: str, count: int = FRAME_COUNT) -> list:
411
+ """
412
+ Extract frames from video at regular intervals.
413
+
414
+ Args:
415
+ video_path: Path to video file
416
+ count: Number of frames to extract (default: FRAME_COUNT)
417
+
418
+ Returns:
419
+ List of (frame_path, timestamp) tuples
420
+ """
421
+ try:
422
+ import cv2
423
+
424
+ cap = cv2.VideoCapture(video_path)
425
+ total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
426
+ fps = cap.get(cv2.CAP_PROP_FPS)
427
+ duration = total_frames / fps if fps > 0 else 0
428
+
429
+ logger.info(f"Video: {total_frames} frames, {fps:.2f} FPS, {duration:.2f}s duration")
430
+
431
+ # Calculate frame indices at regular intervals
432
+ if total_frames <= count:
433
+ frame_indices = list(range(total_frames))
434
+ else:
435
+ interval = total_frames / count
436
+ frame_indices = [int(i * interval) for i in range(count)]
437
+
438
+ logger.info(f"Extracting {len(frame_indices)} frames at indices: {frame_indices[:3]}...")
439
+
440
+ frames = []
441
+ temp_dir = tempfile.gettempdir()
442
+
443
+ for idx, frame_idx in enumerate(frame_indices):
444
+ cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
445
+ ret, frame = cap.read()
446
+
447
+ if ret:
448
+ timestamp = frame_idx / fps if fps > 0 else 0
449
+ frame_path = os.path.join(temp_dir, f"frame_{os.getpid()}_{idx}.jpg")
450
+ cv2.imwrite(frame_path, frame)
451
+ frames.append((frame_path, timestamp))
452
+ logger.debug(f"Frame {idx}: {timestamp:.2f}s -> {frame_path}")
453
+ else:
454
+ logger.warning(f"Failed to extract frame at index {frame_idx}")
455
+
456
+ cap.release()
457
+ logger.info(f"Extracted {len(frames)} frames")
458
+ return frames
459
+
460
+ except ImportError:
461
+ logger.error("opencv-python not installed. Run: pip install opencv-python")
462
+ return []
463
+ except Exception as e:
464
+ logger.error(f"Frame extraction failed: {e}")
465
+ return []
466
+
467
+
468
+ def analyze_frames(frames: list, question: str = None) -> Dict[str, Any]:
469
+ """
470
+ Analyze video frames using vision models.
471
+
472
+ Args:
473
+ frames: List of (frame_path, timestamp) tuples
474
+ question: Optional question to ask about frames
475
+
476
+ Returns:
477
+ Dict with structure: {
478
+ "text": str, # Summarized analysis
479
+ "video_id": str, # Video ID (placeholder)
480
+ "source": str, # "frames"
481
+ "success": bool, # True if analysis succeeded
482
+ "error": str or None # Error message if failed
483
+ "frame_count": int, # Number of frames analyzed
484
+ }
485
+ """
486
+ from src.tools.vision import analyze_image
487
+
488
+ if not frames:
489
+ return {
490
+ "text": "",
491
+ "video_id": "",
492
+ "source": "frames",
493
+ "success": False,
494
+ "error": "No frames to analyze",
495
+ "frame_count": 0,
496
+ }
497
+
498
+ # Default question for frame analysis
499
+ if not question:
500
+ question = "Describe what you see in this frame. Include any visible text, objects, people, or actions."
501
+
502
+ try:
503
+ logger.info(f"Analyzing {len(frames)} frames with vision model...")
504
+
505
+ frame_analyses = []
506
+
507
+ for idx, (frame_path, timestamp) in enumerate(frames):
508
+ logger.info(f"Analyzing frame {idx + 1}/{len(frames)} at {timestamp:.2f}s...")
509
+
510
+ # Customize question with timestamp context
511
+ frame_question = f"This is frame {idx + 1} of {len(frames)} from a video at timestamp {timestamp:.2f} seconds. {question}"
512
+
513
+ try:
514
+ result = analyze_image(frame_path, frame_question)
515
+ answer = result.get("answer", "")
516
+
517
+ # Add timestamp context
518
+ frame_analyses.append(f"[Frame {idx + 1} @ {timestamp:.2f}s]\n{answer}")
519
+
520
+ logger.info(f"Frame {idx + 1} analyzed: {len(answer)} chars")
521
+
522
+ except Exception as e:
523
+ logger.warning(f"Frame {idx + 1} analysis failed: {e}")
524
+ frame_analyses.append(f"[Frame {idx + 1} @ {timestamp:.2f}s]\nAnalysis failed: {str(e)}")
525
+
526
+ # Cleanup frame files
527
+ if CLEANUP_TEMP_FILES:
528
+ for frame_path, _ in frames:
529
+ try:
530
+ os.remove(frame_path)
531
+ except Exception as e:
532
+ logger.warning(f"Failed to cleanup frame {frame_path}: {e}")
533
+
534
+ # Combine all frame analyses
535
+ combined_text = "\n\n".join(frame_analyses)
536
+
537
+ logger.info(f"Frame analysis complete: {len(combined_text)} chars total")
538
+
539
+ return {
540
+ "text": combined_text,
541
+ "video_id": "",
542
+ "source": "frames",
543
+ "success": True,
544
+ "error": None,
545
+ "frame_count": len(frames),
546
+ }
547
+
548
+ except Exception as e:
549
+ logger.error(f"Frame analysis failed: {e}")
550
+ return {
551
+ "text": "",
552
+ "video_id": "",
553
+ "source": "frames",
554
+ "success": False,
555
+ "error": f"Frame analysis failed: {str(e)}",
556
+ "frame_count": len(frames),
557
+ }
558
+
559
+
560
+ def process_video_frames(url: str, question: str = None, frame_count: int = FRAME_COUNT) -> Dict[str, Any]:
561
+ """
562
+ Download video, extract frames, and analyze with vision models.
563
+
564
+ Args:
565
+ url: Full YouTube URL
566
+ question: Optional question to ask about frames
567
+ frame_count: Number of frames to extract
568
+
569
+ Returns:
570
+ Dict with structure: {
571
+ "text": str, # Combined frame analyses
572
+ "video_id": str, # Video ID
573
+ "source": str, # "frames"
574
+ "success": bool, # True if processing succeeded
575
+ "error": str or None # Error message if failed
576
+ "frame_count": int # Number of frames analyzed
577
+ }
578
+ """
579
+ video_id = extract_video_id(url)
580
+
581
+ if not video_id:
582
+ return {
583
+ "text": "",
584
+ "video_id": "",
585
+ "source": "frames",
586
+ "success": False,
587
+ "error": "Invalid YouTube URL",
588
+ "frame_count": 0,
589
+ }
590
+
591
+ # Download video
592
+ video_file = download_video(url)
593
+
594
+ if not video_file:
595
+ return {
596
+ "text": "",
597
+ "video_id": video_id,
598
+ "source": "frames",
599
+ "success": False,
600
+ "error": "Failed to download video",
601
+ "frame_count": 0,
602
+ }
603
+
604
+ try:
605
+ # Extract frames
606
+ frames = extract_frames(video_file, frame_count)
607
+
608
+ if not frames:
609
+ return {
610
+ "text": "",
611
+ "video_id": video_id,
612
+ "source": "frames",
613
+ "success": False,
614
+ "error": "Failed to extract frames",
615
+ "frame_count": 0,
616
+ }
617
+
618
+ # Analyze frames
619
+ result = analyze_frames(frames, question)
620
+
621
+ # Cleanup temp video file
622
+ if CLEANUP_TEMP_FILES:
623
+ try:
624
+ os.remove(video_file)
625
+ logger.info(f"Cleaned up temp video: {video_file}")
626
+ except Exception as e:
627
+ logger.warning(f"Failed to cleanup temp video: {e}")
628
+
629
+ # Add video_id to result
630
+ result["video_id"] = video_id
631
+
632
+ return result
633
+
634
+ except Exception as e:
635
+ logger.error(f"Video frame processing failed: {e}")
636
+ return {
637
+ "text": "",
638
+ "video_id": video_id,
639
+ "source": "frames",
640
+ "success": False,
641
+ "error": f"Video processing failed: {str(e)}",
642
+ "frame_count": 0,
643
+ }
644
+
645
+
646
  # ============================================================================
647
  # Main API Function
648
  # =============================================================================
649
 
650
+ def youtube_analyze(url: str, mode: str = "transcript") -> Dict[str, Any]:
651
  """
652
+ Analyze YouTube video using transcript or frame processing mode.
653
 
654
+ Transcript Mode: Extract transcript (youtube-transcript-api or Whisper)
655
+ Frame Mode: Extract frames and analyze with vision models
656
 
657
  Args:
658
  url: YouTube video URL (youtube.com, youtu.be, shorts)
659
+ mode: Analysis mode - "transcript" (default) or "frames"
660
 
661
  Returns:
662
  Dict with structure: {
663
+ "text": str, # Transcript or frame analyses
664
  "video_id": str, # Video ID
665
+ "source": str, # "api", "whisper", or "frames"
666
+ "success": bool, # True if analysis succeeded
667
  "error": str or None # Error message if failed
668
+ "frame_count": int # Number of frames (frame mode only)
669
  }
670
 
671
  Raises:
672
+ ValueError: If URL is not valid or mode is invalid
673
 
674
  Examples:
675
+ >>> youtube_analyze("https://youtube.com/watch?v=dQw4w9WgXcQ", mode="transcript")
676
  {"text": "Never gonna give you up...", "video_id": "dQw4w9WgXcQ", "source": "api", "success": True, "error": None}
677
+
678
+ >>> youtube_analyze("https://youtube.com/watch?v=dQw4w9WgXcQ", mode="frames")
679
+ {"text": "[Frame 1 @ 0.00s]\nA man...", "video_id": "dQw4w9WgXcQ", "source": "frames", "success": True, "frame_count": 6, "error": None}
680
  """
681
  # Validate URL and extract video ID
682
  video_id = extract_video_id(url)
 
691
  "error": f"Invalid YouTube URL: {url}"
692
  }
693
 
694
+ # Validate mode
695
+ mode = mode.lower()
696
+ if mode not in ("transcript", "frames"):
697
+ logger.error(f"Invalid mode: {mode}")
698
+ return {
699
+ "text": "",
700
+ "video_id": video_id,
701
+ "source": "none",
702
+ "success": False,
703
+ "error": f"Invalid mode: {mode}. Valid: transcript, frames"
704
+ }
705
+
706
+ logger.info(f"Processing YouTube video: {video_id} (mode: {mode})")
707
+
708
+ # Route to appropriate processing mode
709
+ if mode == "frames":
710
+ # Frame processing mode
711
+ result = process_video_frames(url)
712
+ if result["success"]:
713
+ logger.info(f"Frame analysis complete: {result.get('frame_count', 0)} frames, {len(result['text'])} chars")
714
+ return result
715
 
716
+ else: # mode == "transcript"
717
+ # Transcript mode: Try API first, fallback to Whisper
718
+ result = get_youtube_transcript(video_id)
719
+
720
+ if result["success"]:
721
+ logger.info(f"Transcript retrieved via API: {len(result['text'])} characters")
722
+ logger.info(f"Transcript content: {result['text'][:200]}...")
723
+ return result
724
+
725
+ # Fallback to audio transcription (slow but works)
726
+ logger.info(f"Transcript API failed, trying audio transcription...")
727
+ result = transcribe_from_audio(url)
728
+
729
+ if result["success"]:
730
+ logger.info(f"Transcript retrieved via Whisper: {len(result['text'])} characters")
731
+ logger.info(f"Full transcript: {result['text']}")
732
+ else:
733
+ logger.error(f"All transcript methods failed for video: {video_id}")
734
 
 
 
 
 
735
  return result
736
 
 
 
 
737
 
738
+ # Backward compatibility wrapper that respects YOUTUBE_MODE environment variable
739
+ def youtube_transcript(url: str) -> Dict[str, Any]:
740
+ """
741
+ Wrapper for youtube_analyze that respects YOUTUBE_MODE environment variable.
742
+
743
+ This allows the agent to switch between transcript and frame modes
744
+ without changing the function signature used in the graph.
745
+
746
+ Mode selection:
747
+ - YOUTUBE_MODE env variable (set by UI): "transcript" or "frames"
748
+ - Default: "transcript" (backward compatible)
749
+
750
+ Args:
751
+ url: YouTube video URL
752
+
753
+ Returns:
754
+ Dict with structure from youtube_analyze()
755
+ """
756
+ # Read mode from environment variable (set by app.py UI)
757
+ mode = os.getenv("YOUTUBE_MODE", "transcript").lower()
758
+
759
+ logger.info(f"youtube_transcript called with YOUTUBE_MODE={mode}")
760
 
761
+ return youtube_analyze(url, mode=mode)