agentbee

Sleeping

mangubee commited on Jan 14

Commit

f1b095a

1 Parent(s): 3dcf523

Enhance YouTube video processing with transcript and frame analysis modes

- Updated logging standard to use Markdown format for session logs.
- Modified `run_and_submit_all` to include a new `video_mode` parameter for selecting YouTube processing mode (Transcript or Frames).
- Removed obsolete brainstorming document for YouTube transcript support.
- Added OpenCV and other dependencies for frame extraction in `pyproject.toml` and `requirements.txt`.
- Refactored `llm_client.py` to log session details in Markdown format.
- Implemented `youtube.py` to support both transcript extraction and frame analysis, with appropriate logging and error handling.
- Updated tool descriptions to reflect new functionality for analyzing video frames.
- Added backward compatibility for the `youtube_transcript` function to respect the `YOUTUBE_MODE` environment variable.

Files changed (9) hide show

CHANGELOG.md +348 -1078
CLAUDE.md +37 -6
app.py +13 -0
brainstorming_phase1_youtube.md +0 -446
pyproject.toml +3 -0
requirements.txt +2 -1
src/agent/llm_client.py +54 -42
src/tools/__init__.py +1 -1
src/tools/youtube.py +392 -42

CHANGELOG.md CHANGED Viewed

@@ -1,1311 +1,581 @@
 # Session Changelog
-## [2026-01-13] [Infrastructure] [COMPLETED] 3-Tier Folder Naming Convention
-**Problem:** Previous rename used `_` prefix for both runtime folders AND user-only folders, creating ambiguity.
-**Solution:** Implemented 3-tier naming convention to clearly distinguish folder purposes.
-**3-Tier Convention:**
-1. **User-only** (`user_*` prefix) - Manual use, not app runtime:
-   - `user_input/` - User testing files, not app input
-   - `user_output/` - User downloads, not app output
-   - `user_dev/` - Dev records (manual documentation)
-   - `user_archive/` - Archived code/reference materials
-2. **Runtime/Internal** (`_` prefix) - App creates, temporary:
-   - `_cache/` - Runtime cache, served via app download
-   - `_log/` - Runtime logs, debugging
-3. **Application** (no prefix) - Permanent code:
-   - `src/`, `test/`, `docs/`, `ref/` - Application folders
-**Folders Renamed:**
-- `_input/` → `user_input/` (user testing files)
-- `_output/` → `user_output/` (user downloads)
-- `dev/` → `user_dev/` (dev records)
-- `archive/` → `user_archive/` (archived materials)
-**Folders Unchanged (correct tier):**
-- `_cache/`, `_log/` - Runtime ✓
-- `src/`, `test/`, `docs/`, `ref/` - Application ✓
-**Updated Files:**
-- **test/test_phase0_hf_vision_api.py** - `Path("_output")` → `Path("user_output")`
-- **.gitignore** - Updated folder references and comments
-**Git Status:**
-- Old folders removed from git tracking
-- New folders excluded by .gitignore
-- Existing files become untracked
-**Result:** Clear 3-tier structure: user_*, _*, and no prefix
----
-## [2026-01-13] [Infrastructure] [COMPLETED] Runtime Folder Naming Convention - Underscore Prefix
-**Problem:** Folders `log/`, `output/`, and `input/` didn't clearly indicate they were runtime-only storage, making it unclear which folders are internal vs permanent.
-**Solution:** Renamed all runtime-only folders to use `_` prefix, following Python convention for internal/private.
-**Folders Renamed:**
-- `log/` → `_log/` (runtime logs, debugging)
-- `output/` → `_output/` (runtime results, user downloads)
-- `input/` → `_input/` (user testing files, not app input)
-**Rationale:**
-- `_` prefix signals "internal, temporary, not part of public API"
-- Consistent with Python convention (`_private`, `__dunder__`)
-- Distinguishes runtime storage from permanent project folders
-- `_cache/` already followed this convention ✓
-**Updated Files:**
-- **src/agent/llm_client.py** - `Path("log")` → `Path("_log")`
-- **src/tools/youtube.py** - `Path("log")` → `Path("_log")`
-- **test/test_phase0_hf_vision_api.py** - `Path("output")` → `Path("_output")`
-- **.gitignore** - Updated folder references
-**Git Status:**
-- Old folders removed from git tracking
-- New folders excluded by .gitignore
-- Existing files in those folders become untracked
-**Result:** Clear separation between runtime storage (`_` prefix) and permanent project folders (no prefix)
----
-## [2026-01-13] [Infrastructure] [COMPLETED] Session Log Consolidation - Single File Per Run
-**Problem:** Each question created a separate log file (`llm_context_TIMESTAMP.txt`), polluting the log/ folder with 20+ files per evaluation.
-**Solution:** Implemented session-level log file - all questions append to single file per run.
-**Implementation:**
-1. **Added session log management** (`llm_client.py`)
-   - Module-level `_SESSION_LOG_FILE` variable
-   - `get_session_log_file()` - Creates/reuses session log file
-   - `reset_session_log()` - For testing/new runs
-2. **Changed log file naming**
-   - Old: `log/llm_context_YYYYMMDD_HHMMSS.txt` (per question)
-   - New: `log/llm_session_YYYYMMDD_HHMMSS.txt` (per evaluation run)
-3. **Updated log format**
-   - Added session header with start time
-   - Each question wrapped in `QUESTION START` / `QUESTION END` markers
-   - All questions append to same file
 **Modified Files:**
-- **src/agent/llm_client.py** (~50 lines modified)
-  - Added session log management functions
-  - Updated `synthesize_answer_hf()` to use session log
-  - Added imports: `datetime`, `Path`
-**Result:** Single log file per evaluation instead of 20+ files
----
-## [2026-01-13] [Stage 1: YouTube Support] [MILESTONE] 30% Target Achieved!
-**Score:** 30% (6/20 correct) - **First time hitting course target! 🎉**
-**Phase 1 Impact - YouTube + Audio Support:**
-- **Before:** 10% (2/20 correct)
-- **After:** 30% (6/20 correct)
-- **Improvement:** +20% (+4 questions fixed)
-**Questions Fixed by Phase 1:**
-1. a1e91b78: YouTube bird species (3) ✓ - youtube_transcript + Whisper
-2. 9d191bce: YouTube Teal'c quote (Extremely) ✓ - youtube_transcript + Whisper
-3. 99c9cc74: Strawberry pie MP3 (ingredients) ✓ - transcribe_audio (Whisper)
-4. 1f975693: Calculus MP3 (page numbers) ✓ - transcribe_audio (Whisper)
-**Remaining Issues:**
-- 3 system errors (vision NoneType, .py execution, calculator)
-- 10 "Unable to answer" (search evidence extraction issues)
-**Next Priority:**
-- Fix system errors (vision tool, Python execution)
-- Improve search answer extraction
-- Consider Phase 2.5 improvements
----
-## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Chain of Thought for LLM Synthesis Debugging
-**Problem:** LLM returns "Unable to answer" with no reasoning. Can't debug why synthesis fails despite having complete transcript evidence.
-**Solution:** Implemented Chain of Thought (CoT) format - LLM now provides reasoning before final answer.
-**Response Format:**
-```
-REASONING: [Step-by-step thought process]
-- What information is in the evidence?
-- What is the question asking for?
-- How do you extract the answer?
-- Any ambiguities or uncertainties?
-FINAL ANSWER: [Factoid answer]
-```
 **Implementation:**
-1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
-   - Request two-part response: REASONING + FINAL ANSWER
-   - Clear examples showing expected format
-   - Instructions for handling insufficient evidence
-2. **Increased max_tokens** from 256 → 1024
-   - Accommodate longer reasoning text
-   - Allow space for both reasoning and answer
-3. **Added parsing logic** to extract FINAL ANSWER
-   - Split response on "FINAL ANSWER:" delimiter
-   - Return only answer to agent (short for UI)
-   - Save full response (with reasoning) to log file
-4. **Enhanced log file format** (log/llm_context_TIMESTAMP.txt)
-   - Full LLM response with reasoning
-   - Extracted final answer
-   - Clear separation markers
-**Modified Files:**
-- **src/agent/llm_client.py** (~50 lines modified)
-  - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
-  - Updated `synthesize_answer_groq()` - Same changes
-  - Updated `synthesize_answer_claude()` - Same changes
-**Result:** Can now inspect LLM's thought process in log files to debug synthesis failures
----
-## [2026-01-13] [Infrastructure] [COMPLETED] Logging Standard - Console + File Separation
-**Problem:** Logs were too verbose (14k-16k tokens), making debugging difficult and expensive.
-**Solution:** Separated console output (status workflow) from detailed logs (file-based).
-**Console Output (Compressed):**
-- Status updates: `[plan] ✓ 660 chars`, `[execute] 1 tool(s) selected`, `[answer] ✓ 3`
-- Progress indicators: `[1/1] Processing a1e91b78`, `[1/20]` for batch
-- Success/failure: `✓` for success, `✗` for failure
-- File exports: `Context saved to: log/llm_context_*.txt`
-**Log Files (log/ folder):**
-- `llm_context_TIMESTAMP.txt` - Full LLM prompts, evidence, answers
-- `{video_id}_transcript.txt` - Raw transcripts from YouTube/Whisper
-- Purpose: Post-run analysis, context preservation, debugging
-**Modified Files:**
-- **app.py** (~4 lines) - Suppress httpx, urllib3, huggingface_hub, gradio logs to WARNING
-- **src/agent/graph.py** (~50 lines → ~15 lines) - Compressed node logs, removed separators
-- **src/agent/llm_client.py** (~20 lines) - Save LLM context to log/ folder
-- **src/tools/youtube.py** (2 lines) - Save transcripts to log/ folder
-- **CLAUDE.md** (+30 lines) - Document logging standard
-- **.gitignore** (+3 lines) - Exclude log/ folder
-**Global Rule Update (~/.claude/CLAUDE.md):**
-- Added `log/` to standard project structure (archive/, input/, output/, log/, test/, dev/)
-- Removed "logs/" from prohibited folders list
-- Updated folder purposes table with log/ entry
-**Result:** 16k tokens → ~6.7k tokens (58% reduction)
-**Standard Structure:**
 ```
-##_ProjectName/
-├── archive/    # Previous solutions, references
-├── input/      # Raw datasets, config files
-├── output/     # Execution results (gitignored)
-├── log/        # Runtime logs, LLM context (gitignored)
-├── test/       # Test files, data, configs
-├── dev/        # Dev records, problem solved
 ```
----
-## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation
-**Discovery:** HuggingFace Provider Suffix Behavior - Auto-Routing is Bad Practice
-**Finding:** Models WITHOUT `:provider` suffix work via HF auto-routing, but this is unreliable.
-**Test Result:**
-```python
-# Without provider - WORKS but uses HF default routing
-HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # ✅ Works, but...
-# Response: "Test successful."
-# With explicit provider - RECOMMENDED
-HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"  # ✅ Reliable
 ```
-**Why Auto-Routing is Bad Practice:**
-| Issue                         | Impact                                                          |
-| ----------------------------- | --------------------------------------------------------------- |
-| **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together)   |
-| **Inconsistent latency**      | 2s one run, 20s next run (different provider selected)          |
-| **No cost control**           | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
-| **Debugging nightmare**       | Can't reproduce issues when provider is unknown                 |
-| **Silent failures**           | Provider might be down, HF retries with different one           |
-**Best Practice: ALWAYS specify provider**
-```python
-# BAD - Unreliable
-HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"
-# GOOD - Explicit, predictable
-HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"
-HF_MODEL = "Qwen/Qwen2.5-72B-Instruct:cerebras"
-HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
 ```
-**Available Providers for Text Models:**
-- `:scaleway` - Fast, reliable (recommended for Llama)
-- `:cerebras` - Very fast (recommended for Qwen)
-- `:novita` - Fast, reputable
-- `:together` - Reliable
-- `:sambanova` - Fast but expensive
-**Action Taken:** Updated code to always use explicit `:provider` suffix
----
-## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration
-**Model Changes:**
-1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
-2. Llama 3.3 70B (Scaleway) → Failed synthesis
-3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing
-**openai/gpt-oss-120b:**
-- OpenAI's 120B parameter open source model
-- Strong reasoning capability
-- Optimized for function calling and tool use
----
-## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation (Original)
-**Problem:** Qwen 2.5 72B fails synthesis despite having complete transcript evidence (738 chars).
-**Root Cause Analysis:**
-- Transcript contains all 3 species: "giant petrel", "emperor", "adelie" (Whisper error: "deli")
-- Qwen 2.5 cannot resolve transcription errors ("deli" → "adelie penguin")
-- Qwen 2.5 weak at entity extraction + counting from noisy text
-- Returns "Unable to answer" instead of reasoning through ambiguity
-**Transcript Quality Assessment:**
-- **NOT clear enough for current LLM** - requires:
-  1. Error tolerance ("deli" → "adelie")
-  2. World knowledge (Antarctic bird species)
-  3. Entity extraction from narrative text
-  4. Temporal reasoning ("simultaneously" = same scene)
-**Answer from transcript:** 3 species (giant petrel, emperor penguin, adelie penguin)
-**Solution:** Upgrade to Llama 3.3 70B Instruct (Scaleway provider)
-- Better reasoning and instruction following
-- Stronger entity extraction from noisy context
-- Better at handling transcription ambiguities
-**Modified Files:**
-- **src/agent/llm_client.py** (line 37) - Model: Qwen 2.5 → Llama 3.3 70B
----
-## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Transcript Caching for Debugging
-**Problem:** Transcription works (738 chars from Whisper) but LLM returns "Unable to answer". Need to inspect raw transcript to debug synthesis failure.
-**Solution:** Added `save_transcript_to_cache()` function to save transcripts to `_cache/{video_id}_transcript.txt` for both API and Whisper paths.
-**Modified Files:**
-- **src/tools/youtube.py** (+30 lines)
-  - Added `save_transcript_to_cache()` function (lines 55-79)
-  - Calls after successful API transcript retrieval (line 164)
-  - Calls after successful Whisper transcription (line 317)
-  - File format includes metadata: video_id, source, length, timestamp
-**File Format:**
 ```
-# YouTube Transcript
-# Video ID: L1vXCYZAYYM
-# Source: whisper
-# Length: 738 characters
-# Generated: 2026-01-13T02:27:...
-<transcript text>
 ```
-**Next Steps:**
-- Test on question #3 (bird species) - inspect cached transcript
-- Debug LLM synthesis failure if transcript contains correct answer
----
-## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
-**Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
-**Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
 **Modified Files:**
-- **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
-- **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
-- **src/tools/**init**.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
-- **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
-- **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
-**Key Technical Decisions:**
-- **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
-- **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
-- **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
-- **Whisper model:** `small` (244MB) - best accuracy/speed balance on ZeroGPU (10-20s for 5-min video)
-- **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
-**Expected Impact:**
-- Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
-- Score: 10% → 40% (2/20 → 4/20 correct)
-- **Target achieved:** Exceeds 30% requirement (6/20)
----
-## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
-**Purpose:** Understand which parts of template are FIXED (course API contract) vs CAN MODIFY (our improvements).
-**Critical Finding:** Course API has a FIXED test setup - questions are NOT random.
-### Fixed (Course API Contract - DO NOT CHANGE)
-| Aspect                  | Value                                  | Cannot Change |
-| ----------------------- | -------------------------------------- | ------------- |
-| **API Endpoint**        | `agents-course-unit4-scoring.hf.space` | ❌            |
-| **Questions Route**     | `GET /questions`                       | ❌            |
-| **Submit Route**        | `POST /submit`                         | ❌            |
-| **Number of Questions** | **20** (always 20)                     | ❌            |
-| **Question Source**     | GAIA validation set, level 1           | ❌            |
-| **Randomness**          | **NO - Fixed set**                     | ❌            |
-| **Difficulty**          | All level 1 (easiest)                  | ❌            |
-| **Filter Criteria**     | By tools/steps complexity              | ❌            |
-| **Scoring**             | EXACT MATCH                            | ❌            |
-| **Target Score**        | 30% = 6/20 correct                     | ❌            |
-### The 20 Questions (ALWAYS the Same)
-| #   | Full Task ID                           | Description                    | Tools Required   |
-| --- | -------------------------------------- | ------------------------------ | ---------------- |
-| 1   | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator)  | Calculator       |
-| 2   | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination  | Web search       |
-| 3   | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species   | Video processing |
-| 4   | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count     | Web search       |
-| 5   | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote   | Video processing |
-| 6   | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity  | CSV file         |
-| 7   | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move  | Image analysis   |
-| 8   | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge        |
-| 9   | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character     | Web search       |
-| 10  | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe          | MP3 audio        |
-| 11  | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname    | Web search       |
-| 12  | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output             | Python execution |
-| 13  | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers  | MP3 audio        |
-| 14  | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number              | PDF processing   |
-| 15  | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city      | Web search       |
-| 16  | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count           | Web search       |
-| 17  | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search       |
-| 18  | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes        | Web search       |
-| 19  | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient    | Web search       |
-| 20  | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation   | Excel file       |
-**NOT random** - same 20 questions every submission!
-### Template Contract (MUST Preserve)
-```python
-# REQUIRED - Do NOT change
-questions_url = f"{api_url}/questions"  # Fixed route
-submit_url = f"{api_url}/submit"          # Fixed route
-submission_data = {
-    "username": username,
-    "agent_code": agent_code,
-    "answers": answers_payload            # Fixed format
-}
-```
-### Our Additions (SAFE to Modify)
-| Feature            | Purpose                | Required?   |
-| ------------------ | ---------------------- | ----------- |
-| Question Limit     | Debug: run first N     | ✅ Optional |
-| Target Task IDs    | Debug: run specific    | ✅ Optional |
-| ThreadPoolExecutor | Speed: concurrent      | ✅ Optional |
-| System Error Field | UX: error tracking     | ✅ Optional |
-| File Download (HF) | Feature: support files | ✅ Optional |
-### Key Learnings
-1. **Question set is FIXED** - not random, always same 20
-2. **API routes are FIXED** - cannot change endpoints
-3. **Submission format is FIXED** - must match exactly
-4. **Our additions are OPTIONAL** - debug/features we added
-5. **Original template is 8777 bytes** - ours is 32722 bytes (4x larger)
-**Reference:** `user_io/reference/project_template_original/app.py` for original structure
----
-## [2026-01-12] [Infrastructure] [COMPLETED] Original Template Reference Added
-**Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.
-**Process:**
-1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
-2. Removed git-specific files (`.git/` folder, `.gitattributes`)
-3. Copied to project as `user_io/reference/project_template_original/` (static reference, no git)
-4. Cleaned up temporary clone from Downloads
-**Why Static Reference:**
-- No `.git/` folder → won't interfere with project's git
-- No `.gitattributes` → clean file comparison
-- Pure reference material for diff/comparison
-- Can see exactly what changed from original
-**Template Original Contents:**
-- `app.py` (8777 bytes - original)
-- `README.md` (400 bytes - original)
-- `requirements.txt` (15 bytes - original)
-**Comparison Commands:**
-```bash
-# Compare file sizes
-ls -lh user_io/reference/project_template_original/app.py app.py
-# See differences
-diff user_io/reference/project_template_original/app.py app.py
-# Count lines added
-wc -l app.py user_io/reference/project_template_original/app.py
 ```
-**Created Files:**
-- **\user_io/reference/project_template_original/** (NEW) - Static reference to original template (3 files)
----
-## [2026-01-12] [Infrastructure] [COMPLETED] HuggingFace Space Renamed
-**Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.
-**Actions Taken:**
-1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
-2. Updated local git remote to point to new URL
-3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)
-4. Pulled from remote (sync after rename - already up to date)
-5. Pushed commits to renamed Space: `c86df49..41ac444`
-**Key Learnings:**
-- Local folder name ≠ git repo identity (can rename locally without affecting remote)
-- Git remote URL determines push destination (updated to `agentbee`)
-- HuggingFace Space name is independent of local folder name
-- All work preserved through rename process
-**Current State:**
-- Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
-- Remote: `mangubee/agentbee` (renamed on HuggingFace)
-- Sync: ✅ All changes pushed
-- Git: All commits synced
-- Template: `user_io/reference/project_template_original/` added for comparison
----
-## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification
-**Problem:** Confusion about which leaderboard we're submitting to. Mistakenly thought we needed to submit to official GAIA, but we're actually implementing the course assignment API.
-**Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.
-**Solution:** Created `docs/gaia_submission_guide.md` documenting:
-- **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
-- **Official GAIA Leaderboard** (future): 450+ questions, different submission format
-- API routes, submission formats, scoring differences
-- Development workflow for both
-**Key Clarifications:**
-| Aspect | Course | Official GAIA |
-|--------|--------|--------------|
-| API | `agents-course-unit4-scoring.hf.space` | `gaia-benchmark/leaderboard` Space |
-| Questions | 20 (level 1) | 450+ (all levels) |
-| Target | 30% (6/20) | Competitive placement |
-| Debug features | Target Task IDs, Question Limit | Must submit ALL |
-| Submission | JSON POST | File upload |
-**Created Files:**
-- **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards
 **Modified Files:**
-- **README.md** - Added note linking to submission guide
----
-## [2026-01-12] [Feature] [COMPLETED] Target Specific Task IDs
-**Problem:** No way to run specific questions for debugging. Had to run full evaluation or use "first N" limit, which is inefficient for targeted fixes.
-**Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.
-**Implementation:**
-- Added `eval_task_ids` textbox in UI (line 763-770)
-- Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
-- Filtering logic: Parses comma-separated IDs, filters `questions_data`
-- Shows missing IDs warning if task_id not found in dataset
-- Overrides question_limit when provided
-**Usage:**
-```
-Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
-```
 **Modified Files:**
-- **app.py** (~30 lines added)
-  - UI: `eval_task_ids` textbox
-  - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic
-  - `run_button.click()`: Pass task_ids to function
----
-## [2026-01-12] [Bug Fix] [COMPLETED] Calculator Threading Issue
-**Problem:** Calculator tool fails with `ValueError: signal only works in main thread of the main interpreter` when running in Gradio's ThreadPoolExecutor context.
-**Root Cause:** `signal.alarm()` only works in the main thread. Our agent uses `ThreadPoolExecutor` for concurrent processing (max_workers=5).
-**Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).
-**Modified Files:**
-- **src/tools/calculator.py** (~15 lines modified)
-  - `timeout()` context manager: Try/except for signal.alarm() failure
-  - Logs warning when timeout protection disabled
-  - Gracefully handles Windows (AttributeError for SIGALRM)
----
-## [2026-01-12] [Feature] [COMPLETED] System Error Field
-**Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.
-**Solution:** Changed to boolean `system_error: yes/no` field:
-- `system_error: yes` - Technical/system error from our code (don't submit)
-- `system_error: no` - AI response (submit answer, even if wrong)
-- Added `error_log` field with full error details for system errors
 **Implementation:**
-- `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
-- Results table: "System Error" column (yes/no), "Error Log" column (when yes)
-- JSON export: `system_error` field, `error_log` field (when system error)
-- Submission logic: Only submit when `system_error == "no"`
-**Modified Files:**
-- **app.py** (~30 lines modified)
-  - `a_determine_status()`: Returns tuple instead of string
-  - `process_single_question()`: Uses new format, adds `error_log`
-  - Results table: "System Error" + "Error Log" columns
-  - `export_results_to_json()`: Include `system_error` and `error_log`
----
-## [2026-01-12] [Refactoring] [COMPLETED] Fallback UI Removal
-**Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
-**Solution:** Removed all fallback-related UI elements:
-- Removed `enable_fallback_checkbox` from Test Question tab
-- Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
-- Removed `enable_fallback` parameter from `test_single_question()` function
-- Removed `enable_fallback` parameter from `run_and_submit_all()` function
-- Removed `ENABLE_LLM_FALLBACK` environment variable setting
-- Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
-**Modified Files:**
-- **app.py** (~20 lines removed)
-  - Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
-  - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
-  - Updated `test_button.click()` inputs to remove checkbox reference
-  - Updated `run_button.click()` inputs to remove checkbox reference
----
-## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
-**Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
-- 4 providers to test for each feature
-- Complex debugging with multiple code paths
-- Longer, less clear error messages
-- Adding complexity without clear benefit
-**Solution:** Archive fallback mechanism, use single provider only
-- Removed fallback provider loop (Gemini → HF → Groq → Claude)
-- Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
-- If provider fails, error is raised immediately
-- Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
-**Benefits:**
-- ✅ Reduced code complexity
-- ✅ Faster debugging (one code path)
-- ✅ Clearer error messages
-- ✅ No double work on features
-**Modified Files:**
-- **src/agent/llm_client.py** (~25 lines removed)
-  - Simplified `_call_with_fallback()`: Removed fallback logic
-- **dev/dev_260112_02_fallback_archived.md** (NEW)
-  - Archived fallback code documentation
-  - Migration guide for restoration if needed
----
-## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Search Results Not Being Extracted
-**Problem:** Score dropped from 5% → 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
-**Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
-```python
-{"results": [...], "source": "tavily", "query": "...", "count": N}
 ```
-**Solution:** Handle both dict formats in evidence extraction:
-```python
-if isinstance(result, dict):
-    if "answer" in result:
-        evidence.append(result["answer"])  # Vision tools
-    elif "results" in result:
-        # Format search results as readable text
-        results_list = result.get("results", [])
-        formatted = []
-        for r in results_list[:3]:
-            formatted.append(f"Title: {title}\nURL: {url}\nSnippet: {snippet}")
-        evidence.append("\n\n".join(formatted))  # Search tools
 ```
-**Modified Files:**
-- **src/agent/graph.py** (~40 lines modified)
-  - Updated evidence extraction in primary path
-  - Updated evidence extraction in fallback path
-**Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
-**Summary of Fixes (Session 2026-01-12):**
-1. ✅ File download from HF dataset (5/5 files)
-2. ✅ Absolute paths from script location
-3. ✅ Evidence formatting for vision tools (dict → answer)
-4. ✅ Evidence formatting for search tools (dict → formatted text)
----
-## [2026-01-12] [Evidence Formatting Fix] [COMPLETED] Dict Results Not Being Extracted
-**Problem:** Chess vision question returned "Unable to answer" even though vision tool correctly extracted the chess position.
-**Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
-**Solution:** Extract 'answer' field from dict results before adding to evidence:
-```python
-# Before
-evidence.append(f"[{tool_name}] {result}")  # Dict → string representation
-# After
-if isinstance(result, dict) and "answer" in result:
-    evidence.append(result["answer"])  # Extract answer field
-elif isinstance(result, str):
-    evidence.append(result)
-```
-**Modified Files:**
-- **src/agent/graph.py** (~15 lines modified)
-  - Updated `execute_node()`: Extract 'answer' from dict results
-  - Fixed both primary and fallback execution paths
-**Test Result:** Simple search questions now work. Chess question still fails due to vision tool extracting wrong turn indicator (w instead of b).
-**Known Issue:** Vision tool extracts "w - - 0 1" (White's turn) but question asks for Black's move. Ground truth is "Rd5" (Black move), so FEN extraction may have error.
----
-## [2026-01-12] [File Download Fix] [COMPLETED] Absolute Path Fix - Vision Tool Now Works
-**Problem:** Chess vision question returned "Unable to answer" even though file was downloaded successfully.
-**Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
-**Solution:** Return absolute paths from `download_task_file()`
-- Changed: `target_path = os.path.join(save_dir, file_name)`
-- To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
-- Now tools can find files regardless of working directory
 **Modified Files:**
-- **app.py** (~3 lines modified)
-  - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
-**Test Result:** Vision tool now works with absolute path - correctly analyzes chess position
----
-## [2026-01-12] [File Download Fix] [COMPLETED] GAIA File API Dead End - Switch to HF Dataset
-**Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
 **Investigation:**
-- Checked API spec: Endpoint exists with proper documentation
-- Tested download: HTTP 404 "No file path associated with task_id"
-- Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
-- Confirmed via Swagger UI: Same 404 error
-**Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
-**Solution:** Switch from evaluation API to GAIA dataset download
-- Use `huggingface_hub.hf_hub_download()` to fetch files
-- Download to `_cache/gaia_files/` (runtime cache)
-- File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
-- Added cache checking (reuse downloaded files)
-**Files with attachments (5/20 questions):**
-- `cca530fc`: Chess position image (.png)
-- `99c9cc74`: Pie recipe audio (.mp3)
-- `f918266a`: Python code (.py)
-- `1f975693`: Calculus audio (.mp3)
-- `7bd855d8`: Menu sales Excel (.xlsx)
-**Modified Files:**
-- **app.py** (~70 lines modified)
-  - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
-  - Changed signature: `download_task_file(task_id, file_name, save_dir)`
-  - Added `huggingface_hub` import with cache checking
-  - Default directory: `_cache/gaia_files/` (runtime cache, not git)
-  - Flat file structure: `_cache/gaia_files/{file_name}`
-- **app.py** (~5 lines modified)
-  - Updated `process_single_question()`: Pass `file_name` to download function
-**Known Limitations:**
-- Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
-- `.mp3` audio files still unsupported
-- `.py` code execution still unsupported
-**Next Steps:**
-1. Test new download implementation
-2. Expand tool support for .mp3 (audio transcription)
-3. Expand tool support for .py (code execution)
----
-## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision LLM Validated - Ready for GAIA
-**Problem:** Need to validate HF vision works before complex GAIA evaluation.
-**Solution:** Single smoke test with simple red square image.
-**Result:** ✅ PASSED
-- Model: `google/gemma-3-27b-it:scaleway`
-- Answer: "The image is a solid, uniform field of red color..."
-- Provider routing: Working correctly
-- Settings integration: Fixed
-**Modified Files:**
-- **src/config/settings.py** (~5 lines added)
-  - Added `HF_TOKEN` and `HF_VISION_MODEL` config
-  - Added `hf_token` and `hf_vision_model` to Settings class
-  - Updated `validate_api_keys()` to include huggingface
-- **test/test_smoke_hf_vision.py** (NEW - ~50 lines)
-  - Simple smoke test script
-  - Tests basic image description
-**Bug Fixes:**
-- Removed unsupported `timeout` parameter from `chat_completion()`
-**Next Steps:** Phase 3 - GAIA evaluation with HF vision
----
-## [2026-01-11] [Phase 1: Implementation] [COMPLETED] HF Vision Integration - Routing Fixed
-**Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.
-**Solution:**
-- Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
-- Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
-- Each provider fails independently (NO fallback chains during testing)
-**Modified Files:**
-- **src/tools/vision.py** (~120 lines added/modified)
-  - Added `analyze_image_hf()` function with retry logic
-  - Updated `analyze_image()` routing with provider selection
-  - Added HF_VISION_MODEL and HF_TIMEOUT config
-- **.env.example** (~4 lines added)
-  - Documented HF_TOKEN and HF_VISION_MODEL settings
-**Validated Models (Phase 0 Extended Testing):**
-| Rank | Model                            | Provider | Speed | Notes                          |
-| ---- | -------------------------------- | -------- | ----- | ------------------------------ |
-| 1    | `google/gemma-3-27b-it`          | Scaleway | ~6s   | **RECOMMENDED** - Google brand |
-| 2    | `CohereLabs/aya-vision-32b`      | Cohere   | ~7s   | Fast, less known brand         |
-| 3    | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita   | ~14s  | Qwen brand, reputable          |
-| 4    | `zai-org/GLM-4.6V-Flash`         | zai-org  | ~16s  | Zhipu AI brand                 |
-**Failed Models (not vision-capable):**
-- `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
-- `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
-- `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")
-- `moonshotai/Kimi-K2-Instruct-0905:novita` - 400 Bad request
-**Next Steps:** Smoke tests (Phase 2) to validate integration
----
-## [2026-01-11] [Phase 0 Extended] [COMPLETED] Additional Vision Models Tested - Google Gemma 3 Selected
-**Problem:** Needed to find more reputable vision models (aya-vision-32b brand unknown to user).
-**Solution:** Tested user-requested models with provider routing.
-**Test Results:**
-**Working Models:**
-- `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
-- `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
-- `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand
-**Failed Models:**
-- `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
-- `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
-- `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
-- `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request
-**Output Files:**
-- `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
-- `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
-- `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test
-- `output/phase0_vision_validation_20260111_164945.json` - Gemma-3-27B test
-**Decision:** Use `google/gemma-3-27b-it:scaleway` for production (fastest, most reputable brand)
----
-## [2026-01-07] [Phase 0: API Validation] [COMPLETED] HF Inference Vision Support - GO Decision
-**Problem:** Needed to validate HF Inference API supports vision models before implementation.
----
-### Knowledge Updates
-**Solution - Phase 0 Validation Results:**
-**✅ GO Decision - Proceed to Phase 1**
-**Final Working Model (Recommended):**
-- **CohereLabs/aya-vision-32b** (32B params, Cohere provider) - ✅ **PRODUCTION READY**
-  - Handles small images (1KB base64): ~1-3 seconds
-  - Handles large images (2.8MB base64): ~10 seconds, no timeout
-  - Excellent quality: Detailed scene understanding, object identification, spatial relationships
-  - Sample response on workspace image: "The image depicts a serene workspace setup on a wooden desk...white ceramic mug filled with dark liquid...silver laptop...rolled-up paper secured with rubber band..."
-**Partially Working Models (Timeout Issues with Large Images):**
-1. **Qwen/Qwen3-VL-8B-Instruct** (8B params, Novita provider) - ⚠️ Conditionally working
-   - Small images (1KB): ✅ Works
-   - Large images (2.8MB): ❌ 504 Gateway Timeout (>120 seconds)
-   - Only works with models that have `?inference_provider=` in URL
-2. **baidu/ERNIE-4.5-VL-424B-A47B-Base-PT** (424B params, Novita provider) - ⚠️ Conditionally working
-   - Small images (1KB): ✅ Works
-   - Large images (2.8MB): ❌ 504 Gateway Timeout (>120 seconds)
-**Failed Models:**
-1. `deepseek-ai/DeepSeek-OCR` - Not exposed via Inference API (requires local GPU)
-   - Attempted both chat_completion and image_to_text endpoints
-   - Error: "Task 'image-to-text' not supported for provider 'novita'"
-   - Solution: Must use transformers library locally (not serverless API)
-2. `CohereLabs/command-a-vision-07-2025` - 429 rate limit (try later)
-3. `zai-org/GLM-4.1V-9B-Thinking` - Provider doesn't support model
-4. `microsoft/Phi-3.5-vision-instruct` - Not enabled for serverless
-5. `meta-llama/Llama-3.2-11B-Vision-Instruct` - Not enabled for serverless
-6. `Qwen/Qwen2-VL-72B-Instruct` - Not enabled for serverless
-**Working Format:** Base64 encoding only
-- ✅ Base64: Works for all providers
-- ❌ File path (file:// URL): Failed with 400 Bad Request
-- ❌ Direct image parameter: API doesn't support
-**Critical Discovery - Large Image Handling:**
-| Model                | Small Image (1KB) | Large Image (2.8MB) | Recommendation               |
-| -------------------- | ----------------- | ------------------- | ---------------------------- |
-| aya-vision-32b       | ✅ 1-3s           | ✅ ~10s             | **Use for production**       |
-| Qwen3-VL-8B-Instruct | ✅ 1-3s           | ❌ >120s timeout    | Use with image preprocessing |
-| ERNIE-4.5-VL-424B    | ✅ 1-3s           | ❌ >120s timeout    | Use with image preprocessing |
-**API Behavior:**
-- Response format: Standard chat completion with content field
-- Rate limits: 429 possible (command-a-vision hit this)
-- Errors: Clear error messages in JSON format
-- Latency: 1-3 seconds for small images, 10 seconds for large images (aya only)
-- Timeout: 120 seconds default (Novita times out on large images)
-**Key Learning - Inference Provider Pattern:**
-- Models with `?inference_provider=PROVIDER` in URL ARE accessible via serverless API
-- Example: `huggingface.co/Qwen/Qwen3-VL-8B-Instruct?inference_provider=novita` ✅
-- Models without provider parameter (DeepSeek-OCR) require local deployment
-**Recommendation for Phase 1:**
-- Primary: `CohereLabs/aya-vision-32b` (handles all image sizes, Cohere provider reliable)
-- Format: Base64 encode images in messages array
-- Consider image preprocessing (resize/compress) for non-Cohere providers
-- Set 120+ second timeouts for large images
-**HF Pro Account Context:**
-- Free accounts: $0.10/month credits, NO pay-as-you-go
-- Pro accounts ($9/mo): $2.00/month credits, CAN use pay-as-you-go after credits
-- Provider charges pass-through directly (no HF markup)
-- Pro required for production workloads with uninterrupted access
 **Next Steps:**
-- Phase 1: Implement `analyze_image_hf()` using aya-vision-32b
-- Phase 1: Fix vision tool routing to respect `LLM_PROVIDER`
-- Phase 1: Add image preprocessing for large files (resize if >1MB)
-**Test Images:**
-- `test/fixtures/test_image_red_square.jpg` - Simple test image (825 bytes)
-- `test/fixtures/test_image_real.png` - Complex workspace photo (2.1MB file, 2.8MB base64)
----
-### Code Changes
-**Modified Files:**
-- **test/test_phase0_hf_vision_api.py** (NEW - ~400 lines)
-  - Phase 0 validation script
-  - Tests multiple vision models
-  - Tests multiple image formats
-  - Exports results to JSON
-  - OCR model testing support (image_to_text endpoint)
-**Output Files:**
-- **output/phase0_vision_validation_20260107_174401.json** - Initial test (red square image)
-- **output/phase0_vision_validation_20260107_174146.json** - First attempt (no models worked)
-- **output/phase0_vision_validation_20260107_182113.json** - DeepSeek-OCR test
-- **output/phase0_vision_validation_20260107_182155.json** - Qwen3-VL discovery
-- **output/phase0_vision_validation_20260107_184839.json** - Real image test (workspace photo)
-**Next Steps:**
-- Phase 1: Implement `analyze_image_hf()` using aya-vision-32b
-- Phase 1: Fix vision tool routing to respect `LLM_PROVIDER`
-- Phase 1: Add image preprocessing for large files (resize if >1MB)
-**Test Images:**
-- `test/fixtures/test_image_red_square.jpg` - Simple test image (825 bytes)
-- `test/fixtures/test_image_real.png` - Complex workspace photo (2.1MB file, 2.8MB base64)
----
-## [2026-01-06] [Plan Revision] [COMPLETED] HuggingFace Vision Integration Plan - Corrected Architecture
-**Problem:** Initial plan had critical gaps that would waste implementation time:
-- Missing Phase 0 API validation (could implement non-functional approach)
-- Included fallback logic during testing (defeats isolation purpose)
-- Wrong model selection order (large → small, should be small → large)
-- No smoke tests before GAIA (would debug complex questions with broken integration)
-- Premature cost optimization
-**Solution - Plan Corrections Applied:**
-1. **Added Phase 0: API Validation (CRITICAL)**
-   - Test HF Inference API with vision models BEFORE implementation
-   - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
-   - Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
-   - Time saved: Prevents 2-3 hours implementing non-functional code
-2. **Removed Fallback Logic from Testing**
-   - Each provider fails independently with clear error message
-   - NO fallback chains (HF → Gemini → Claude) during testing
-   - Philosophy: Build capability knowledge, don't hide problems
-   - Log exact failure reasons for debugging
-3. **Added Smoke Tests (Phase 2)**
-   - 4 tests before GAIA: description, OCR, counting, single GAIA question
-   - Decision gate: ≥3/4 must pass before full evaluation
-   - Prevents debugging chess positions when basic integration broken
-4. **Added Decision Gates**
-   - Gate 1 (Phase 0): API validation → GO/NO-GO
-   - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
-   - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate
-5. **Added Backup Strategy Documentation**
-   - Option C: HF Spaces deployment (custom endpoint)
-   - Option D: Local transformers library (no API)
-   - Option E: Hybrid (HF text + Gemini/Claude vision)
-6. **Separate Results Per Provider**
-   - Export format: `gaia_results_hf_TIMESTAMP.json` (HF only)
-   - Build capability matrix: which provider for which tasks
-   - No combined/fallback results during testing
-**Modified Files:**
-- **PLAN.md** (~200 lines restructured)
-  - Phase 0: API Validation (NEW)
-  - Phase 1: Implementation (revised - no fallbacks)
-  - Phase 2: Smoke Tests (NEW)
-  - Phase 3: GAIA Evaluation (revised)
-  - Phase 4: Media Processing (YouTube, audio)
-  - Phase 5: Groq Integration (future)
-  - Phase 6: Final Verification
-  - Added: Backup Strategy Options section
-  - Added: Decision Gates Summary section
-  - Updated: Files to Modify (10 files total)
-  - Updated: Success Criteria (per-phase)
-**Key Changes Summary:**
-| Before                        | After                               |
-| ----------------------------- | ----------------------------------- |
-| Jump to implementation        | Phase 0: Validate API first         |
-| Fallback chains               | No fallbacks, fail independently    |
-| Large models first (Qwen2-VL) | Small models first (Phi-3.5)        |
-| Direct to GAIA                | Smoke tests → GAIA                  |
-| No backup plan                | 3 backup options documented         |
-| Single success criteria       | Per-phase criteria + decision gates |
-**Benefits:**
-- ✅ Prevents wasted implementation time on non-functional approach
-- ✅ Clear debugging with isolated provider failures
-- ✅ Faster iteration with small models
-- ✅ Risk mitigation with decision gates
-- ✅ Backup options if HF API doesn't support vision
-**Next Steps:** Proceed to Phase 0 (API validation) when implementation starts
----
-## [2026-01-06] [Stage 5 Investigation] [COMPLETED] Vision Tool Ignores UI LLM Selection - Root Cause of 0% Accuracy
-**Problem:** Stage 5 claimed 25% accuracy (5/20 correct) but actual results show 0% accuracy (0/20 correct). User selected HuggingFace in UI but vision questions still failing.
-**Investigation Findings:**
-**Ground Truth Analysis (output/gaia_results_20260105_203102.json):**
-- Actual score: 0% (0/20 correct) - complete failure
-- Stage 5 dev record claimed: 25% (5/20 correct) - false success claim
-- Regression from baseline 10% → 0%
-**Failure Pattern Breakdown:**
-1. **Vision tool failures:** 40% of questions (8/20)
-   - Error: "Vision analysis failed - Gemini and Claude both failed"
-   - Questions: Chess position, YouTube videos, audio file parsing
-2. **Calculator threading error:** 5% of questions (1/20)
-   - Error: "ValueError: signal only works in main thread of the main interpreter"
-   - Root cause: `signal.alarm()` doesn't work in Gradio async context
-3. **Wrong answers:** 55% of questions (11/20)
-   - Tools work, but answer synthesis produces incorrect factoids
-   - Example: Mercedes Sosa albums - submitted "4", correct "3"
-**Root Cause - Vision Tool Bug:**
-**Critical bug in `src/tools/vision.py:303-339`:**
-- Vision tool HARDCODED to always try Gemini → Claude fallback
-- Never checks `os.getenv("LLM_PROVIDER")` setting
-- Ignores UI LLM selection completely
-- Other tools (planning, tool selection, synthesis) correctly respect UI selection
-**Code Evidence:**
-```python
-def analyze_image(image_path: str, question: Optional[str] = None) -> Dict:
-    # MISSING: No check for os.getenv("LLM_PROVIDER")
-    # HARDCODED: Always try Gemini first
-    if settings.google_api_key:
-        return analyze_image_gemini(image_path, question)
-    # HARDCODED: Always fallback to Claude
-    if settings.anthropic_api_key:
-        return analyze_image_claude(image_path, question)
-```
-**Impact:**
-- When user selects "HuggingFace" in UI:
-  - ✅ Planning uses HuggingFace
-  - ✅ Tool selection uses HuggingFace
-  - ❌ Vision still calls Gemini/Claude (ignores selection)
-  - Result: 40% of questions auto-fail due to Gemini/Claude quota exhaustion
-**Additional Issue:**
-- HuggingFace Inference API free tier doesn't support multimodal vision analysis
-- Even if bug fixed, HF can't handle vision questions
-**Modified Files:**
-- **NONE** (investigation only - no code changes yet)
-**Next Steps Identified:**
-1. Fix vision tool to respect `LLM_PROVIDER` setting
-2. Add proper error handling when HF selected for vision questions
-3. Fix calculator threading issue (`signal.alarm()` in async context)
-4. Improve answer synthesis prompts
-5. Add verification protocol: MUST verify claims with actual JSON output
-**Current Baseline:** 0% (need to fix regressions before optimizing)
-**Target:** 30% minimum (6/20 questions)
----
-## [2026-01-05] [Runtime Cache Folder] [COMPLETED] Eliminate exports/ Redundancy
-**Problem:**
-- Environment-dependent paths: `~/Downloads` (local) vs `./exports` (HF Spaces)
-- `exports/` folder name confusing - looked like user-facing folder
-- Files visible in HF UI when committed to git
-- User couldn't locate where files were saved
-**Solution:**
-- Single `_cache/` folder for all environments (local, HF Spaces)
-- Name clearly indicates internal runtime storage (not user-accessible via file browser)
-- Files served via app download button, not HF Spaces UI
-- Added to .gitignore to keep runtime files out of git
-**Modified Files:**
-- **app.py** (~10 lines modified)
-  - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
-  - Changed: `exports/` → `_cache/`
-  - Updated docstring: "All environments: Saves to ./\_cache/gaia_results_TIMESTAMP.json"
-  - Updated comment: "Save to \_cache/ folder (internal runtime storage, not accessible via HF UI)"
-- **.gitignore** (~3 lines added)
-  - Added `_cache/` to ignore list
-  - Added comment explaining runtime cache behavior
-**Benefits:**
-- ✅ Single location for all environments (no environment detection)
-- ✅ Clear naming indicates internal storage (not user-facing)
-- ✅ Files accessible via download button
-- ✅ Not visible in HF Spaces file browser
-- ✅ Not committed to git
-**File Lifecycle on HF Spaces:**
-- Files persist on server between runs (accumulate in `_cache/`)
-- Wiped clean on redeploy (container rebuild)
-- Standard container behavior: runtime storage is temporary
-- No manual cleanup needed (redeploy handles it)

 # Session Changelog
+## [2026-01-14] [Enhancement] [COMPLETED] Unified Log Format - Markdown Standard
+**Problem:** Inconsistent log formats across different components, wasteful `====` separators.
+**Solution:** Standardize all logs to Markdown format with clean structure.
+**Unified Log Standard:**
+```markdown
+# Title
+**Key:** value
+**Key:** value
+## Section
+Content
+```
+**Files Updated:**
+1. **LLM Session Logs** (`llm_session_*.md`):
+   - Header: `# LLM Synthesis Session Log`
+   - Questions: `## Question [timestamp]`
+   - Sections: `### Evidence & Prompt`, `### LLM Response`
+   - Code blocks: triple backticks
+2. **YouTube Transcript Logs** (`{video_id}_transcript.md`):
+   - Header: `# YouTube Transcript`
+   - Metadata: `**Video ID:**`, `**Source:**`, `**Length:**`
+   - Content: `## Transcript`
+**Note:** No horizontal rules (`---`) - already banned in global CLAUDE.md, breaks collapsible sections
+**Token Savings:**
+| Style             | Tokens per separator | 20 questions |
+| ----------------- | -------------------- | ------------ |
+| `====` x 80 chars | ~40 tokens           | ~800 tokens  |
+| `##` heading      | ~2 tokens            | ~40 tokens   |
+**Savings:** ~760 tokens per session (95% reduction)
+**Benefits:**
+- ✅ Collapsible headings in all Markdown editors
+- ✅ Consistent structure across all log files
+- ✅ Token-efficient for LLM processing
+- ✅ Readable in both rendered and plain text
+- ✅ `.md` extension for proper syntax highlighting
 **Modified Files:**
+- `src/agent/llm_client.py` (LLM session logs)
+- `src/tools/youtube.py` (transcript logs)
+- `CLAUDE.md` (added unified log format standard)
+## [2026-01-14] [Cleanup] [COMPLETED] Session Log Optimization - Reduce Static Content Redundancy
+**Problem:** System prompt (~30 lines) was written for every question (20x = 600 lines of redundant text).
+**Solution:** Write system prompt once on first question, skip for subsequent questions.
 **Implementation:**
+- Added `_SYSTEM_PROMPT_WRITTEN` flag to track if system prompt was logged
+- First question includes full SYSTEM PROMPT section
+- Subsequent questions only show dynamic content (question, evidence, response)
+**Log format comparison:**
+Before (every question):
 ```
+QUESTION START
+SYSTEM PROMPT: [30 lines repeated]
+USER PROMPT: [dynamic]
+LLM RESPONSE: [dynamic]
 ```
+After (first question):
 ```
+SYSTEM PROMPT (static - used for all questions): [30 lines]
+QUESTION [...]
+EVIDENCE & PROMPT: [dynamic]
+LLM RESPONSE: [dynamic]
 ```
+After (subsequent questions):
 ```
+QUESTION [...]
+EVIDENCE & PROMPT: [dynamic]
+LLM RESPONSE: [dynamic]
 ```
+**Result:** ~570 lines less redundancy per 20-question evaluation.
 **Modified Files:**
+- `src/agent/llm_client.py` (~30 lines modified - added flag, conditional logging)
+## [2026-01-14] [Bugfix] [COMPLETED] Session Log Synchronization - Atomic Per-Question Logging
+**Problem:** When processing multiple questions, LLM responses were written out of order relative to their questions, causing mismatched prompts/responses in session logs.
+**Root Cause:** `synthesize_answer_hf()` wrote QUESTION START immediately, but appended LLM RESPONSE later after API call completed. With concurrent processing, responses finished in different order.
+**Solution:** Buffer complete question block in memory, write atomically when response arrives:
+```python
+# Before (broken):
+write_question_start()  # immediate
+api_response = call_llm()
+write_llm_response()     # later, out of order
+# After (fixed):
+question_header = buffer_question_start()
+api_response = call_llm()
+complete_block = question_header + response + end
+write_atomic(complete_block)  # all at once
 ```
+**Result:** Each question block is self-contained, no mismatched prompts/responses.
 **Modified Files:**
+- `src/agent/llm_client.py` (~40 lines modified - synthesize_answer_hf function)
+## [2026-01-13] [Cleanup] [COMPLETED] LLM Session Log Format - Removed Duplicate Evidence
+**Problem:** Evidence appeared twice in session log - once in USER PROMPT section, again in EVIDENCE ITEMS section.
+**Solution:** Removed standalone EVIDENCE ITEMS section, kept evidence in USER PROMPT only.
+**Rationale:** USER PROMPT shows what's actually sent to the LLM (system + user messages together).
 **Modified Files:**
+- `src/agent/llm_client.py` - Removed duplicate logging section (lines 1189-1194 deleted)
+**Result:** Cleaner logs, no duplication
+## [2026-01-13] [Feature] [COMPLETED] YouTube Frame Processing Mode - Visual Video Analysis
+**Problem:** Transcript mode captures audio but misses visual information (objects, scenes, actions).
+**Solution:** Implemented frame extraction and vision-based video analysis mode.
 **Implementation:**
+**1. Frame Extraction (`src/tools/youtube.py`):**
+- `download_video()` - Downloads video using yt-dlp
+- `extract_frames()` - Extracts N frames at regular intervals using OpenCV
+- `analyze_frames()` - Analyzes frames with vision models
+- `process_video_frames()` - Complete frame processing pipeline
+- `youtube_analyze()` - Unified API with mode parameter
+**2. CONFIG Settings:**
+- `FRAME_COUNT = 6` - Number of frames to extract
+- `FRAME_QUALITY = "worst"` - Download quality (faster)
+**3. UI Integration (`app.py`):**
+- Added radio button: "YouTube Processing Mode"
+- Choices: "Transcript" (default) or "Frames"
+- Sets `YOUTUBE_MODE` environment variable
+**4. Updated Dependencies:**
+- `requirements.txt` - Added `opencv-python>=4.8.0`
+- `pyproject.toml` - Added via `uv add opencv-python`
+**5. Tool Description Update (`src/tools/__init__.py`):**
+- Updated `youtube_transcript` description to mention both modes
+**Architecture:**
 ```
+youtube_transcript() → reads YOUTUBE_MODE env
+                        ├─ "transcript" → audio/subtitle extraction
+                        └─ "frames" → video download → extract 6 frames → vision analysis
 ```
+**Test Result:**
+- Successfully processed video with 6 frames analyzed
+- Each frame analyzed with vision model, combined output returned
+- Frame timestamps: 0s, 20s, 40s, 60s, 80s, 100s (spread evenly)
+**Known Limitation:**
+- Frame sampling is random (regular intervals)
+- Low probability of capturing transient events (~5.5% for 108s video)
+- Future: Hybrid mode using timestamps to guide frame extraction (documented in `user_io/knowledge/hybrid_video_audio_analysis.md`)
+**Status:** Implemented and tested, ready for use
 **Modified Files:**
+- `src/tools/youtube.py` (~200 lines added - frame extraction + analysis)
+- `app.py` (~5 lines modified - UI toggle)
+- `requirements.txt` (1 line added - opencv-python)
+- `src/tools/__init__.py` (1 line modified - tool description)
+## [2026-01-13] [Investigation] [OPEN] HF Spaces vs Local Performance Discrepancy
+**Problem:** HF Space deployment shows significantly lower scores (5%) than local execution (20-30%).
 **Investigation:**
+| Environment      | Score  | System Errors | NoneType Errors |
+| ---------------- | ------ | ------------- | --------------- |
+| **Local**        | 20-30% | 3 (15%)       | 1               |
+| **HF ZeroGPU**   | 5%     | 5 (25%)       | 3               |
+| **HF CPU Basic** | 5%     | 5 (25%)       | 3               |
+**Verified:** Code is 100% identical (cloned HF Space repo, git history matches at commit `3dcf523`).
+**Issue:** HF Spaces infrastructure causes LLM to return empty/None responses during synthesis.
+**Known Limitations (Local 30% Run):**
+- 3 system errors: reverse text (calculator), chess vision (NoneType), Python .py execution
+- 10 "Unable to answer": search evidence extraction issues
+- 1 wrong answer: Wikipedia dinosaur (Jimfbleak vs FunkMonk)
+**Resolution:** Competition accepts local results. HF Spaces deployment not required.
+**Status:** OPEN - Infrastructure Issue, Won't Fix (use local execution)
+## [2026-01-13] [Infrastructure] [COMPLETED] 3-Tier Folder Naming Convention
+**Problem:** Previous rename used `_` prefix for both runtime folders AND user-only folders, creating ambiguity.
+**Solution:** Implemented 3-tier naming convention to clearly distinguish folder purposes.
+**3-Tier Convention:**
+1. **User-only** (`user_*` prefix) - Manual use, not app runtime:
+   - `user_input/` - User testing files, not app input
+   - `user_output/` - User downloads, not app output
+   - `user_dev/` - Dev records (manual documentation)
+   - `user_archive/` - Archived code/reference materials
+2. **Runtime/Internal** (`_` prefix) - App creates, temporary:
+   - `_cache/` - Runtime cache, served via app download
+   - `_log/` - Runtime logs, debugging
+3. **Application** (no prefix) - Permanent code:
+   - `src/`, `test/`, `docs/`, `ref/` - Application folders
+**Folders Renamed:**
+- `_input/` → `user_input/` (user testing files)
+- `_output/` → `user_output/` (user downloads)
+- `dev/` → `user_dev/` (dev records)
+- `archive/` → `user_archive/` (archived materials)
+**Folders Unchanged (correct tier):**
+- `_cache/`, `_log/` - Runtime ✓
+- `src/`, `test/`, `docs/`, `ref/` - Application ✓
+**Updated Files:**
+- **test/test_phase0_hf_vision_api.py** - `Path("_output")` → `Path("user_output")`
+- **.gitignore** - Updated folder references and comments
+**Git Status:**
+- Old folders removed from git tracking
+- New folders excluded by .gitignore
+- Existing files become untracked
+**Result:** Clear 3-tier structure: user*\*, *\*, and no prefix
+## [2026-01-13] [Infrastructure] [COMPLETED] Runtime Folder Naming Convention - Underscore Prefix
+**Problem:** Folders `log/`, `output/`, and `input/` didn't clearly indicate they were runtime-only storage, making it unclear which folders are internal vs permanent.
+**Solution:** Renamed all runtime-only folders to use `_` prefix, following Python convention for internal/private.
+**Folders Renamed:**
+- `log/` → `_log/` (runtime logs, debugging)
+- `output/` → `_output/` (runtime results, user downloads)
+- `input/` → `_input/` (user testing files, not app input)
+**Rationale:**
+- `_` prefix signals "internal, temporary, not part of public API"
+- Consistent with Python convention (`_private`, `__dunder__`)
+- Distinguishes runtime storage from permanent project folders
+**Updated Files:**
+- `src/agent/llm_client.py` - `Path("log")` → `Path("_log")`
+- `src/tools/youtube.py` - `Path("log")` → `Path("_log")`
+- `test/test_phase0_hf_vision_api.py` - `Path("output")` → `Path("_output")`
+- `.gitignore` - Updated folder references
+**Result:** Runtime folders now clearly marked with `_` prefix
+## [2026-01-13] [Documentation] [COMPLETED] Log Consolidation - Session-Level Logging
+**Problem:** Each question created separate log file (`llm_context_TIMESTAMP.txt`), polluting the log/ folder with 20+ files per evaluation.
+**Solution:** Implemented session-level log file where all questions append to single file.
+**Implementation:**
+- Added `get_session_log_file()` function in `src/agent/llm_client.py`
+- Creates `log/llm_session_YYYYMMDD_HHMMSS.txt` on first use
+- All questions append to same file with question delimiters
+- Added `reset_session_log()` for testing/new runs
+**Updated File:**
+- `src/agent/llm_client.py` (~40 lines added)
+  - Session log management (lines 62-99)
+  - Updated `synthesize_answer_hf` to append to session log
+**Result:** One log file per evaluation instead of 20+
+## [2026-01-13] [Infrastructure] [COMPLETED] Project Template Reference Move
+**Problem:** Project template moved to new location, documentation references outdated.
+**Solution:** Updated CHANGELOG.md references to new template location.
+**Changes:**
+- Moved: `project_template_original/` → `ref/project_template_original/`
+- Updated CHANGELOG.md (7 occurrences)
+- Added `ref/` to .gitignore (static copies, not in git)
+**Result:** Documentation reflects new template location
+## [2026-01-12] [Infrastructure] [COMPLETED] Git Ignore Fixes - PDF Commit Block
+**Problem:** Git push rejected due to binary files in `docs/` folder.
+**Solution:**
+1. Reset commit: `git reset --soft HEAD~1`
+2. Added `docs/*.pdf` to .gitignore
+3. Removed PDF files from git: `git rm --cached "docs/*.pdf"`
+4. Recommitted without PDFs
+5. Push successful
+**User feedback:** "can just gitignore all the docs also"
+**Final Fix:** Changed `docs/*.pdf` to `docs/` to ignore entire docs folder
+**Updated Files:**
+- `.gitignore` - Added `docs/` folder ignore
+**Result:** Clean git history, no binary files committed
+## [2026-01-13] [Documentation] [COMPLETED] 30% Results Analysis - Phase 1 Success
+**Problem:** Need to analyze results to understand what's working and what needs improvement.
+**Analysis of gaia_results_20260113_174815.json (30% score):**
+**Results Breakdown:**
+- **6 Correct** (30%):
+  - `a1e91b78` (YouTube bird count) - Phase 1 fix working ✓
+  - `9d191bce` (YouTube Teal'c) - Phase 1 fix working ✓
+  - `6f37996b` (CSV table) - Calculator working ✓
+  - `1f975693` (Calculus MP3) - Audio transcription working ✓
+  - `99c9cc74` (Strawberry pie MP3) - Audio transcription working ✓
+  - `7bd855d8` (Excel food sales) - File parsing working ✓
+- **3 System Errors** (15%):
+  - `2d83110e` (Reverse text) - Calculator: SyntaxError
+  - `cca530fc` (Chess position) - NoneType error (vision)
+  - `f918266a` (Python code) - parse_file: ValueError
+- **10 "Unable to answer"** (50%):
+  - Search evidence extraction insufficient
+  - Need better LLM prompts or search processing
+- **1 Wrong Answer** (5%):
+  - `4fc2f1ae` (Wikipedia dinosaur) - Found "Jimfbleak" instead of "FunkMonk"
+**Phase 1 Impact (YouTube + Audio):**
+- Fixed 4 questions that would have failed before
+- YouTube transcription with Whisper fallback working
+- Audio transcription working well
 **Next Steps:**
+1. Fix 3 system errors (text manipulation, vision NoneType, Python execution)
+2. Improve search evidence extraction (10 questions)
+3. Investigate wrong answer (Wikipedia search precision)
+## [2026-01-13] [Feature] [COMPLETED] Phase 1: YouTube + Audio Transcription Support
+**Problem:** Questions with YouTube videos and audio files couldn't be answered.
+**Solution:** Implemented two-phase transcription system.
+**YouTube Transcription (`src/tools/youtube.py`):**
+- Extracts transcript using `youtube_transcript_api`
+- Falls back to Whisper audio transcription if captions unavailable
+- Saves transcript to `_log/{video_id}_transcript.txt`
+**Audio Transcription (`src/tools/audio.py`):**
+- Uses Groq's Whisper-large-v3 model (ZeroGPU compatible)
+- Supports MP3, WAV, M4A, OGG, FLAC, AAC formats
+- Saves transcript to `_log/` for debugging
+**Impact:**
+- 4 additional questions answered correctly (30% vs ~10% before)
+- `9d191bce` (YouTube Teal'c) - "Extremely" ✓
+- `a1e91b78` (YouTube birds) - "3" ✓
+- `1f975693` (Calculus MP3) - "132, 133, 134, 197, 245" ✓
+- `99c9cc74` (Strawberry pie MP3) - Full ingredient list ✓
+**Status:** Phase 1 complete, hit 30% target score
+## [2026-01-12] [Infrastructure] [COMPLETED] Session Log Implementation
+**Problem:** Need to track LLM synthesis context for debugging and analysis.
+**Solution:** Created session-level logging system in `src/agent/llm_client.py`.
+**Implementation:**
+- Session log: `_log/llm_session_YYYYMMDD_HHMMSS.txt`
+- Per-question log: `_log/{video_id}_transcript.txt` (YouTube only)
+- Captures: questions, evidence items, LLM prompts, answers
+- Structured format with timestamps and delimiters
+**Result:** Full audit trail for debugging failed questions
+## [2026-01-13] [Infrastructure] [COMPLETED] Git Commit & HF Push
+**Problem:** Need to deploy changes to HuggingFace Spaces.
+**Solution:** Committed and pushed latest changes.
+**Commit:** `3dcf523` - "refactor: update folder structure and adjust output paths"
+**Changes Deployed:**
+- 3-tier folder naming convention
+- Session-level logging
+- Project template reference move
+- Git ignore fixes
+**Result:** HF Space updated with latest code
+## [2026-01-13] [Testing] [COMPLETED] Phase 0 Vision API Validation
+**Problem:** Need to validate vision API works before integrating into agent.
+**Solution:** Created test suite `test/test_phase0_hf_vision_api.py`.
+**Test Results:**
+- Tested 4 image sources
+- Validated multimodal LLM responses
+- Confirmed HF Inference API compatibility
+- Identified NoneType edge case (empty responses)
+**File:** `user_io/result_ServerApp/phase0_vision_validation_*.json`
+**Result:** Vision API validated, ready for integration
+## [2026-01-11] [Feature] [COMPLETED] Multi-Modal Vision Support
+**Problem:** Agent couldn't process image-based questions (chess positions, charts, etc.).
+**Solution:** Implemented vision tool using HuggingFace Inference API.
+**Implementation (`src/tools/vision.py`):**
+- `analyze_image()` - Main vision analysis function
+- Supports JPEG, PNG, GIF, BMP, WebP formats
+- Returns detailed descriptions of visual content
+- Fallback to Gemini/Claude if HF fails
+**Status:** Implemented, some NoneType errors remain
+## [2026-01-10] [Feature] [COMPLETED] File Parser Tool
+**Problem:** Agent couldn't read uploaded files (PDF, Excel, Word, CSV, etc.).
+**Solution:** Implemented unified file parser (`src/tools/file_parser.py`).
+**Supported Formats:**
+- PDF (`parse_pdf`) - PyPDF2 extraction
+- Excel (`parse_excel`) - Calamine-based parsing
+- Word (`parse_word`) - python-docx extraction
+- Text/CSV (`parse_text`) - UTF-8 text reading
+- Unified `parse_file()` - Auto-detects format
+**Result:** Agent can now read file attachments
+## [2026-01-09] [Feature] [COMPLETED] Calculator Tool
+**Problem:** Agent couldn't perform mathematical calculations.
+**Solution:** Implemented safe expression evaluator (`src/tools/calculator.py`).
+**Features:**
+- `safe_eval()` - Safe math expression evaluation
+- Supports: arithmetic, algebra, trigonometry, logarithms
+- Constants: pi, e
+- Functions: sqrt, sin, cos, log, abs, etc.
+- Error handling for invalid expressions
+**Result:** CSV table question answered correctly (`6f37996b`)
+## [2026-01-08] [Feature] [COMPLETED] Web Search Tool
+**Problem:** Agent couldn't access current information beyond training data.
+**Solution:** Implemented web search using Tavily API (`src/tools/web_search.py`).
+**Features:**
+- `tavily_search()` - Primary search via Tavily
+- `exa_search()` - Fallback via Exa (if available)
+- Unified `search()` - Auto-fallback chain
+- Returns structured results with titles, snippets, URLs
+**Configuration:**
+- `TAVILY_API_KEY` required
+- `EXA_API_KEY` optional (fallback)
+**Result:** Agent can now search web for current information
+## [2026-01-07] [Infrastructure] [COMPLETED] Project Initialization
+**Problem:** New project setup required.
+**Solution:** Initialized project structure with standard files.
+**Created:**
+- `README.md` - Project documentation
+- `CLAUDE.md` - Project-specific AI instructions
+- `CHANGELOG.md` - Session tracking
+- `.gitignore` - Git exclusions
+- `requirements.txt` - Dependencies
+- `pyproject.toml` - UV package config
+**Result:** Project scaffold ready for development
+**Date:** YYYY-MM-DD
+**Dev Record:** [link to dev/dev_YYMMDD_##_concise_title.md]
+## What Was Changed
+- Change 1
+- Change 2

CLAUDE.md CHANGED Viewed

@@ -4,26 +4,57 @@
 ## Logging Standard
 **Console Output (Status Workflow):**
 - **Compressed status updates:** `[node] ✓ result` or `[node] ✗ error`
 - **Progress indicators:** `[1/1] Processing task_id`, `[1/20]` for batch
 - **Key milestones only:** 3-4 statements vs verbose logs
 - **Node labels:** `[plan]`, `[execute]`, `[answer]` with success/failure
-**Log Files (log/ folder):**
-- **llm_context_TIMESTAMP.txt** - Full LLM prompts, evidence, answers for debugging
-- **{video_id}_transcript.txt** - Raw transcripts from YouTube/Whisper
 - **Purpose:** Post-run analysis, context preservation, audit trail
-- **Format:** Structured headers with timestamp, question, evidence items, full content
-**Log Format Examples:**
 ```
 [plan] ✓ 660 chars
 [execute] 1 tool(s) selected
 [1/1] youtube_transcript ✓
 [execute] 1 tools, 1 evidence
 [answer] ✓ 3
-Context saved to: log/llm_context_20260113_022706.txt
 ```
 **Note:** Explicit user request overrides global rule about "no logs/ folder"

 ## Logging Standard
+**Unified Log Format (All log files MUST use Markdown):**
+- File extension: `.md` (not `.txt`)
+- Headers: `# Title`, `## Section`, `### Subsection`
+- Metadata: `**Key:** value`
+- Code blocks: Triple backticks with language identifier
+- Token-efficient: Use `##` headings instead of `====` separators (95% token savings)
+**Log File Structure Template:**
+```markdown
+# Log Title
+**Session Start:** YYYY-MM-DDTHH:MM:SS
+**Key:** value
+## Section [timestamp]
+**Question:** ...
+**Evidence items:** N
+### Subsection
+```text
+Content here
+```
+**Result:** value
+## Next Section
+```
 **Console Output (Status Workflow):**
 - **Compressed status updates:** `[node] ✓ result` or `[node] ✗ error`
 - **Progress indicators:** `[1/1] Processing task_id`, `[1/20]` for batch
 - **Key milestones only:** 3-4 statements vs verbose logs
 - **Node labels:** `[plan]`, `[execute]`, `[answer]` with success/failure
+**Log Files (_log/ folder):**
+- `llm_session_*.md` - LLM synthesis session with questions, evidence, responses
+- `{video_id}_transcript.md` - Raw transcripts from YouTube/Whisper
 - **Purpose:** Post-run analysis, context preservation, audit trail
+- **Benefits:** Collapsible headings in editors, token-efficient, readable in plain text
+**Console Format Example:**
 ```
 [plan] ✓ 660 chars
 [execute] 1 tool(s) selected
 [1/1] youtube_transcript ✓
 [execute] 1 tools, 1 evidence
 [answer] ✓ 3
+Session saved to: _log/llm_session_20260113_022706.md
 ```
 **Note:** Explicit user request overrides global rule about "no logs/ folder"

app.py CHANGED Viewed

@@ -421,6 +421,7 @@ def process_single_question(agent, item, index, total):
 def run_and_submit_all(
     llm_provider: str,
     question_limit: int = 0,
     task_ids: str = "",
     profile: gr.OAuthProfile | None = None,
@@ -431,6 +432,7 @@ def run_and_submit_all(
     Args:
         llm_provider: LLM provider to use
         question_limit: Limit number of questions (0 = process all)
         task_ids: Comma-separated task IDs to target (overrides question_limit)
         profile: OAuth profile for HF login
@@ -456,6 +458,10 @@ def run_and_submit_all(
     os.environ["LLM_PROVIDER"] = llm_provider.lower()
     logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")
     # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
     try:
         logger.info("Initializing GAIAAgent...")
@@ -728,6 +734,12 @@ with gr.Blocks() as demo:
                     value="HuggingFace",
                     info="Select which LLM to use for all questions",
                 )
                 eval_question_limit = gr.Number(
                     label="Question Limit (Debug)",
                     value=0,
@@ -760,6 +772,7 @@ with gr.Blocks() as demo:
                 fn=run_and_submit_all,
                 inputs=[
                     eval_llm_provider_dropdown,
                     eval_question_limit,
                     eval_task_ids,
                 ],

 def run_and_submit_all(
     llm_provider: str,
+    video_mode: str = "Transcript",
     question_limit: int = 0,
     task_ids: str = "",
     profile: gr.OAuthProfile | None = None,
     Args:
         llm_provider: LLM provider to use
+        video_mode: YouTube processing mode ("Transcript" or "Frames")
         question_limit: Limit number of questions (0 = process all)
         task_ids: Comma-separated task IDs to target (overrides question_limit)
         profile: OAuth profile for HF login
     os.environ["LLM_PROVIDER"] = llm_provider.lower()
     logger.info(f"UI Config for Full Evaluation: LLM_PROVIDER={llm_provider}")
+    # Set YouTube video processing mode from UI selection
+    os.environ["YOUTUBE_MODE"] = video_mode.lower()
+    logger.info(f"UI Config for Full Evaluation: YOUTUBE_MODE={video_mode}")
     # 1. Instantiate Agent (Stage 1: GAIAAgent with LangGraph)
     try:
         logger.info("Initializing GAIAAgent...")
                     value="HuggingFace",
                     info="Select which LLM to use for all questions",
                 )
+                eval_video_mode = gr.Radio(
+                    label="YouTube Processing Mode",
+                    choices=["Transcript", "Frames"],
+                    value="Transcript",
+                    info="Transcript: Audio/subtitle extraction (fast) | Frames: Visual analysis with vision models (slower)",
+                )
                 eval_question_limit = gr.Number(
                     label="Question Limit (Debug)",
                     value=0,
                 fn=run_and_submit_all,
                 inputs=[
                     eval_llm_provider_dropdown,
+                    eval_video_mode,
                     eval_question_limit,
                     eval_task_ids,
                 ],

brainstorming_phase1_youtube.md DELETED Viewed

@@ -1,446 +0,0 @@
-# Phase 1 Brainstorming - YouTube Transcript Support
-**Date:** 2026-01-13
-**Status:** Discussion Phase
-**Goal:** Fix questions #3 and #5 (YouTube videos) → 40% score
----
-## Question Analysis
-| Question | Task ID                                | Description                     | Expected Answer | Type          |
-| -------- | -------------------------------------- | ------------------------------- | --------------- | ------------- |
-| #3       | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species    | "3"             | Content-based |
-| #5       | (Teal'c quote)                         | YouTube video - character quote | "Extremely"     | Dialogue      |
-**Conclusion:** Both are content-based questions → transcript approach should work ✅
----
-## Library Options
-### Option A: youtube-transcript-api ⭐ Recommended
-- **Pros:** Simple API, actively maintained, no video download needed, fast
-- **Cons:** May fail on videos without captions, regional restrictions
-- **Use case:** Start here for simplicity
-### Option B: yt-dlp + transcript extraction
-- **Pros:** More robust, can fall back to auto-generated captions
-- **Cons:** Heavier dependency, slower
-- **Use case:** Backup if Option A has high failure rate
-### Option C: Direct YouTube API
-- **Pros:** Most control
-- **Cons:** Requires API key, more complex
-- **Use case:** Probably overkill for this use case
----
-## Frame Extraction: Corrected Analysis
-**Key insight:** Frame extraction itself is FAST. The "slow" parts are download + vision API processing.
-### Actual Timing Breakdown
-| Step                 | Time (10-min video) | Notes                                  |
-| -------------------- | ------------------- | -------------------------------------- |
-| **Download**         | 30s - 3 min         | Network I/O, one-time cost             |
-| **Frame extraction** | **5 - 20 sec**      | ffmpeg is I/O bound, very efficient ⚡ |
-| **Vision API calls** | 20s - 5 min         | Sequential: 600 frames × 2-5s each     |
-**Reality check:** You can extract 600 frames from a local 10-min video in under 15 seconds with ffmpeg. The "slow" part is vision model API calls, not the extraction.
-**Bottom line:** Frame extraction is cheap compute. Vision processing is expensive compute.
-### Comparison
-| Approach             | What's Fast        | What's Slow                                 | Total Time       |
-| -------------------- | ------------------ | ------------------------------------------- | ---------------- |
-| **Transcript**       | API call (1-3s)    | -                                           | **1-3 seconds**  |
-| **Frame Extraction** | Extraction (5-20s) | Download (30s-3min) + Vision API (20s-5min) | **1-10 minutes** |
-### Do Tools Matter?
-| Tool    | Speed (extraction only) | Verdict         |
-| ------- | ----------------------- | --------------- |
-| ffmpeg  | ⚡⚡⚡ Fastest (5-10s)  | Best choice     |
-| OpenCV  | ⚡⚡ Fast (10-20s)      | Standard choice |
-| moviepy | ⚡ Medium (20-40s)      | Python overhead |
-**For extraction alone:** Tools matter, but all are fast enough.
-### When Is Frame Extraction Worth It?
-**Only when:**
-- Question is purely visual (no audio/transcript available)
-- Visual information is NOT in video thumbnail/title/description
-- You have no other choice
-**Examples where necessary:**
-- "What color shirt is the person wearing at 2:35?"
-- "Count the number of cars visible in the video"
-- "Describe the visual style of the opening scene"
-**For GAIA #3 and #5:**
-- Both are content-based (species mentioned, dialogue)
-- Transcript is still fastest (1-3s vs 1-10 min total)
-- Frame extraction as fallback is viable (extraction is fast, but vision processing is slow)
-**Decision:** Transcript-first approach is correct. Frame extraction is viable fallback if transcript unavailable, but total time still 1-10 min due to download + vision API.
----
-## Fallback Strategy
-**Scenario:** Video has no transcript available
-**Options:**
-1. **Return error** → LLM treats as system_error, skips question ✅ Simple
-2. **Download + extract frames** → Use vision tool (heavy, slow)
-3. **Return metadata** (title, description) → LLM infers from context
-4. **Chain approach:** Transcript → Metadata → Frames
-**Decision:** Start with audio-to-text fallback (Whisper on ZeroGPU) for higher success rate.
----
-## Audio-to-Text Fallback: When No Transcript Available
-### The Hierarchy
-```
-YouTube URL
-    │
-    ├─ Has transcript? ✅ → Use youtube-transcript-api (instant, 1-3 sec)
-    │
-    └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
-```
-### Whisper Cost Analysis
-| Option          | Cost       | Speed          | Verdict            |
-| --------------- | ---------- | -------------- | ------------------ |
-| OpenAI API      | $0.006/min | ⚡⚡⚡ Fastest | If budget OK       |
-| **Open Source** | **FREE**   | ⚡⚡ Fast      | ⭐ **Recommended** |
-| HuggingFace     | FREE       | ⚡⚡ Fast      | Good alternative   |
-**Decision:** Open-source Whisper (free, no API limits, works offline)
----
-### HF Hardware: ZeroGPU ✅
-| Resource   | Available   | Whisper Requirements      | Verdict                           |
-| ---------- | ----------- | ------------------------- | --------------------------------- |
-| **CPU**    | 4 vCPUs     | 1+ cores                  | ✅ Plenty                         |
-| **Memory** | 16 GB RAM   | 1-10 GB (model-dependent) | ✅ Comfortable                    |
-| **Disk**   | 20 GB       | ~150 MB - 1.5 GB          | ✅ More than enough               |
-| **GPU**    | **ZeroGPU** | Optional (faster)         | ✅ **Available via subscription** |
-**ZeroGPU Benefits:**
-- ✅ Dynamic GPU allocation (5-10x faster than CPU)
-- ✅ Can use larger models (`small`, `medium`) for better accuracy
-- ✅ Still free (subscription benefit)
-**ZeroGPU Requirement:**
-⚠️ **Critical:** ZeroGPU requires `@spaces.GPU` decorator on at least one function.
-**Error without decorator:**
-```
-runtime error: No @spaces.GPU function detected during startup
-```
-**Solution:**
-```python
-from spaces import GPU
-@spaces.GPU  # Required for ZeroGPU
-def transcribe_audio(file_path: str) -> str:
-    # Whisper code here
-    pass
-```
-**How it works:**
-- ZeroGPU scans codebase for `@spaces.GPU` decorator at startup
-- If found: Allocates GPU when function is called
-- If not found: Kills container immediately (no GPU work planned)
-### Performance: CPU vs ZeroGPU
-| Model    | On CPU    | On ZeroGPU    | Speedup |
-| -------- | --------- | ------------- | ------- |
-| `base`   | 30-60 sec | **5-10 sec**  | 5-10x   |
-| `small`  | 1-2 min   | **10-20 sec** | 5-10x   |
-| `medium` | 3-5 min   | **20-40 sec** | 5-10x   |
-**For 5-minute YouTube video on ZeroGPU:**
-- `base` model: ~5-10 seconds ⚡⚡⚡
-- `small` model: ~10-20 seconds ⚡⚡
-### Recommended Model for ZeroGPU
-| Model    | Size   | Accuracy  | Speed (ZeroGPU) | Recommendation         |
-| -------- | ------ | --------- | --------------- | ---------------------- |
-| `tiny`   | 39 MB  | Lower     | ~5 sec          | Fastest, less accurate |
-| `base`   | 74 MB  | Good      | ~10 sec         | Good balance           |
-| `small`  | 244 MB | Better    | ~20 sec         | ⭐ **Recommended**     |
-| `medium` | 769 MB | Very good | ~40 sec         | If accuracy critical   |
-**Choice:** `small` model - best accuracy/speed balance on ZeroGPU
-### Implementation: Audio-to-Text Fallback
-```python
-import whisper
-from spaces import GPU  # Required for ZeroGPU
-_MODEL = None  # Cache model globally
-@spaces.GPU  # Required: ZeroGPU detects this decorator at startup
-def transcribe_audio(file_path: str) -> str:
-    """Transcribe audio file using Whisper (ZeroGPU)."""
-    global _MODEL
-    try:
-        if _MODEL is None:
-            # ZeroGPU auto-detects GPU, no manual device specification
-            _MODEL = whisper.load_model("small")
-        result = _MODEL.transcribe(file_path)
-        return result["text"]
-    except Exception as e:
-        return f"ERROR: Transcription failed: {e}"
-```
----
-### Unified Architecture: Phase 1 + Phase 2
-```
-┌─────────────────────────────────────────────────────────┐
-│                   Audio Transcription                   │
-│              (transcribe_audio function)                │
-│                    Uses Whisper                         │
-│                  on ZeroGPU                             │
-└─────────────────────────────────────────────────────────┘
-                            ▲
-                            │
-        ┌───────────────────┴───────────────────┐
-        │                                       │
-   Phase 1                                Phase 2
-  YouTube URLs                            MP3 Files
-        │                                       │
-        │ 1. Try youtube-transcript-api        │
-        │ 2. Fallback: download audio only      │
-        │ 3. Call transcribe_audio()            │
-        │                                       │
-        └───────────────────┬───────────────────┘
-                            │
-                    Clean transcript
-                            │
-                            ▼
-                      LLM analyzes
-```
-**Benefits:**
-- Single audio processing codebase
-- `transcribe_audio()` works for both phases
-- Tested on HF ZeroGPU hardware
-- Higher success rate than skip-only approach
----
-## Tool Design - LLM Integration
-**Current problem:** Vision tool tries to process YouTube URL → fails
-**Proposed tool description:**
-```
-"Extract transcript from YouTube video URL. Use when question asks about
-YouTube video content like: dialogue, speech, bird species identification,
-character quotes, or any content discussed in the video. Input: YouTube URL.
-Returns: Full transcript text or error message if transcript unavailable."
-```
-**Alternative: Special URL handling in `parse_file()`**
-- Detect YouTube URLs
-- Return tool suggestion: "This is a YouTube URL. Consider using youtube_transcript tool."
----
-## Implementation Considerations
-### A. Video ID Extraction
-Handle various YouTube URL formats:
-- `youtube.com/watch?v=VIDEO_ID`
-- `youtu.be/VIDEO_ID`
-- `youtube.com/shorts/VIDEO_ID`
-### B. Language Handling
-- GAIA questions are English → likely English transcripts
-- Question: Should we auto-translate or let LLM handle?
-### C. Transcript Format
-- Raw JSON with timestamps vs clean text
-- LLM prefers clean text without timestamps
-- Question: Preserve timestamps for context?
-### D. Error Types
-- No transcript available
-- Video private/deleted
-- Rate limiting
-- Regional restriction
----
-## Testing Strategy
-**Before full evaluation:**
-1. **Unit test** - Test on actual GAIA YouTube URLs
-2. **Manual test** - Run single question (#3) to verify LLM uses tool correctly
-3. **Integration test** - Verify transcript → answer pipeline
-**Question:** Do we have access to actual YouTube URLs for pre-testing?
----
-## Edge Cases
-| Scenario                          | Handling                          |
-| --------------------------------- | --------------------------------- |
-| Multiple transcript languages     | Pick English or first available   |
-| Auto-generated transcript         | Accept (less accurate but usable) |
-| YouTube Shorts format             | Extract VIDEO_ID from shorts URL  |
-| Segmented transcript (by speaker) | Clean to plain text               |
----
-## Recommendations
-1. **Start simple:** youtube-transcript-api with clear error messages
-2. **Fail gracefully:** If no transcript, return structured error → system_error=yes
-3. **Tool description:** Emphasize "YouTube video content" for LLM selection
-4. **Manual test first:** Verify on question #3 before full evaluation
-5. **Success metric:** Both questions correct → 40% score ✅ TARGET REACHED
----
-## Open Questions
-- [ ] Implement fallback to frame extraction if transcript fails?
-- [ ] Add special YouTube URL detection in `parse_file()`?
-- [ ] Access to actual YouTube URLs for pre-testing?
-- [ ] Simple first vs comprehensive solution?
----
-## Files to Create
-- `src/tools/audio.py` - Whisper transcription with @spaces.GPU (unified Phase 1+2)
-- `src/tools/youtube.py` - YouTube transcript extraction with audio fallback
-- Update `src/tools/__init__.py` - Register youtube_transcript and transcribe_audio tools
-- Update `requirements.txt` - Add youtube-transcript-api, openai-whisper, yt-dlp
----
-## Industry Validation ✅
-**Overall Assessment:** Approach validated and aligns with industry standards.
-### Core Architecture Validation
-| Component        | Our Approach               | Industry Standard                                 | Status       |
-| ---------------- | -------------------------- | ------------------------------------------------- | ------------ |
-| Primary method   | Transcript-first           | youtube-transcript-api → Whisper fallback         | ✅ Confirmed |
-| Library choice   | youtube-transcript-api     | Widely used (LangChain, CrewAI, 1K+ GitHub repos) | ✅ Standard  |
-| Fallback method  | Whisper on ZeroGPU         | yt-dlp + Whisper (OpenAI API or self-hosted)      | ✅ Optimal   |
-| Frame extraction | Skip for content questions | Only for visual queries                           | ✅ Validated |
-### Key Findings
-**Transcript-First Approach:**
-- LangChain's YoutubeLoader uses youtube-transcript-api as primary
-- CrewAI demonstrates YouTube transcript → Gemini LLM workflow
-- 92% of English tech videos have auto-captions available
-- Industry standard: transcript → LLM pattern
-**Frame Extraction Performance:**
-- ffmpeg decodes at 30-100x realtime speed
-- 10-min video extracts in 5-20 seconds (CPU) ✅ Confirmed
-- Bottleneck is vision API calls, not extraction ✅ Confirmed
-**Vision Processing Costs:**
-| Model | Cost per 600 frames (10-min video) |
-|-------|-----------------------------------|
-| GPT-4o | $1.80-3.60 |
-| Claude 3.5 | $2.16 |
-| Gemini 2.5 Flash | $23.40 |
-**Whisper Fallback:**
-- Industry standard: yt-dlp for audio → Whisper transcription
-- ZeroGPU approach is optimal for HF environment
-- Benchmark: Whisper.cpp transcribes 10-min clips in <90 seconds on M2 MacBook (CPU)
-- ZeroGPU with H200: 5-20 seconds for `small` model ✅ Estimate correct
-### Industry Pattern
-**Standard workflow (validated):**
-1. Try native transcript API (fast, free)
-2. Fallback to audio transcription (Whisper)
-3. Frame extraction only for visual-specific queries
-4. Vision LLM last resort (expensive, slow)
-### Real-World Implementations
-- **Alibaba:** 87 videos processed, Whisper.cpp averaged <90 seconds per 10-min clip
-- **Phantra (GitHub):** YouTube Transcript API → GPT-4o multi-agent system
-- **ytscript toolkit:** Transcript extraction → Claude/ChatGPT analysis
-- **Multiple RAG systems:** Transcript → embeddings → LLM Q&A
-### Final Verdict
-✅ Library choices validated
-✅ Cost analysis accurate
-✅ Performance estimates correct
-✅ Architecture follows best practices
-✅ ZeroGPU setup appropriate
-**No changes needed. Proceed with implementation.**
----
-## Next Steps (Discussion → Implementation)
-1. [x] Confirm approach based on video processing research ✅
-2. [ ] Install youtube-transcript-api and openai-whisper
-3. [ ] Create audio.py with @spaces.GPU decorator (unified Phase 1+2)
-4. [ ] Create youtube.py with transcript extraction + audio fallback
-5. [ ] Add tools to TOOLS registry
-6. [ ] Manual test on question #3
-7. [ ] Full evaluation
-8. [ ] Verify 40% score (4/20 correct)

pyproject.toml CHANGED Viewed

@@ -38,6 +38,9 @@ dependencies = [
     "tenacity>=9.1.2",
     "datasets>=4.4.0",
     "groq>=1.0.0",
 ]
 [tool.uv]

     "tenacity>=9.1.2",
     "datasets>=4.4.0",
     "groq>=1.0.0",
+    "opencv-python>=4.12.0.88",
+    "ipykernel>=7.1.0",
+    "pip>=25.3",
 ]
 [tool.uv]

requirements.txt CHANGED Viewed

@@ -43,7 +43,8 @@ pillow>=10.4.0             # Image files (JPEG, PNG, etc.)
 # Audio/Video processing (Phase 1: YouTube support)
 youtube-transcript-api>=0.6.0  # YouTube transcript extraction
 openai-whisper>=20231117       # Audio transcription ( Whisper)
-yt-dlp>=2024.0.0               # Audio extraction from videos
 # ============================================================================
 # Existing Dependencies (from current app.py)

 # Audio/Video processing (Phase 1: YouTube support)
 youtube-transcript-api>=0.6.0  # YouTube transcript extraction
 openai-whisper>=20231117       # Audio transcription ( Whisper)
+yt-dlp>=2024.0.0               # Audio/video extraction from YouTube
+opencv-python>=4.8.0           # Frame extraction from video
 # ============================================================================
 # Existing Dependencies (from current app.py)

src/agent/llm_client.py CHANGED Viewed

@@ -60,6 +60,7 @@ logger = logging.getLogger(__name__)
 # ============================================================================
 _SESSION_LOG_FILE = None
 def get_session_log_file() -> Path:
@@ -78,25 +79,23 @@ def get_session_log_file() -> Path:
         log_dir = Path("_log")
         log_dir.mkdir(exist_ok=True)
-        # Create session filename with timestamp
         timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
-        _SESSION_LOG_FILE = log_dir / f"llm_session_{timestamp}.txt"
-        # Write session header
         with open(_SESSION_LOG_FILE, "w", encoding="utf-8") as f:
-            f.write("=" * 80 + "\n")
-            f.write("LLM SYNTHESIS SESSION LOG\n")
-            f.write("=" * 80 + "\n")
-            f.write(f"Session Start: {datetime.datetime.now().isoformat()}\n")
-            f.write("=" * 80 + "\n\n")
     return _SESSION_LOG_FILE
 def reset_session_log():
     """Reset session log file (for testing or new evaluation run)."""
-    global _SESSION_LOG_FILE
     _SESSION_LOG_FILE = None
 # ============================================================================
@@ -1124,6 +1123,8 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
 def synthesize_answer_hf(question: str, evidence: List[str]) -> str:
     """Synthesize factoid answer from evidence using HuggingFace Inference API."""
     client = create_hf_client()
     # Format evidence
@@ -1166,32 +1167,37 @@ FINAL ANSWER: 3
 Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""
     # ============================================================================
-    # SAVE LLM CONTEXT TO SESSION LOG - Single file per evaluation run
     # ============================================================================
     context_file = get_session_log_file()
-    with open(context_file, "a", encoding="utf-8") as f:
-        f.write("\n" + "=" * 80 + "\n")
-        f.write("QUESTION START\n")
-        f.write("=" * 80 + "\n")
-        f.write(f"Timestamp: {datetime.datetime.now().isoformat()}\n")
-        f.write(f"Question: {question}\n")
-        f.write(f"Evidence items: {len(evidence)}\n")
-        f.write("\n" + "=" * 80 + "\n")
-        f.write("SYSTEM PROMPT:\n")
-        f.write("=" * 80 + "\n")
-        f.write(system_prompt)
-        f.write("\n" + "=" * 80 + "\n")
-        f.write("USER PROMPT:\n")
-        f.write("=" * 80 + "\n")
-        f.write(user_prompt)
-        f.write("\n" + "=" * 80 + "\n")
-        f.write("EVIDENCE ITEMS:\n")
-        f.write("=" * 80 + "\n")
-        for i, ev in enumerate(evidence):
-            f.write(f"\n--- Evidence {i+1}/{len(evidence)} ---\n")
-            f.write(ev)
-        f.write("\n" + "=" * 80 + "\n")
     messages = [
         {"role": "system", "content": system_prompt},
@@ -1218,17 +1224,23 @@ Extract the factoid answer from the evidence above. Return only the factoid, not
     logger.info(f"[synthesize_answer_hf] Answer: {answer}")
-    # Append LLM response to session log (includes reasoning)
     with open(context_file, "a", encoding="utf-8") as f:
-        f.write("\n" + "=" * 80 + "\n")
-        f.write("LLM RESPONSE (with reasoning):\n")
-        f.write("=" * 80 + "\n")
-        f.write(full_response)
-        f.write("\n" + "=" * 80 + "\n")
-        f.write(f"\nEXTRACTED FINAL ANSWER: {answer}\n")
-        f.write("=" * 80 + "\n")
-        f.write("QUESTION END\n")
-        f.write("=" * 80 + "\n")
     return answer

 # ============================================================================
 _SESSION_LOG_FILE = None
+_SYSTEM_PROMPT_WRITTEN = False
 def get_session_log_file() -> Path:
         log_dir = Path("_log")
         log_dir.mkdir(exist_ok=True)
+        # Create session filename with timestamp (use .md for Markdown)
         timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        _SESSION_LOG_FILE = log_dir / f"llm_session_{timestamp}.md"
+        # Write session header in Markdown
         with open(_SESSION_LOG_FILE, "w", encoding="utf-8") as f:
+            f.write("# LLM Synthesis Session Log\n\n")
+            f.write(f"**Session Start:** {datetime.datetime.now().isoformat()}\n\n")
     return _SESSION_LOG_FILE
 def reset_session_log():
     """Reset session log file (for testing or new evaluation run)."""
+    global _SESSION_LOG_FILE, _SYSTEM_PROMPT_WRITTEN
     _SESSION_LOG_FILE = None
+    _SYSTEM_PROMPT_WRITTEN = False
 # ============================================================================
 def synthesize_answer_hf(question: str, evidence: List[str]) -> str:
     """Synthesize factoid answer from evidence using HuggingFace Inference API."""
+    global _SYSTEM_PROMPT_WRITTEN
     client = create_hf_client()
     # Format evidence
 Extract the factoid answer from the evidence above. Return only the factoid, nothing else."""
     # ============================================================================
+    # BUFFER QUESTION CONTEXT - Write complete block atomically after response
     # ============================================================================
     context_file = get_session_log_file()
+    question_timestamp = datetime.datetime.now().isoformat()
+    # Build question header (include system prompt only on first question)
+    system_prompt_section = ""
+    if not _SYSTEM_PROMPT_WRITTEN:
+        system_prompt_section = f"""
+## System Prompt (static - used for all questions)
+```text
+{system_prompt}
+```
+"""
+        _SYSTEM_PROMPT_WRITTEN = True
+    question_header = f"""
+## Question [{question_timestamp}]
+**Question:** {question}
+**Evidence items:** {len(evidence)}
+{system_prompt_section}
+### Evidence & Prompt
+```text
+{user_prompt}
+```
+"""
     messages = [
         {"role": "system", "content": system_prompt},
     logger.info(f"[synthesize_answer_hf] Answer: {answer}")
+    # ============================================================================
+    # WRITE COMPLETE QUESTION BLOCK ATOMICALLY (header + response + end)
+    # ============================================================================
+    complete_block = f"""{question_header}
+### LLM Response
+```text
+{full_response}
+```
+**Extracted Answer:** `{answer}`
+"""
     with open(context_file, "a", encoding="utf-8") as f:
+        f.write(complete_block)
     return answer

src/tools/__init__.py CHANGED Viewed

@@ -82,7 +82,7 @@ TOOLS = {
     },
     "youtube_transcript": {
         "function": youtube_transcript,
-        "description": "Extract transcript from YouTube video URLs (youtube.com, youtu.be, shorts). Use this tool FIRST when question mentions YouTube, video, or contains a YouTube URL. This tool handles video content by extracting the transcript (what is said/discussed in the video). Falls back to Whisper audio transcription if captions are unavailable. This is the ONLY tool that can process YouTube URLs directly.",
         "parameters": {
             "url": {
                 "description": "YouTube video URL (youtube.com/watch?v=ID, youtu.be/ID, or shorts/ID format)",

     },
     "youtube_transcript": {
         "function": youtube_transcript,
+        "description": "Extract transcript from YouTube video URLs (youtube.com, youtu.be, shorts) OR analyze video frames visually. Use this tool FIRST when question mentions YouTube, video, or contains a YouTube URL. This tool handles video content in two modes: (1) Transcript mode extracts what is said/discussed via captions or Whisper fallback, (2) Frame mode extracts and analyzes video frames with vision models. Mode is controlled by YOUTUBE_MODE env variable. This is the ONLY tool that can process YouTube URLs directly.",
         "parameters": {
             "url": {
                 "description": "YouTube video URL (youtube.com/watch?v=ID, youtu.be/ID, or shorts/ID format)",

src/tools/youtube.py CHANGED Viewed

@@ -1,23 +1,29 @@
 """
-YouTube Transcript Tool - Extract transcripts from YouTube videos
 Author: @mangobee
 Date: 2026-01-13
-Provides YouTube video transcript extraction:
-- Primary: youtube-transcript-api (instant, 1-3 seconds)
-- Fallback: yt-dlp audio extraction + Whisper transcription (30s-2min)
-- Handles various YouTube URL formats (watch, youtu.be, shorts)
-- Returns clean transcript text for LLM analysis
-Workflow:
     YouTube URL
         ├─ Has transcript? ✅ → Use youtube-transcript-api (instant)
         └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
 Requirements:
 - youtube-transcript-api: pip install youtube-transcript-api
 - yt-dlp: pip install yt-dlp
-- openai-whisper: pip install openai-whisper (via src.tools.audio)
 """
 import logging
@@ -39,6 +45,10 @@ YOUTUBE_PATTERNS = [
 AUDIO_FORMAT = "mp3"
 AUDIO_QUALITY = "128"  # 128 kbps (sufficient for speech)
 # Temporary file cleanup
 CLEANUP_TEMP_FILES = True
@@ -54,7 +64,7 @@ logger = logging.getLogger(__name__)
 def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
     """
-    Save transcript to log/ folder for debugging.
     Args:
         video_id: YouTube video ID
@@ -65,14 +75,15 @@ def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
         log_dir = Path("_log")
         log_dir.mkdir(exist_ok=True)
-        cache_file = log_dir / f"{video_id}_transcript.txt"
         with open(cache_file, "w", encoding="utf-8") as f:
-            f.write(f"# YouTube Transcript\n")
-            f.write(f"# Video ID: {video_id}\n")
-            f.write(f"# Source: {source}\n")
-            f.write(f"# Length: {len(text)} characters\n")
-            f.write(f"# Generated: {__import__('datetime').datetime.now().isoformat()}\n")
-            f.write(f"\n{text}\n")
         logger.info(f"Transcript saved: {cache_file}")
     except Exception as e:
@@ -343,35 +354,329 @@ def transcribe_from_audio(video_url: str) -> Dict[str, Any]:
         }
 # ============================================================================
 # Main API Function
 # =============================================================================
-def youtube_transcript(url: str) -> Dict[str, Any]:
     """
-    Extract transcript from YouTube video.
-    Primary method: youtube-transcript-api (instant)
-    Fallback method: Download audio + Whisper transcription (slower)
     Args:
         url: YouTube video URL (youtube.com, youtu.be, shorts)
     Returns:
         Dict with structure: {
-            "text": str,           # Transcript text
             "video_id": str,       # Video ID
-            "source": str,         # "api" or "whisper"
-            "success": bool,       # True if transcription succeeded
             "error": str or None   # Error message if failed
         }
     Raises:
-        ValueError: If URL is not a valid YouTube URL
     Examples:
-        >>> youtube_transcript("https://youtube.com/watch?v=dQw4w9WgXcQ")
         {"text": "Never gonna give you up...", "video_id": "dQw4w9WgXcQ", "source": "api", "success": True, "error": None}
     """
     # Validate URL and extract video ID
     video_id = extract_video_id(url)
@@ -386,26 +691,71 @@ def youtube_transcript(url: str) -> Dict[str, Any]:
             "error": f"Invalid YouTube URL: {url}"
         }
-    logger.info(f"Processing YouTube video: {video_id}")
-    # Try transcript API first (fast)
-    result = get_youtube_transcript(video_id)
-    if result["success"]:
-        logger.info(f"Transcript retrieved via API: {len(result['text'])} characters")
-        # Log transcript to file for debugging
-        logger.info(f"Transcript content: {result['text'][:200]}...")
         return result
-    # Fallback to audio transcription (slow but works)
-    logger.info(f"Transcript API failed, trying audio transcription...")
-    result = transcribe_from_audio(url)
-    if result["success"]:
-        logger.info(f"Transcript retrieved via Whisper: {len(result['text'])} characters")
-        # Log full transcript for debugging
-        logger.info(f"Full transcript: {result['text']}")
-    else:
-        logger.error(f"All transcript methods failed for video: {video_id}")
-    return result

 """
+YouTube Video Analysis Tool - Extract transcripts or analyze frames from YouTube videos
 Author: @mangobee
 Date: 2026-01-13
+Provides two modes for YouTube video analysis:
+- Transcript Mode: youtube-transcript-api (instant, 1-3 seconds) or Whisper fallback
+- Frame Mode: Extract video frames and analyze with vision models
+Transcript Mode Workflow:
     YouTube URL
         ├─ Has transcript? ✅ → Use youtube-transcript-api (instant)
         └─ No transcript? ❌ → Download audio + Whisper (slower, but works)
+Frame Mode Workflow:
+    YouTube URL
+        ├─ Download video with yt-dlp
+        ├─ Extract N frames at regular intervals
+        └─ Analyze frames with vision models (summarize findings)
 Requirements:
 - youtube-transcript-api: pip install youtube-transcript-api
 - yt-dlp: pip install yt-dlp
+- openai: pip install openai (via src.tools.audio)
+- opencv-python: pip install opencv-python (for frame extraction)
+- PIL: pip install Pillow (for image handling)
 """
 import logging
 AUDIO_FORMAT = "mp3"
 AUDIO_QUALITY = "128"  # 128 kbps (sufficient for speech)
+# Frame extraction settings
+FRAME_COUNT = 6  # Number of frames to extract
+FRAME_QUALITY = "worst"  # YouTube-dl format quality for frame extraction (worst = faster download)
 # Temporary file cleanup
 CLEANUP_TEMP_FILES = True
 def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
     """
+    Save transcript to _log/ folder for debugging.
     Args:
         video_id: YouTube video ID
         log_dir = Path("_log")
         log_dir.mkdir(exist_ok=True)
+        cache_file = log_dir / f"{video_id}_transcript.md"
         with open(cache_file, "w", encoding="utf-8") as f:
+            f.write(f"# YouTube Transcript\n\n")
+            f.write(f"**Video ID:** {video_id}\n")
+            f.write(f"**Source:** {source}\n")
+            f.write(f"**Length:** {len(text)} characters\n")
+            f.write(f"**Generated:** {__import__('datetime').datetime.now().isoformat()}\n\n")
+            f.write(f"## Transcript\n\n")
+            f.write(f"{text}\n")
         logger.info(f"Transcript saved: {cache_file}")
     except Exception as e:
         }
+# ============================================================================
+# Frame Processing (Video Analysis Mode)
+# =============================================================================
+def download_video(url: str) -> Optional[str]:
+    """
+    Download video from YouTube using yt-dlp for frame extraction.
+    Args:
+        url: Full YouTube URL
+    Returns:
+        Path to downloaded video file or None if failed
+    """
+    try:
+        import yt_dlp
+        logger.info(f"Downloading video from: {url}")
+        # Create temp file for video
+        temp_dir = tempfile.gettempdir()
+        output_path = os.path.join(temp_dir, f"youtube_video_{os.getpid()}")
+        # yt-dlp options: video only, lowest quality (faster for frame extraction)
+        ydl_opts = {
+            'format': f'best[ext=mp4]/best',
+            'outtmpl': output_path,
+            'quiet': True,
+            'no_warnings': True,
+        }
+        with yt_dlp.YoutubeDL(ydl_opts) as ydl:
+            ydl.download([url])
+        # Find the downloaded file (yt-dlp adds extension)
+        for file in os.listdir(temp_dir):
+            if file.startswith(f"youtube_video_{os.getpid()}"):
+                actual_path = os.path.join(temp_dir, file)
+                size_mb = os.path.getsize(actual_path) / (1024 * 1024)
+                logger.info(f"Video downloaded: {actual_path} ({size_mb:.2f}MB)")
+                return actual_path
+        logger.error("Video file not found after download")
+        return None
+    except ImportError:
+        logger.error("yt-dlp not installed. Run: pip install yt-dlp")
+        return None
+    except Exception as e:
+        logger.error(f"Video download failed: {e}")
+        return None
+def extract_frames(video_path: str, count: int = FRAME_COUNT) -> list:
+    """
+    Extract frames from video at regular intervals.
+    Args:
+        video_path: Path to video file
+        count: Number of frames to extract (default: FRAME_COUNT)
+    Returns:
+        List of (frame_path, timestamp) tuples
+    """
+    try:
+        import cv2
+        cap = cv2.VideoCapture(video_path)
+        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+        fps = cap.get(cv2.CAP_PROP_FPS)
+        duration = total_frames / fps if fps > 0 else 0
+        logger.info(f"Video: {total_frames} frames, {fps:.2f} FPS, {duration:.2f}s duration")
+        # Calculate frame indices at regular intervals
+        if total_frames <= count:
+            frame_indices = list(range(total_frames))
+        else:
+            interval = total_frames / count
+            frame_indices = [int(i * interval) for i in range(count)]
+        logger.info(f"Extracting {len(frame_indices)} frames at indices: {frame_indices[:3]}...")
+        frames = []
+        temp_dir = tempfile.gettempdir()
+        for idx, frame_idx in enumerate(frame_indices):
+            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
+            ret, frame = cap.read()
+            if ret:
+                timestamp = frame_idx / fps if fps > 0 else 0
+                frame_path = os.path.join(temp_dir, f"frame_{os.getpid()}_{idx}.jpg")
+                cv2.imwrite(frame_path, frame)
+                frames.append((frame_path, timestamp))
+                logger.debug(f"Frame {idx}: {timestamp:.2f}s -> {frame_path}")
+            else:
+                logger.warning(f"Failed to extract frame at index {frame_idx}")
+        cap.release()
+        logger.info(f"Extracted {len(frames)} frames")
+        return frames
+    except ImportError:
+        logger.error("opencv-python not installed. Run: pip install opencv-python")
+        return []
+    except Exception as e:
+        logger.error(f"Frame extraction failed: {e}")
+        return []
+def analyze_frames(frames: list, question: str = None) -> Dict[str, Any]:
+    """
+    Analyze video frames using vision models.
+    Args:
+        frames: List of (frame_path, timestamp) tuples
+        question: Optional question to ask about frames
+    Returns:
+        Dict with structure: {
+            "text": str,           # Summarized analysis
+            "video_id": str,       # Video ID (placeholder)
+            "source": str,         # "frames"
+            "success": bool,       # True if analysis succeeded
+            "error": str or None   # Error message if failed
+            "frame_count": int,    # Number of frames analyzed
+        }
+    """
+    from src.tools.vision import analyze_image
+    if not frames:
+        return {
+            "text": "",
+            "video_id": "",
+            "source": "frames",
+            "success": False,
+            "error": "No frames to analyze",
+            "frame_count": 0,
+        }
+    # Default question for frame analysis
+    if not question:
+        question = "Describe what you see in this frame. Include any visible text, objects, people, or actions."
+    try:
+        logger.info(f"Analyzing {len(frames)} frames with vision model...")
+        frame_analyses = []
+        for idx, (frame_path, timestamp) in enumerate(frames):
+            logger.info(f"Analyzing frame {idx + 1}/{len(frames)} at {timestamp:.2f}s...")
+            # Customize question with timestamp context
+            frame_question = f"This is frame {idx + 1} of {len(frames)} from a video at timestamp {timestamp:.2f} seconds. {question}"
+            try:
+                result = analyze_image(frame_path, frame_question)
+                answer = result.get("answer", "")
+                # Add timestamp context
+                frame_analyses.append(f"[Frame {idx + 1} @ {timestamp:.2f}s]\n{answer}")
+                logger.info(f"Frame {idx + 1} analyzed: {len(answer)} chars")
+            except Exception as e:
+                logger.warning(f"Frame {idx + 1} analysis failed: {e}")
+                frame_analyses.append(f"[Frame {idx + 1} @ {timestamp:.2f}s]\nAnalysis failed: {str(e)}")
+        # Cleanup frame files
+        if CLEANUP_TEMP_FILES:
+            for frame_path, _ in frames:
+                try:
+                    os.remove(frame_path)
+                except Exception as e:
+                    logger.warning(f"Failed to cleanup frame {frame_path}: {e}")
+        # Combine all frame analyses
+        combined_text = "\n\n".join(frame_analyses)
+        logger.info(f"Frame analysis complete: {len(combined_text)} chars total")
+        return {
+            "text": combined_text,
+            "video_id": "",
+            "source": "frames",
+            "success": True,
+            "error": None,
+            "frame_count": len(frames),
+        }
+    except Exception as e:
+        logger.error(f"Frame analysis failed: {e}")
+        return {
+            "text": "",
+            "video_id": "",
+            "source": "frames",
+            "success": False,
+            "error": f"Frame analysis failed: {str(e)}",
+            "frame_count": len(frames),
+        }
+def process_video_frames(url: str, question: str = None, frame_count: int = FRAME_COUNT) -> Dict[str, Any]:
+    """
+    Download video, extract frames, and analyze with vision models.
+    Args:
+        url: Full YouTube URL
+        question: Optional question to ask about frames
+        frame_count: Number of frames to extract
+    Returns:
+        Dict with structure: {
+            "text": str,           # Combined frame analyses
+            "video_id": str,       # Video ID
+            "source": str,         # "frames"
+            "success": bool,       # True if processing succeeded
+            "error": str or None   # Error message if failed
+            "frame_count": int     # Number of frames analyzed
+        }
+    """
+    video_id = extract_video_id(url)
+    if not video_id:
+        return {
+            "text": "",
+            "video_id": "",
+            "source": "frames",
+            "success": False,
+            "error": "Invalid YouTube URL",
+            "frame_count": 0,
+        }
+    # Download video
+    video_file = download_video(url)
+    if not video_file:
+        return {
+            "text": "",
+            "video_id": video_id,
+            "source": "frames",
+            "success": False,
+            "error": "Failed to download video",
+            "frame_count": 0,
+        }
+    try:
+        # Extract frames
+        frames = extract_frames(video_file, frame_count)
+        if not frames:
+            return {
+                "text": "",
+                "video_id": video_id,
+                "source": "frames",
+                "success": False,
+                "error": "Failed to extract frames",
+                "frame_count": 0,
+            }
+        # Analyze frames
+        result = analyze_frames(frames, question)
+        # Cleanup temp video file
+        if CLEANUP_TEMP_FILES:
+            try:
+                os.remove(video_file)
+                logger.info(f"Cleaned up temp video: {video_file}")
+            except Exception as e:
+                logger.warning(f"Failed to cleanup temp video: {e}")
+        # Add video_id to result
+        result["video_id"] = video_id
+        return result
+    except Exception as e:
+        logger.error(f"Video frame processing failed: {e}")
+        return {
+            "text": "",
+            "video_id": video_id,
+            "source": "frames",
+            "success": False,
+            "error": f"Video processing failed: {str(e)}",
+            "frame_count": 0,
+        }
 # ============================================================================
 # Main API Function
 # =============================================================================
+def youtube_analyze(url: str, mode: str = "transcript") -> Dict[str, Any]:
     """
+    Analyze YouTube video using transcript or frame processing mode.
+    Transcript Mode: Extract transcript (youtube-transcript-api or Whisper)
+    Frame Mode: Extract frames and analyze with vision models
     Args:
         url: YouTube video URL (youtube.com, youtu.be, shorts)
+        mode: Analysis mode - "transcript" (default) or "frames"
     Returns:
         Dict with structure: {
+            "text": str,           # Transcript or frame analyses
             "video_id": str,       # Video ID
+            "source": str,         # "api", "whisper", or "frames"
+            "success": bool,       # True if analysis succeeded
             "error": str or None   # Error message if failed
+            "frame_count": int     # Number of frames (frame mode only)
         }
     Raises:
+        ValueError: If URL is not valid or mode is invalid
     Examples:
+        >>> youtube_analyze("https://youtube.com/watch?v=dQw4w9WgXcQ", mode="transcript")
         {"text": "Never gonna give you up...", "video_id": "dQw4w9WgXcQ", "source": "api", "success": True, "error": None}
+        >>> youtube_analyze("https://youtube.com/watch?v=dQw4w9WgXcQ", mode="frames")
+        {"text": "[Frame 1 @ 0.00s]\nA man...", "video_id": "dQw4w9WgXcQ", "source": "frames", "success": True, "frame_count": 6, "error": None}
     """
     # Validate URL and extract video ID
     video_id = extract_video_id(url)
             "error": f"Invalid YouTube URL: {url}"
         }
+    # Validate mode
+    mode = mode.lower()
+    if mode not in ("transcript", "frames"):
+        logger.error(f"Invalid mode: {mode}")
+        return {
+            "text": "",
+            "video_id": video_id,
+            "source": "none",
+            "success": False,
+            "error": f"Invalid mode: {mode}. Valid: transcript, frames"
+        }
+    logger.info(f"Processing YouTube video: {video_id} (mode: {mode})")
+    # Route to appropriate processing mode
+    if mode == "frames":
+        # Frame processing mode
+        result = process_video_frames(url)
+        if result["success"]:
+            logger.info(f"Frame analysis complete: {result.get('frame_count', 0)} frames, {len(result['text'])} chars")
+        return result
+    else:  # mode == "transcript"
+        # Transcript mode: Try API first, fallback to Whisper
+        result = get_youtube_transcript(video_id)
+        if result["success"]:
+            logger.info(f"Transcript retrieved via API: {len(result['text'])} characters")
+            logger.info(f"Transcript content: {result['text'][:200]}...")
+            return result
+        # Fallback to audio transcription (slow but works)
+        logger.info(f"Transcript API failed, trying audio transcription...")
+        result = transcribe_from_audio(url)
+        if result["success"]:
+            logger.info(f"Transcript retrieved via Whisper: {len(result['text'])} characters")
+            logger.info(f"Full transcript: {result['text']}")
+        else:
+            logger.error(f"All transcript methods failed for video: {video_id}")
         return result
+# Backward compatibility wrapper that respects YOUTUBE_MODE environment variable
+def youtube_transcript(url: str) -> Dict[str, Any]:
+    """
+    Wrapper for youtube_analyze that respects YOUTUBE_MODE environment variable.
+    This allows the agent to switch between transcript and frame modes
+    without changing the function signature used in the graph.
+    Mode selection:
+    - YOUTUBE_MODE env variable (set by UI): "transcript" or "frames"
+    - Default: "transcript" (backward compatible)
+    Args:
+        url: YouTube video URL
+    Returns:
+        Dict with structure from youtube_analyze()
+    """
+    # Read mode from environment variable (set by app.py UI)
+    mode = os.getenv("YOUTUBE_MODE", "transcript").lower()
+    logger.info(f"youtube_transcript called with YOUTUBE_MODE={mode}")
+    return youtube_analyze(url, mode=mode)