feat: add transcript caching and upgrade synthesis model
- Add save_transcript_to_cache() to save transcripts for debugging
- Upgrade LLM: Qwen 2.5 → openai/gpt-oss-120b (Scaleway)
- Document HF provider suffix behavior (auto-routing is bad practice)
- Model iteration: Qwen 2.5 → Llama 3.3 → gpt-oss-120b
Co-Authored-By: Claude <noreply@anthropic.com>
- CHANGELOG.md +249 -69
- src/agent/llm_client.py +3 -2
- src/tools/youtube.py +37 -0
CHANGELOG.md
# Session Changelog
## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription

**Problem:** Questions #3 and #5 (YouTube videos) failed because the vision tool cannot process YouTube URLs.

**Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.

**Modified Files:**

- **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
- **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
- **src/tools/__init__.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
- **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
- **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation

**Key Technical Decisions:**

- **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
- **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
- **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
- **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
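The primary/fallback decision above can be sketched as a small dispatcher. This is a minimal sketch with the two backends injected for clarity; the function names and return shape are assumptions, not the actual src/tools/youtube.py API:

```python
from typing import Callable, Optional

def get_video_text(video_id: str,
                   fetch_transcript: Callable[[str], Optional[str]],
                   transcribe_audio: Callable[[str], str]) -> tuple:
    """Try the fast transcript API first; fall back to yt-dlp + Whisper."""
    try:
        text = fetch_transcript(video_id)  # youtube-transcript-api path (1-3 s)
        if text:
            return text, "api"
    except Exception:
        pass  # no captions, disabled transcripts, or API error
    return transcribe_audio(video_id), "whisper"  # slow path (30 s - 2 min)
```

Injecting the backends keeps the routing logic trivially testable without network access.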

**Expected Impact:**

- Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
- Score: 10% → 40% (2/20 → 4/20 correct)
- **Target achieved:** Exceeds 30% requirement (6/20)

**Next Steps:**

- Test on question #3 (bird species)
- Run full evaluation
- If successful, implement Phase 2 (MP3 audio support)

---

## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable

### Fixed (Course API Contract - DO NOT CHANGE)

| Aspect | Value | Cannot Change |
| --- | --- | --- |
| **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
| **Questions Route** | `GET /questions` | ❌ |
| **Submit Route** | `POST /submit` | ❌ |
| **Number of Questions** | **20** (always 20) | ❌ |
| **Question Source** | GAIA validation set, level 1 | ❌ |
| **Randomness** | **NO - Fixed set** | ❌ |
| **Difficulty** | All level 1 (easiest) | ❌ |
| **Filter Criteria** | By tools/steps complexity | ❌ |
| **Scoring** | EXACT MATCH | ❌ |
| **Target Score** | 30% = 6/20 correct | ❌ |

### The 20 Questions (ALWAYS the Same)

| # | Full Task ID | Description | Tools Required |
| --- | --- | --- | --- |
| 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
| 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
| 3 | | | |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | | |
| 8 | | | |
| 9 | | | |
| 10 | | | |
| 11 | | | |
| 12 | | | |
| 13 | | | |
| 14 | | | |
| 15 | | | |
| 16 | | | |
| 17 | | | |
| 18 | | | |
| 19 | | | |
| 20 | | | |

**NOT random** - same 20 questions every submission!

### Our Additions (SAFE to Modify)

| Feature | Purpose | Status |
| --- | --- | --- |
| Question Limit | | |
| Target Task IDs | | |
| ThreadPoolExecutor | Speed: concurrent | |
| System Error Field | UX: error tracking | |
| File Download (HF) | Feature: support files | ✅ Optional |

### Key Learnings

---

**Purpose:** Compare current work with the original template to understand changes and avoid breaking the template structure.

**Process:**

1. Cloned the original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
2. Removed git-specific files (`.git/` folder, `.gitattributes`)
3. Copied to the project as `_template_original/` (static reference, no git)
4. Cleaned up the temporary clone from Downloads

**Why Static Reference:**

- No `.git/` folder → won't interfere with the project's git
- No `.gitattributes` → clean file comparison
- Pure reference material for diff/comparison
- Can see exactly what changed from the original

**Template Original Contents:**

- `app.py` (8777 bytes - original)
- `README.md` (400 bytes - original)
- `requirements.txt` (15 bytes - original)

**Comparison Commands:**

```bash
# Compare file sizes
ls -lh _template_original/app.py app.py

# Compare line counts
wc -l app.py _template_original/app.py
```

**Created Files:**

---

**Context:** User wanted to compare current work with the original template. Needed to rename the current Space to free up the `Final_Assignment_Template` name.

**Actions Taken:**

1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
2. Updated the local git remote to point to the new URL
3. Committed all of today's changes (system error field, calculator fix, target task IDs, docs)
5. Pushed commits to the renamed Space: `c86df49..41ac444`

**Key Learnings:**

- Local folder name ≠ git repo identity (can rename locally without affecting the remote)
- Git remote URL determines the push destination (updated to `agentbee`)
- HuggingFace Space name is independent of the local folder name
- All work preserved through the rename process

**Current State:**

- Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
- Remote: `mangubee/agentbee` (renamed on HuggingFace)
- Sync: ✅ All changes pushed

---

**Root Cause:** Template code includes the course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between the course leaderboard and the official GAIA leaderboard.

**Solution:** Created `docs/gaia_submission_guide.md` documenting:

- **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
- **Official GAIA Leaderboard** (future): 450+ questions, different submission format
- API routes, submission formats, scoring differences

| Submission | JSON POST | File upload |

**Created Files:**

- **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards

**Modified Files:**

- **README.md** - Added a note linking to the submission guide

---

**Solution:** Added a "Target Task IDs (Debug)" field in the Full Evaluation tab. Enter comma-separated task IDs to run only those questions.

**Implementation:**

- Added `eval_task_ids` textbox in the UI (lines 763-770)
- Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
- Filtering logic: parses comma-separated IDs, filters `questions_data`
- Overrides `question_limit` when provided

**Usage:**

```
Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
```
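The parsing and filtering step might look like this. This is a hypothetical helper (the real logic lives inline in `run_and_submit_all()`, and the question schema with a `task_id` key is assumed):

```python
def filter_questions(questions_data: list, task_ids: str = "") -> list:
    """Keep only questions whose task_id appears in a comma-separated ID string."""
    wanted = {t.strip() for t in task_ids.split(",") if t.strip()}
    if not wanted:
        return questions_data  # empty field: run everything (subject to question_limit)
    return [q for q in questions_data if q.get("task_id") in wanted]
```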

**Modified Files:**

- **app.py** (~30 lines added)
  - UI: `eval_task_ids` textbox
  - `run_and_submit_all()`: Added `task_ids` parameter, filtering logic

---

**Solution:** Made timeout protection optional - catch ValueError/AttributeError and disable the timeout with a warning when not running in the main thread. SafeEvaluator keeps its other protections (whitelisted operations, number size limits).

**Modified Files:**

- **src/tools/calculator.py** (~15 lines modified)
  - `timeout()` context manager: try/except around `signal.alarm()` failure
  - Logs a warning when timeout protection is disabled
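The pattern looks roughly like this. A minimal sketch, assuming the context-manager shape described above; the actual calculator.py code may differ:

```python
import contextlib
import logging
import signal

@contextlib.contextmanager
def timeout(seconds: int):
    """Abort evaluation after `seconds`; degrade gracefully off the main thread."""
    def _handler(signum, frame):
        raise TimeoutError(f"Evaluation exceeded {seconds}s")

    armed = False
    try:
        old_handler = signal.signal(signal.SIGALRM, _handler)
        signal.alarm(seconds)
        armed = True
    except (ValueError, AttributeError):
        # ValueError: signal.signal() only works in the main thread;
        # AttributeError: SIGALRM does not exist on Windows
        logging.warning("Timeout protection disabled (not in main thread)")
    try:
        yield
    finally:
        if armed:
            signal.alarm(0)  # cancel the pending alarm
            signal.signal(signal.SIGALRM, old_handler)
```

Inside a Gradio worker thread the `signal.signal()` call raises ValueError, the handler is never armed, and evaluation proceeds without the alarm.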

---

**Problem:** "Unable to answer" output was ambiguous - unclear whether it was a technical failure or an AI response. User requested a simpler distinction: system error vs AI answer.

**Solution:** Changed to a boolean `system_error: yes/no` field:

- `system_error: yes` - Technical/system error from our code (don't submit)
- `system_error: no` - AI response (submit the answer, even if wrong)
- Added an `error_log` field with full error details for system errors

**Implementation:**

- `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
- Results table: "System Error" column (yes/no), "Error Log" column (when yes)
- JSON export: `system_error` field, `error_log` field (when system error)
- Submission logic: only submit when `system_error == "no"`
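The classifier might look like this. A sketch only: the tuple shape matches the description above, but the error-marker strings are assumptions, not the actual app.py logic:

```python
def a_determine_status(answer: str):
    """Classify agent output as (is_error, error_log). Marker strings are assumed."""
    error_markers = ("Traceback (most recent call last)", "AGENT ERROR:", "[SYSTEM ERROR]")
    for marker in error_markers:
        if marker in answer:
            return True, answer   # system error: keep full text for the Error Log column
    return False, None            # normal AI answer: submit it, even if wrong
```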

**Modified Files:**

- **app.py** (~30 lines modified)
  - `a_determine_status()`: Returns a tuple instead of a string
  - `process_single_question()`: Uses the new format, adds `error_log`

---

**Problem:** The fallback mechanism was archived in `src/agent/llm_client.py`, but the UI checkboxes remained in app.py.

**Solution:** Removed all fallback-related UI elements:

- Removed `enable_fallback_checkbox` from the Test Question tab
- Removed `eval_enable_fallback_checkbox` from the Full Evaluation tab
- Removed the `enable_fallback` parameter from `test_single_question()`
- Simplified the provider info display (no longer shows "Fallback: Enabled/Disabled")

**Modified Files:**

- **app.py** (~20 lines removed)
  - Test Question tab: Removed `enable_fallback_checkbox` (lines 664-668)
  - Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (lines 710-714)

---

## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived

**Problem:** The fallback mechanism (`ENABLE_LLM_FALLBACK`) created double work:

- 4 providers to test for each feature
- Complex debugging with multiple code paths
- Longer, less clear error messages
- Added complexity without clear benefit

**Solution:** Archived the fallback mechanism; use a single provider only

- Removed the fallback provider loop (Gemini → HF → Groq → Claude)
- Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
- If the provider fails, the error is raised immediately
- Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`

**Benefits:**

- ✅ Reduced code complexity
- ✅ Faster debugging (one code path)
- ✅ Clearer error messages
- ✅ No double work on features

**Modified Files:**

- **src/agent/llm_client.py** (~25 lines removed)
  - Simplified `_call_with_fallback()`: Removed fallback logic
- **dev/dev_260112_02_fallback_archived.md** (NEW)

---

**Problem:** Score dropped from 5% → 0% after the first evidence fix. Evidence showed the dict's string representation: `{'results': [{'title': '...', ...}]`

**Root Cause:** The first fix only handled dicts with an `"answer"` key (vision tools). Search tools return a different dict structure with a `"results"` key:

```python
{"results": [...], "source": "tavily", "query": "...", "count": N}
```

**Solution:** Handle both dict formats in evidence extraction:

```python
if isinstance(result, dict):
    if "answer" in result:
```
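A combined handler covering both shapes might look like this. A sketch, not the actual graph.py code: the `title`/`content` keys for search results are assumptions based on the Tavily payload shown above:

```python
def format_evidence(tool_name: str, result) -> str:
    """Normalize tool output into an evidence string for the synthesis LLM."""
    if isinstance(result, dict):
        if "answer" in result:          # vision tools: {'answer': ..., 'model': ...}
            return f"[{tool_name}] {result['answer']}"
        if "results" in result:         # search tools: {'results': [...], 'source': ...}
            items = [f"- {r.get('title', '')}: {r.get('content', '')}"
                     for r in result["results"]]
            return f"[{tool_name}]\n" + "\n".join(items)
    return f"[{tool_name}] {result}"    # plain strings pass through unchanged
```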

**Modified Files:**

- **src/agent/graph.py** (~40 lines modified)
  - Updated evidence extraction in the primary path
  - Updated evidence extraction in the fallback path

**Test Result:** Evidence now formatted correctly. Search quality is still variable (the LLM sometimes picks the wrong info).

**Summary of Fixes (Session 2026-01-12):**

1. ✅ File download from HF dataset (5/5 files)
2. ✅ Absolute paths from script location
3. ✅ Evidence formatting for vision tools (dict → answer)

---

**Root Cause:** The vision tool returns a dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`, but `execute_node` was converting it to a string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.

**Solution:** Extract the 'answer' field from dict results before adding to evidence:

```python
# Before
evidence.append(f"[{tool_name}] {result}")  # Dict → string representation
```

**Modified Files:**

- **src/agent/graph.py** (~15 lines modified)
  - Updated `execute_node()`: Extract 'answer' from dict results
  - Fixed both primary and fallback execution paths

---

**Root Cause:** `download_task_file()` returned a relative path (`_cache/gaia_files/xxx.png`). During Gradio execution the working directory may differ, causing the `Path(image_path).exists()` check in the vision tool to fail.

**Solution:** Return absolute paths from `download_task_file()`

- Changed: `target_path = os.path.join(save_dir, file_name)`
- To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
- Now tools can find files regardless of the working directory
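The one-line change, shown as a tiny standalone helper (a hypothetical wrapper; app.py does this inline):

```python
import os

def resolve_download_path(save_dir: str, file_name: str) -> str:
    """Build an absolute target path so tools resolve files regardless of CWD."""
    return os.path.abspath(os.path.join(save_dir, file_name))
```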

**Modified Files:**

- **app.py** (~3 lines modified)
  - Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`

---

**Problem:** Attempted to use the evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because the files are not hosted on the evaluation server.

**Investigation:**

- Checked the API spec: the endpoint exists with proper documentation
- Tested the download: HTTP 404 "No file path associated with task_id"
- Verified the HF Space: only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files

**Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host the actual files. Files are hosted separately in the GAIA dataset.

**Solution:** Switch from the evaluation API to GAIA dataset download

- Use `huggingface_hub.hf_hub_download()` to fetch files
- Download to `_cache/gaia_files/` (runtime cache)
- File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
- Added cache checking (reuse downloaded files)

**Files with attachments (5/20 questions):**

- `cca530fc`: Chess position image (.png)
- `99c9cc74`: Pie recipe audio (.mp3)
- `f918266a`: Python code (.py)
- `7bd855d8`: Menu sales Excel (.xlsx)

**Modified Files:**

- **app.py** (~70 lines modified)
  - Updated `download_task_file()`: Changed from evaluation API to HF dataset download
  - Changed signature: `download_task_file(task_id, file_name, save_dir)`
  - Updated `process_single_question()`: Pass `file_name` to the download function

**Known Limitations:**

- The current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
- `.mp3` audio files still unsupported
- `.py` code execution still unsupported

**Next Steps:**

1. Test the new download implementation
2. Expand tool support for .mp3 (audio transcription)
3. Expand tool support for .py (code execution)

---

## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision Validated - Ready for GAIA

**Problem:** Need to validate that HF vision works before the complex GAIA evaluation.

**Solution:** Single smoke test with a simple red square image.

**Result:** ✅ PASSED

- Model: `google/gemma-3-27b-it:scaleway`
- Answer: "The image is a solid, uniform field of red color..."
- Provider routing: Working correctly
- Settings integration: Fixed

**Modified Files:**

- **src/config/settings.py** (~5 lines added)
  - Added `HF_TOKEN` and `HF_VISION_MODEL` config
  - Added `hf_token` and `hf_vision_model` to the Settings class
- Tests basic image description

**Bug Fixes:**

- Removed the unsupported `timeout` parameter from `chat_completion()`

**Next Steps:** Phase 3 - GAIA evaluation with HF vision

---

**Problem:** The vision tool was hardcoded to Gemini → Claude, ignoring the UI LLM selection.

**Solution:**

- Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
- Fixed `analyze_image()` routing to respect the `LLM_PROVIDER` environment variable
- Each provider fails independently (NO fallback chains during testing)

**Modified Files:**

- **src/tools/vision.py** (~120 lines added/modified)
  - Added `analyze_image_hf()` function with retry logic
  - Updated `analyze_image()` routing with provider selection

**Validated Models (Phase 0 Extended Testing):**

| Rank | Model |
| --- | --- |
| 1 | |
| 2 | |
| 3 | |
| 4 | |

**Failed Models (not vision-capable):**

- `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
- `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
- `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")

**Test Results:**

**Working Models:**

- `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
- `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
- `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand

**Failed Models:**

- `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
- `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
- `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
- `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request

**Output Files:**

- `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
- `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
- `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test

**Critical Discovery - Large Image Handling:**

| Model | |
| --- | --- |
| aya-vision-32b | |
| Qwen3-VL-8B-Instruct | ✅ 1-3s |
| ERNIE-4.5-VL-424B | |

**API Behavior:**

**Solution - Plan Corrections Applied:**

1. **Added Phase 0: API Validation (CRITICAL)**
   - Test the HF Inference API with vision models BEFORE implementation
   - Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
   - Decision gate: only proceed if ≥1 model works, otherwise pivot to backup options
   - Time saved: prevents 2-3 hours implementing non-functional code

2. **Removed Fallback Logic from Testing**
   - Each provider fails independently with a clear error message
   - NO fallback chains (HF → Gemini → Claude) during testing
   - Philosophy: build capability knowledge, don't hide problems
   - Log exact failure reasons for debugging

3. **Added Smoke Tests (Phase 2)**
   - 4 tests before GAIA: description, OCR, counting, single GAIA question
   - Decision gate: ≥3/4 must pass before full evaluation
   - Prevents debugging chess positions when basic integration is broken

4. **Added Decision Gates**
   - Gate 1 (Phase 0): API validation → GO/NO-GO
   - Gate 2 (Phase 2): Smoke tests → GO/NO-GO
   - Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate

5. **Added Backup Strategy Documentation**
   - Option C: HF Spaces deployment (custom endpoint)
   - Option D: Local transformers library (no API)
   - Option E: Hybrid (HF text + Gemini/Claude vision)

**Key Changes Summary:**

| Before | After |
| --- | --- |
| Jump to implementation | |
| Fallback chains | |
| Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
| Direct to GAIA | |
| No backup plan | |
| Single success criteria | |

**Benefits:**

**Modified Files:**

- **app.py** (~10 lines modified)
  - Removed environment detection logic (`if os.getenv("SPACE_ID")`)
  - Changed: `exports/` → `_cache/`
  - Updated docstring: "All environments: Saves to
  - Updated comment: "Save to _cache/ folder (internal runtime storage, not accessible via HF UI)"
- **.gitignore** (~3 lines added)
  - Added `_cache/` to ignore list

---

## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation

**Discovery:** HuggingFace Provider Suffix Behavior - Auto-Routing is Bad Practice

**Finding:** Models WITHOUT a `:provider` suffix work via HF auto-routing, but this is unreliable.

**Test Result:**

```python
# Without provider - WORKS but uses HF default routing
HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # ✅ Works, but...
# Response: "Test successful."

# With explicit provider - RECOMMENDED
HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"  # ✅ Reliable
```

**Why Auto-Routing is Bad Practice:**

| Issue | Impact |
|-------|--------|
| **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together) |
| **Inconsistent latency** | 2s one run, 20s the next (different provider selected) |
| **No cost control** | Can't choose cheaper providers (Cerebras/Scaleway vs expensive ones) |
| **Debugging nightmare** | Can't reproduce issues when the provider is unknown |
| **Silent failures** | A provider might be down; HF retries with a different one |

**Best Practice: ALWAYS specify the provider**

```python
# BAD - Unreliable
HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"

# GOOD - Explicit, predictable
HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"
HF_MODEL = "Qwen/Qwen2.5-72B-Instruct:cerebras"
HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
```

**Available Providers for Text Models:**

- `:scaleway` - Fast, reliable (recommended for Llama)
- `:cerebras` - Very fast (recommended for Qwen)
- `:novita` - Fast, reputable
- `:together` - Reliable
- `:sambanova` - Fast but expensive

**Action Taken:** Updated code to always use an explicit `:provider` suffix
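One way to enforce the convention is a startup guard. This is a hypothetical helper, not part of the codebase:

```python
def require_explicit_provider(model_id: str) -> str:
    """Reject HF model ids that would fall back to auto-routing."""
    repo, sep, provider = model_id.partition(":")
    if not (sep and provider):
        raise ValueError(
            f"'{model_id}' has no ':provider' suffix - auto-routing picks "
            "an unpredictable provider; append e.g. ':scaleway'"
        )
    return model_id
```

Calling it once when settings are loaded fails fast instead of silently routing to whichever provider HF selects.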

---

## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration

**Model Changes:**

1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
2. Llama 3.3 70B (Scaleway) → Failed synthesis
3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing

**openai/gpt-oss-120b:**

- OpenAI's 120B-parameter open-weight model
- Strong reasoning capability
- Optimized for function calling and tool use

---

## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Investigation (Original)

**Problem:** Qwen 2.5 72B fails synthesis despite having complete transcript evidence (738 chars).

**Root Cause Analysis:**

- The transcript contains all 3 species: "giant petrel", "emperor", "adelie" (Whisper error: "deli")
- Qwen 2.5 cannot resolve transcription errors ("deli" → "adelie penguin")
- Qwen 2.5 is weak at entity extraction + counting from noisy text
- Returns "Unable to answer" instead of reasoning through the ambiguity

**Transcript Quality Assessment:**

- **NOT clear enough for the current LLM** - requires:
  1. Error tolerance ("deli" → "adelie")
  2. World knowledge (Antarctic bird species)
  3. Entity extraction from narrative text
  4. Temporal reasoning ("simultaneously" = same scene)

**Answer from transcript:** 3 species (giant petrel, emperor penguin, adelie penguin)

**Solution:** Upgrade to Llama 3.3 70B Instruct (Scaleway provider)

- Better reasoning and instruction following
- Stronger entity extraction from noisy context
- Better at handling transcription ambiguities

**Modified Files:**

- **src/agent/llm_client.py** (line 37) - Model: Qwen 2.5 → Llama 3.3 70B

---

## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Transcript Caching for Debugging

**Problem:** Transcription works (738 chars from Whisper) but the LLM returns "Unable to answer". Need to inspect the raw transcript to debug the synthesis failure.

**Solution:** Added a `save_transcript_to_cache()` function to save transcripts to `_cache/{video_id}_transcript.txt` for both the API and Whisper paths.

**Modified Files:**

- **src/tools/youtube.py** (+30 lines)
  - Added `save_transcript_to_cache()` function (lines 55-79)
  - Called after successful API transcript retrieval (line 164)
  - Called after successful Whisper transcription (line 317)
  - File format includes metadata: video_id, source, length, timestamp

**File Format:**

```
# YouTube Transcript
# Video ID: L1vXCYZAYYM
# Source: whisper
# Length: 738 characters
# Generated: 2026-01-13T02:27:...

<transcript text>
```
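The helper might look like this, matching the file format above. A minimal sketch: the exact signature in src/tools/youtube.py is assumed:

```python
from datetime import datetime
from pathlib import Path

def save_transcript_to_cache(video_id: str, transcript: str,
                             source: str, cache_dir: str = "_cache") -> Path:
    """Write the transcript with a metadata header to _cache/{video_id}_transcript.txt."""
    path = Path(cache_dir) / f"{video_id}_transcript.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    header = (
        "# YouTube Transcript\n"
        f"# Video ID: {video_id}\n"
        f"# Source: {source}\n"
        f"# Length: {len(transcript)} characters\n"
        f"# Generated: {datetime.now().isoformat()}\n\n"
    )
    path.write_text(header + transcript, encoding="utf-8")
    return path
```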

**Next Steps:**

- Test on question #3 (bird species) - inspect the cached transcript
- Debug the LLM synthesis failure if the transcript contains the correct answer

---
|
| 131 |
+
|
| 132 |
## [2026-01-13] [Stage 1: YouTube Support] [COMPLETED] Phase 1 - YouTube Transcript + Whisper Audio Transcription
|
| 133 |
|
| 134 |
**Problem:** Questions #3 and #5 (YouTube videos) failed because vision tool cannot process YouTube URLs.
|
|
|
|
| 136 |
**Solution:** Implemented YouTube transcript extraction with Whisper audio fallback.
|
| 137 |
|
| 138 |
**Modified Files:**
|
| 139 |
+
|
| 140 |
- **src/tools/audio.py** (200 lines) - New: Whisper transcription with @spaces.GPU decorator for ZeroGPU acceleration
|
| 141 |
- **src/tools/youtube.py** (370 lines) - New: YouTube transcript extraction (youtube-transcript-api) with Whisper fallback
|
| 142 |
+
- **src/tools/**init**.py** (~30 lines) - Registered youtube_transcript and transcribe_audio tools
|
| 143 |
- **requirements.txt** (+4 lines) - Added youtube-transcript-api, openai-whisper, yt-dlp
|
| 144 |
- **brainstorming_phase1_youtube.md** (+120 lines) - Documented ZeroGPU requirement, industry validation
|
| 145 |
|
| 146 |
**Key Technical Decisions:**
|
| 147 |
+
|
| 148 |
- **Primary method:** youtube-transcript-api (instant, 1-3 seconds, 92% success rate)
|
| 149 |
- **Fallback method:** yt-dlp audio extraction + Whisper transcription (30s-2min)
|
| 150 |
- **ZeroGPU setup:** @spaces.GPU decorator required for HF Spaces (prevents "No @spaces.GPU function detected" error)
|
|
|
|
| 152 |
- **Unified architecture:** Single `transcribe_audio()` function for Phase 1 (YouTube fallback) and Phase 2 (MP3 files)
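The primary/fallback decision above can be sketched as a single function; the fetcher arguments stand in for the real youtube-transcript-api and yt-dlp + Whisper paths in `src/tools/youtube.py`:

```python
# Illustrative sketch of the two-tier strategy; the bodies are stand-ins,
# not the real implementations.
def fetch_transcript(video_id: str, api_fetch, whisper_fetch) -> dict:
    """Try the fast transcript API first; fall back to Whisper on any failure."""
    try:
        return {"text": api_fetch(video_id), "source": "api"}
    except Exception:
        # Slow path: yt-dlp audio download + Whisper (30s-2min per the entry above)
        return {"text": whisper_fetch(video_id), "source": "whisper"}

# Simulated providers: the API fails, Whisper succeeds.
def failing_api(video_id):
    raise RuntimeError("no captions available")

result = fetch_transcript("L1vXCYZAYYM", failing_api, lambda vid: "spoken text")
ok = fetch_transcript("abc", lambda vid: "caption text", failing_api)
```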
|
| 153 |
|
| 154 |
**Expected Impact:**
|
| 155 |
+
|
| 156 |
- Questions #3, #5: Should now be solvable (transcript provides dialogue/species info)
|
| 157 |
- Score: 10% → 20% (2/20 → 4/20 correct)
|
| 158 |
- **Target:** 30% requirement (6/20); two more correct answers still needed
|
| 159 |
|
|
| 160 |
---
|
| 161 |
|
| 162 |
## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
|
|
|
|
| 167 |
|
| 168 |
### Fixed (Course API Contract - DO NOT CHANGE)
|
| 169 |
|
| 170 |
+
| Aspect | Value | Cannot Change |
|
| 171 |
+
| ----------------------- | -------------------------------------- | ------------- |
|
| 172 |
+
| **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
|
| 173 |
+
| **Questions Route** | `GET /questions` | ❌ |
|
| 174 |
+
| **Submit Route** | `POST /submit` | ❌ |
|
| 175 |
+
| **Number of Questions** | **20** (always 20) | ❌ |
|
| 176 |
+
| **Question Source** | GAIA validation set, level 1 | ❌ |
|
| 177 |
+
| **Randomness** | **NO - Fixed set** | ❌ |
|
| 178 |
+
| **Difficulty** | All level 1 (easiest) | ❌ |
|
| 179 |
+
| **Filter Criteria** | By tools/steps complexity | ❌ |
|
| 180 |
+
| **Scoring** | EXACT MATCH | ❌ |
|
| 181 |
+
| **Target Score** | 30% = 6/20 correct | ❌ |
|
| 182 |
|
| 183 |
### The 20 Questions (ALWAYS the Same)
|
| 184 |
|
| 185 |
+
| # | Full Task ID | Description | Tools Required |
|
| 186 |
+
| --- | -------------------------------------- | ------------------------------ | ---------------- |
|
| 187 |
+
| 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
|
| 188 |
+
| 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
|
| 189 |
+
| 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
|
| 190 |
+
| 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
|
| 191 |
+
| 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
|
| 192 |
+
| 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
|
| 193 |
+
| 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
|
| 194 |
+
| 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
|
| 195 |
+
| 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
|
| 196 |
+
| 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
|
| 197 |
+
| 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
|
| 198 |
+
| 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
|
| 199 |
+
| 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
|
| 200 |
+
| 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
|
| 201 |
+
| 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
|
| 202 |
+
| 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
|
| 203 |
+
| 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
|
| 204 |
+
| 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
|
| 205 |
+
| 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
|
| 206 |
+
| 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |
|
| 207 |
|
| 208 |
**NOT random** - same 20 questions every submission!
|
| 209 |
|
|
|
|
| 223 |
|
| 224 |
### Our Additions (SAFE to Modify)
|
| 225 |
|
| 226 |
+
| Feature | Purpose | Required? |
|
| 227 |
+
| ------------------ | ---------------------- | ----------- |
|
| 228 |
+
| Question Limit | Debug: run first N | ✅ Optional |
|
| 229 |
+
| Target Task IDs | Debug: run specific | ✅ Optional |
|
| 230 |
+
| ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
|
| 231 |
+
| System Error Field | UX: error tracking | ✅ Optional |
|
| 232 |
| File Download (HF) | Feature: support files | ✅ Optional |
|
| 233 |
|
| 234 |
### Key Learnings
|
|
|
|
| 248 |
**Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.
|
| 249 |
|
| 250 |
**Process:**
|
| 251 |
+
|
| 252 |
1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
|
| 253 |
2. Removed git-specific files (`.git/` folder, `.gitattributes`)
|
| 254 |
3. Copied to project as `_template_original/` (static reference, no git)
|
| 255 |
4. Cleaned up temporary clone from Downloads
|
| 256 |
|
| 257 |
**Why Static Reference:**
|
| 258 |
+
|
| 259 |
- No `.git/` folder → won't interfere with project's git
|
| 260 |
- No `.gitattributes` → clean file comparison
|
| 261 |
- Pure reference material for diff/comparison
|
| 262 |
- Can see exactly what changed from original
|
| 263 |
|
| 264 |
**Template Original Contents:**
|
| 265 |
+
|
| 266 |
- `app.py` (8777 bytes - original)
|
| 267 |
- `README.md` (400 bytes - original)
|
| 268 |
- `requirements.txt` (15 bytes - original)
|
| 269 |
|
| 270 |
**Comparison Commands:**
|
| 271 |
+
|
| 272 |
```bash
|
| 273 |
# Compare file sizes
|
| 274 |
ls -lh _template_original/app.py app.py
|
|
|
|
| 281 |
```
|
| 282 |
|
| 283 |
**Created Files:**
|
| 284 |
+
|
| 285 |
+
- **\_template_original/** (NEW) - Static reference to original template (3 files)
|
| 286 |
|
| 287 |
---
|
| 288 |
|
|
|
|
| 291 |
**Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.
|
| 292 |
|
| 293 |
**Actions Taken:**
|
| 294 |
+
|
| 295 |
1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
|
| 296 |
2. Updated local git remote to point to new URL
|
| 297 |
3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)
|
|
|
|
| 299 |
5. Pushed commits to renamed Space: `c86df49..41ac444`
|
| 300 |
|
| 301 |
**Key Learnings:**
|
| 302 |
+
|
| 303 |
- Local folder name ≠ git repo identity (can rename locally without affecting remote)
|
| 304 |
- Git remote URL determines push destination (updated to `agentbee`)
|
| 305 |
- HuggingFace Space name is independent of local folder name
|
| 306 |
- All work preserved through rename process
|
| 307 |
|
| 308 |
**Current State:**
|
| 309 |
+
|
| 310 |
- Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
|
| 311 |
- Remote: `mangubee/agentbee` (renamed on HuggingFace)
|
| 312 |
- Sync: ✅ All changes pushed
|
|
|
|
| 322 |
**Root Cause:** Template code includes course API (`agents-course-unit4-scoring.hf.space`), but documentation didn't clarify the distinction between course leaderboard and official GAIA leaderboard.
|
| 323 |
|
| 324 |
**Solution:** Created `docs/gaia_submission_guide.md` documenting:
|
| 325 |
+
|
| 326 |
- **Course Leaderboard** (current): 20 questions, 30% target, course-specific API
|
| 327 |
- **Official GAIA Leaderboard** (future): 450+ questions, different submission format
|
| 328 |
- API routes, submission formats, scoring differences
|
|
|
|
| 338 |
| Submission | JSON POST | File upload |
|
| 339 |
|
| 340 |
**Created Files:**
|
| 341 |
+
|
| 342 |
- **docs/gaia_submission_guide.md** - Complete submission guide for both leaderboards
|
| 343 |
|
| 344 |
**Modified Files:**
|
| 345 |
+
|
| 346 |
- **README.md** - Added note linking to submission guide
|
| 347 |
|
| 348 |
---
|
|
|
|
| 354 |
**Solution:** Added "Target Task IDs (Debug)" field in Full Evaluation tab. Enter comma-separated task IDs to run only those questions.
|
| 355 |
|
| 356 |
**Implementation:**
|
| 357 |
+
|
| 358 |
- Added `eval_task_ids` textbox in UI (line 763-770)
|
| 359 |
- Updated `run_and_submit_all()` signature: `task_ids: str = ""` parameter
|
| 360 |
- Filtering logic: Parses comma-separated IDs, filters `questions_data`
|
|
|
|
| 362 |
- Overrides question_limit when provided
|
| 363 |
|
| 364 |
**Usage:**
|
| 365 |
+
|
| 366 |
```
|
| 367 |
Target Task IDs: 2d83110e-a098-4ebb-9987-066c06fa42d0, cca530fc-4052-43b2-b130-b30968d8aa44
|
| 368 |
```
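The parsing/filtering step can be sketched as follows; the function and key names are assumptions based on this entry, not the exact app.py code:

```python
# Sketch: parse the comma-separated "Target Task IDs" field and keep only
# the matching questions. An empty field runs everything.
def filter_questions(questions_data: list[dict], task_ids: str) -> list[dict]:
    wanted = {tid.strip() for tid in task_ids.split(",") if tid.strip()}
    if not wanted:
        return questions_data
    return [q for q in questions_data if q.get("task_id") in wanted]

questions = [{"task_id": "aaa"}, {"task_id": "bbb"}, {"task_id": "ccc"}]
subset = filter_questions(questions, " aaa, ccc ")
```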
|
| 369 |
|
| 370 |
**Modified Files:**
|
| 371 |
+
|
| 372 |
- **app.py** (~30 lines added)
|
| 373 |
- UI: `eval_task_ids` textbox
|
| 374 |
- `run_and_submit_all()`: Added `task_ids` parameter, filtering logic
|
|
|
|
| 385 |
**Solution:** Made timeout protection optional - catches ValueError/AttributeError and disables timeout with warning when not in main thread. SafeEvaluator still has other protections (whitelisted operations, number size limits).
|
| 386 |
|
| 387 |
**Modified Files:**
|
| 388 |
+
|
| 389 |
- **src/tools/calculator.py** (~15 lines modified)
|
| 390 |
- `timeout()` context manager: Try/except for signal.alarm() failure
|
| 391 |
- Logs warning when timeout protection disabled
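A sketch of the pattern described above, assuming a `signal.alarm()`-based guard (SIGALRM is Unix-only, and signal handlers can only be installed from the main thread, which is exactly the Gradio failure mode this fix addresses):

```python
import signal
from contextlib import contextmanager

@contextmanager
def timeout(seconds: int):
    """Arm a timeout if possible; degrade gracefully off the main thread."""
    def _handler(signum, frame):
        raise TimeoutError(f"calculation exceeded {seconds}s")
    try:
        signal.signal(signal.SIGALRM, _handler)
        signal.alarm(seconds)
        armed = True
    except (ValueError, AttributeError):
        armed = False  # not in the main thread, or no SIGALRM (Windows)
    try:
        yield armed
    finally:
        if armed:
            signal.alarm(0)  # always disarm

with timeout(5) as protected:
    result = 2 + 2
```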
|
|
|
|
| 398 |
**Problem:** "Unable to answer" output was ambiguous - unclear if technical failure or AI response. User requested simpler distinction: system error vs AI answer.
|
| 399 |
|
| 400 |
**Solution:** Changed to boolean `system_error: yes/no` field:
|
| 401 |
+
|
| 402 |
- `system_error: yes` - Technical/system error from our code (don't submit)
|
| 403 |
- `system_error: no` - AI response (submit answer, even if wrong)
|
| 404 |
- Added `error_log` field with full error details for system errors
|
| 405 |
|
| 406 |
**Implementation:**
|
| 407 |
+
|
| 408 |
- `a_determine_status()` returns `(is_error: bool, error_log: str | None)`
|
| 409 |
- Results table: "System Error" column (yes/no), "Error Log" column (when yes)
|
| 410 |
- JSON export: `system_error` field, `error_log` field (when system error)
|
| 411 |
- Submission logic: Only submit when `system_error == "no"`
|
| 412 |
|
| 413 |
**Modified Files:**
|
| 414 |
+
|
| 415 |
- **app.py** (~30 lines modified)
|
| 416 |
- `a_determine_status()`: Returns tuple instead of string
|
| 417 |
- `process_single_question()`: Uses new format, adds `error_log`
|
|
|
|
| 425 |
**Problem:** Fallback mechanism was archived in `src/agent/llm_client.py` but UI checkboxes remained in app.py
|
| 426 |
|
| 427 |
**Solution:** Removed all fallback-related UI elements:
|
| 428 |
+
|
| 429 |
- Removed `enable_fallback_checkbox` from Test Question tab
|
| 430 |
- Removed `eval_enable_fallback_checkbox` from Full Evaluation tab
|
| 431 |
- Removed `enable_fallback` parameter from `test_single_question()` function
|
|
|
|
| 434 |
- Simplified provider info display (no longer shows "Fallback: Enabled/Disabled")
|
| 435 |
|
| 436 |
**Modified Files:**
|
| 437 |
+
|
| 438 |
- **app.py** (~20 lines removed)
|
| 439 |
- Test Question tab: Removed `enable_fallback_checkbox` (line 664-668)
|
| 440 |
- Full Evaluation tab: Removed `eval_enable_fallback_checkbox` (line 710-714)
|
|
|
|
| 446 |
## [2026-01-12] [Refactoring] [COMPLETED] Fallback Mechanism Archived
|
| 447 |
|
| 448 |
**Problem:** Fallback mechanism (`ENABLE_LLM_FALLBACK`) creating double work:
|
| 449 |
+
|
| 450 |
- 4 providers to test for each feature
|
| 451 |
- Complex debugging with multiple code paths
|
| 452 |
- Longer, less clear error messages
|
| 453 |
- Adding complexity without clear benefit
|
| 454 |
|
| 455 |
**Solution:** Archive fallback mechanism, use single provider only
|
| 456 |
+
|
| 457 |
- Removed fallback provider loop (Gemini → HF → Groq → Claude)
|
| 458 |
- Simplified `_call_with_fallback()` from ~60 lines to ~35 lines
|
| 459 |
- If provider fails, error is raised immediately
|
| 460 |
- Original code preserved in git history and `dev/dev_260112_02_fallback_archived.md`
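The simplified single-provider behavior can be sketched as follows (provider clients are stand-ins; the real ones wrap Gemini/HF/Groq/Claude):

```python
# Sketch: one provider, one code path. If the provider fails, the exception
# propagates immediately -- no silent fallback to another provider.
def call_llm(provider: str, prompt: str, clients: dict) -> str:
    client = clients.get(provider)
    if client is None:
        raise KeyError(f"Unknown provider: {provider}")
    return client(prompt)

clients = {"hf": lambda p: f"hf-answer:{p}"}
answer = call_llm("hf", "2+2?", clients)
```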
|
| 461 |
|
| 462 |
**Benefits:**
|
| 463 |
+
|
| 464 |
- ✅ Reduced code complexity
|
| 465 |
- ✅ Faster debugging (one code path)
|
| 466 |
- ✅ Clearer error messages
|
| 467 |
- ✅ No double work on features
|
| 468 |
|
| 469 |
**Modified Files:**
|
| 470 |
+
|
| 471 |
- **src/agent/llm_client.py** (~25 lines removed)
|
| 472 |
- Simplified `_call_with_fallback()`: Removed fallback logic
|
| 473 |
- **dev/dev_260112_02_fallback_archived.md** (NEW)
|
|
|
|
| 481 |
**Problem:** Score dropped from 5% → 0% after first evidence fix. Evidence showing dict string representation: `{'results': [{'title': '...', ...}]`
|
| 482 |
|
| 483 |
**Root Cause:** First fix only handled dicts with `"answer"` key (vision tools). Search tools return different dict structure with `"results"` key:
|
| 484 |
+
|
| 485 |
```python
|
| 486 |
{"results": [...], "source": "tavily", "query": "...", "count": N}
|
| 487 |
```
|
| 488 |
|
| 489 |
**Solution:** Handle both dict formats in evidence extraction:
|
| 490 |
+
|
| 491 |
```python
|
| 492 |
if isinstance(result, dict):
|
| 493 |
if "answer" in result:
|
|
|
|
| 502 |
```
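A complete version of the dual-format handling excerpted above, as a sketch (the dict key names match the tool outputs quoted in this entry; the snippet-joining detail is an assumption):

```python
# Sketch: format one tool result as an evidence string, handling both the
# vision-tool shape ({"answer": ...}) and the search-tool shape ({"results": [...]}).
def extract_evidence(tool_name: str, result) -> str:
    if isinstance(result, dict):
        if "answer" in result:            # vision tools
            return f"[{tool_name}] {result['answer']}"
        if "results" in result:           # search tools
            titles = [r.get("title", "") for r in result["results"]]
            return f"[{tool_name}] " + "; ".join(t for t in titles if t)
    return f"[{tool_name}] {result}"      # plain string/other results

vision = extract_evidence("vision", {"answer": "a red square"})
search = extract_evidence("search", {"results": [{"title": "Mercedes Sosa"}]})
```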
|
| 503 |
|
| 504 |
**Modified Files:**
|
| 505 |
+
|
| 506 |
- **src/agent/graph.py** (~40 lines modified)
|
| 507 |
- Updated evidence extraction in primary path
|
| 508 |
- Updated evidence extraction in fallback path
|
|
|
|
| 510 |
**Test Result:** Evidence now formatted correctly. Search quality still variable (LLM sometimes picks wrong info).
|
| 511 |
|
| 512 |
**Summary of Fixes (Session 2026-01-12):**
|
| 513 |
+
|
| 514 |
1. ✅ File download from HF dataset (5/5 files)
|
| 515 |
2. ✅ Absolute paths from script location
|
| 516 |
3. ✅ Evidence formatting for vision tools (dict → answer)
|
|
|
|
| 525 |
**Root Cause:** Vision tool returns dict: `{'answer': '...', 'model': '...', 'image_path': '...'}`. But `execute_node` was converting this to string: `"[vision] {'answer': '...', ...}"`. The synthesize_answer LLM couldn't parse this format.
|
| 526 |
|
| 527 |
**Solution:** Extract 'answer' field from dict results before adding to evidence:
|
| 528 |
+
|
| 529 |
```python
|
| 530 |
# Before
|
| 531 |
evidence.append(f"[{tool_name}] {result}") # Dict → string representation
|
|
|
|
| 538 |
```
|
| 539 |
|
| 540 |
**Modified Files:**
|
| 541 |
+
|
| 542 |
- **src/agent/graph.py** (~15 lines modified)
|
| 543 |
- Updated `execute_node()`: Extract 'answer' from dict results
|
| 544 |
- Fixed both primary and fallback execution paths
|
|
|
|
| 556 |
**Root Cause:** `download_task_file()` returned relative path (`_cache/gaia_files/xxx.png`). During Gradio execution, working directory may differ, causing `Path(image_path).exists()` check in vision tool to fail.
|
| 557 |
|
| 558 |
**Solution:** Return absolute paths from `download_task_file()`
|
| 559 |
+
|
| 560 |
- Changed: `target_path = os.path.join(save_dir, file_name)`
|
| 561 |
- To: `target_path = os.path.abspath(os.path.join(save_dir, file_name))`
|
| 562 |
- Now tools can find files regardless of working directory
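The fix in isolation (file names here are illustrative):

```python
import os

# Relative paths break when the working directory changes mid-run;
# os.path.abspath() resolves the path once, at download time.
save_dir, file_name = "_cache/gaia_files", "cca530fc.png"
relative = os.path.join(save_dir, file_name)
absolute = os.path.abspath(os.path.join(save_dir, file_name))
```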
|
| 563 |
|
| 564 |
**Modified Files:**
|
| 565 |
+
|
| 566 |
- **app.py** (~3 lines modified)
|
| 567 |
- Updated `download_task_file()`: Return absolute paths using `os.path.abspath()`
|
| 568 |
|
|
|
|
| 575 |
**Problem:** Attempted to use evaluation API `/files/{task_id}` endpoint to download GAIA question files, but it returns 404 because files are not hosted on the evaluation server.
|
| 576 |
|
| 577 |
**Investigation:**
|
| 578 |
+
|
| 579 |
- Checked API spec: Endpoint exists with proper documentation
|
| 580 |
- Tested download: HTTP 404 "No file path associated with task_id"
|
| 581 |
- Verified HF Space: Only 5 files (Dockerfile, README, main.py, requirements.txt, .gitattributes) - NO data files
|
|
|
|
| 584 |
**Root Cause:** The evaluation API returns file metadata (`file_name`) but does NOT host actual files. Files are hosted separately in the GAIA dataset.
|
| 585 |
|
| 586 |
**Solution:** Switch from evaluation API to GAIA dataset download
|
| 587 |
+
|
| 588 |
- Use `huggingface_hub.hf_hub_download()` to fetch files
|
| 589 |
- Download to `_cache/gaia_files/` (runtime cache)
|
| 590 |
- File structure: `2023/validation/{task_id}.{ext}` or `2023/test/{task_id}.{ext}`
|
| 591 |
- Added cache checking (reuse downloaded files)
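The cache-then-download flow can be sketched with the fetch step injected; the real code calls `huggingface_hub.hf_hub_download()` against the GAIA dataset, while the helper below is illustrative and self-contained:

```python
import tempfile
from pathlib import Path

# Sketch: return an absolute path to the task file, downloading only on
# a cache miss. `fetch` stands in for the hf_hub_download() call.
def get_task_file(task_id: str, ext: str, save_dir: Path, fetch) -> Path:
    save_dir.mkdir(parents=True, exist_ok=True)
    target = (save_dir / f"{task_id}{ext}").resolve()  # absolute path
    if not target.exists():
        target.write_bytes(fetch(f"2023/validation/{task_id}{ext}"))
    return target

cache = Path(tempfile.mkdtemp())
calls = []
def fake_fetch(remote_path):
    calls.append(remote_path)
    return b"png-bytes"

p1 = get_task_file("cca530fc", ".png", cache, fake_fetch)
p2 = get_task_file("cca530fc", ".png", cache, fake_fetch)  # cache hit, no fetch
```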
|
| 592 |
|
| 593 |
**Files with attachments (5/20 questions):**
|
| 594 |
+
|
| 595 |
- `cca530fc`: Chess position image (.png)
|
| 596 |
- `99c9cc74`: Pie recipe audio (.mp3)
|
| 597 |
- `f918266a`: Python code (.py)
|
|
|
|
| 599 |
- `7bd855d8`: Menu sales Excel (.xlsx)
|
| 600 |
|
| 601 |
**Modified Files:**
|
| 602 |
+
|
| 603 |
- **app.py** (~70 lines modified)
|
| 604 |
- Updated `download_task_file()`: Changed from evaluation API to HF dataset download
|
| 605 |
- Changed signature: `download_task_file(task_id, file_name, save_dir)`
|
|
|
|
| 610 |
- Updated `process_single_question()`: Pass `file_name` to download function
|
| 611 |
|
| 612 |
**Known Limitations:**
|
| 613 |
+
|
| 614 |
- Current `parse_file` tool only supports: `.pdf, .xlsx, .xls, .docx, .txt, .csv`
|
| 615 |
- `.mp3` audio files still unsupported
|
| 616 |
- `.py` code execution still unsupported
|
| 617 |
|
| 618 |
**Next Steps:**
|
| 619 |
+
|
| 620 |
1. Test new download implementation
|
| 621 |
2. Expand tool support for .mp3 (audio transcription)
|
| 622 |
3. Expand tool support for .py (code execution)
|
| 623 |
|
| 624 |
---
|
| 625 |
|
| 626 |
+
## [2026-01-11] [Phase 2: Smoke Tests] [COMPLETED] HF Vision LLM Validated - Ready for GAIA
|
| 627 |
|
| 628 |
**Problem:** Need to validate HF vision works before complex GAIA evaluation.
|
| 629 |
|
| 630 |
**Solution:** Single smoke test with simple red square image.
|
| 631 |
|
| 632 |
**Result:** ✅ PASSED
|
| 633 |
+
|
| 634 |
- Model: `google/gemma-3-27b-it:scaleway`
|
| 635 |
- Answer: "The image is a solid, uniform field of red color..."
|
| 636 |
- Provider routing: Working correctly
|
| 637 |
- Settings integration: Fixed
|
| 638 |
|
| 639 |
**Modified Files:**
|
| 640 |
+
|
| 641 |
- **src/config/settings.py** (~5 lines added)
|
| 642 |
- Added `HF_TOKEN` and `HF_VISION_MODEL` config
|
| 643 |
- Added `hf_token` and `hf_vision_model` to Settings class
|
|
|
|
| 647 |
- Tests basic image description
|
| 648 |
|
| 649 |
**Bug Fixes:**
|
| 650 |
+
|
| 651 |
- Removed unsupported `timeout` parameter from `chat_completion()`
|
| 652 |
|
| 653 |
**Next Steps:** Phase 3 - GAIA evaluation with HF vision
|
|
|
|
| 659 |
**Problem:** Vision tool hardcoded to Gemini → Claude, ignoring UI LLM selection.
|
| 660 |
|
| 661 |
**Solution:**
|
| 662 |
+
|
| 663 |
- Added `analyze_image_hf()` function using `google/gemma-3-27b-it:scaleway` (fastest, ~6s)
|
| 664 |
- Fixed `analyze_image()` routing to respect `LLM_PROVIDER` environment variable
|
| 665 |
- Each provider fails independently (NO fallback chains during testing)
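The routing described above, sketched with stand-in backends (the real ones are `analyze_image_hf()` and the Gemini/Claude calls in `src/tools/vision.py`):

```python
import os

# Sketch: pick exactly one vision backend from LLM_PROVIDER and fail
# loudly if it is missing -- no fallback chain during testing.
def analyze_image(image_path: str, question: str, backends: dict):
    provider = os.getenv("LLM_PROVIDER", "hf").lower()
    if provider not in backends:
        raise ValueError(f"No vision backend for provider '{provider}'")
    return backends[provider](image_path, question)

backends = {"hf": lambda p, q: f"hf:{q}", "gemini": lambda p, q: f"gemini:{q}"}
os.environ["LLM_PROVIDER"] = "hf"
answer = analyze_image("board.png", "best move?", backends)
```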
|
| 666 |
|
| 667 |
**Modified Files:**
|
| 668 |
+
|
| 669 |
- **src/tools/vision.py** (~120 lines added/modified)
|
| 670 |
- Added `analyze_image_hf()` function with retry logic
|
| 671 |
- Updated `analyze_image()` routing with provider selection
|
|
|
|
| 675 |
|
| 676 |
**Validated Models (Phase 0 Extended Testing):**
|
| 677 |
|
| 678 |
+
| Rank | Model | Provider | Speed | Notes |
|
| 679 |
+
| ---- | -------------------------------- | -------- | ----- | ------------------------------ |
|
| 680 |
+
| 1 | `google/gemma-3-27b-it` | Scaleway | ~6s | **RECOMMENDED** - Google brand |
|
| 681 |
+
| 2 | `CohereLabs/aya-vision-32b` | Cohere | ~7s | Fast, less known brand |
|
| 682 |
+
| 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
|
| 683 |
+
| 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
|
| 684 |
|
| 685 |
**Failed Models (not vision-capable):**
|
| 686 |
+
|
| 687 |
- `zai-org/GLM-4.7:cerebras` - Text-only (422 error: "Content type 'image_url' not supported")
|
| 688 |
- `openai/gpt-oss-120b:novita` - Text-only (400 Bad request)
|
| 689 |
- `openai/gpt-oss-120b:groq` - Text-only (400: "content must be a string")
|
|
|
|
| 702 |
**Test Results:**
|
| 703 |
|
| 704 |
**Working Models:**
|
| 705 |
+
|
| 706 |
- `google/gemma-3-27b-it:scaleway` ✅ - ~6s, Google brand, **RECOMMENDED**
|
| 707 |
- `zai-org/GLM-4.6V-Flash:zai-org` ✅ - ~16s, Zhipu AI brand
|
| 708 |
- `Qwen/Qwen3-VL-30B-A3B-Instruct:novita` ✅ - ~14s, Qwen brand
|
| 709 |
|
| 710 |
**Failed Models:**
|
| 711 |
+
|
| 712 |
- `zai-org/GLM-4.7:cerebras` ❌ - Text-only model (422: "image_url not supported")
|
| 713 |
- `openai/gpt-oss-120b:novita` ❌ - Generic 400 Bad request
|
| 714 |
- `openai/gpt-oss-120b:groq` ❌ - Text-only (400: "content must be a string")
|
| 715 |
- `moonshotai/Kimi-K2-Instruct-0905:novita` ❌ - Generic 400 Bad request
|
| 716 |
|
| 717 |
**Output Files:**
|
| 718 |
+
|
| 719 |
- `output/phase0_vision_validation_20260111_162124.json` - 4 new models test
|
| 720 |
- `output/phase0_vision_validation_20260111_163647.json` - Groq provider test
|
| 721 |
- `output/phase0_vision_validation_20260111_164531.json` - GLM-4.6V test
|
|
|
|
| 775 |
|
| 776 |
**Critical Discovery - Large Image Handling:**
|
| 777 |
|
| 778 |
+
| Model | Small Image (1KB) | Large Image (2.8MB) | Recommendation |
|
| 779 |
+
| -------------------- | ----------------- | ------------------- | ---------------------------- |
|
| 780 |
+
| aya-vision-32b | ✅ 1-3s | ✅ ~10s | **Use for production** |
|
| 781 |
+
| Qwen3-VL-8B-Instruct | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
|
| 782 |
+
| ERNIE-4.5-VL-424B | ✅ 1-3s | ❌ >120s timeout | Use with image preprocessing |
|
| 783 |
|
| 784 |
**API Behavior:**
|
| 785 |
|
|
|
|
| 867 |
**Solution - Plan Corrections Applied:**
|
| 868 |
|
| 869 |
1. **Added Phase 0: API Validation (CRITICAL)**
|
| 870 |
+
|
| 871 |
- Test HF Inference API with vision models BEFORE implementation
|
| 872 |
- Model order: Phi-3.5 (3.8B) → Llama-3.2 (11B) → Qwen2-VL (72B)
|
| 873 |
- Decision gate: Only proceed if ≥1 model works, otherwise pivot to backup options
|
| 874 |
- Time saved: Prevents 2-3 hours implementing non-functional code
|
| 875 |
|
| 876 |
2. **Removed Fallback Logic from Testing**
|
| 877 |
+
|
| 878 |
- Each provider fails independently with clear error message
|
| 879 |
- NO fallback chains (HF → Gemini → Claude) during testing
|
| 880 |
- Philosophy: Build capability knowledge, don't hide problems
|
| 881 |
- Log exact failure reasons for debugging
|
| 882 |
|
| 883 |
3. **Added Smoke Tests (Phase 2)**
|
| 884 |
+
|
| 885 |
- 4 tests before GAIA: description, OCR, counting, single GAIA question
|
| 886 |
- Decision gate: ≥3/4 must pass before full evaluation
|
| 887 |
- Prevents debugging chess positions when basic integration broken
|
| 888 |
|
| 889 |
4. **Added Decision Gates**
|
| 890 |
+
|
| 891 |
- Gate 1 (Phase 0): API validation → GO/NO-GO
|
| 892 |
- Gate 2 (Phase 2): Smoke tests → GO/NO-GO
|
| 893 |
- Gate 3 (Phase 3): GAIA accuracy ≥20% → Continue or iterate
|
| 894 |
|
| 895 |
5. **Added Backup Strategy Documentation**
|
| 896 |
+
|
| 897 |
- Option C: HF Spaces deployment (custom endpoint)
|
| 898 |
- Option D: Local transformers library (no API)
|
| 899 |
- Option E: Hybrid (HF text + Gemini/Claude vision)
|
|
|
|
| 920 |
|
| 921 |
**Key Changes Summary:**
|
| 922 |
|
| 923 |
+
| Before | After |
|
| 924 |
+
| ----------------------------- | ----------------------------------- |
|
| 925 |
+
| Jump to implementation | Phase 0: Validate API first |
|
| 926 |
+
| Fallback chains | No fallbacks, fail independently |
|
| 927 |
+
| Large models first (Qwen2-VL) | Small models first (Phi-3.5) |
|
| 928 |
+
| Direct to GAIA | Smoke tests → GAIA |
|
| 929 |
+
| No backup plan | 3 backup options documented |
|
| 930 |
+
| Single success criteria | Per-phase criteria + decision gates |
|
| 931 |
|
| 932 |
**Benefits:**
|
| 933 |
|
|
|
|
| 1038 |
**Modified Files:**
|
| 1039 |
|
| 1040 |
- **app.py** (~10 lines modified)
|
| 1041 |
+
|
| 1042 |
- Removed environment detection logic (`if os.getenv("SPACE_ID")`)
|
| 1043 |
- Changed: `exports/` → `_cache/`
|
| 1044 |
+
- Updated docstring: "All environments: Saves to ./\_cache/gaia_results_TIMESTAMP.json"
|
| 1045 |
+
- Updated comment: "Save to \_cache/ folder (internal runtime storage, not accessible via HF UI)"
|
| 1046 |
|
| 1047 |
- **.gitignore** (~3 lines added)
|
| 1048 |
- Added `_cache/` to ignore list
|
src/agent/llm_client.py
CHANGED
|
@@ -34,8 +34,9 @@ CLAUDE_MODEL = "claude-sonnet-4-5-20250929"
|
|
| 34 |
GEMINI_MODEL = "gemini-2.0-flash-exp"
|
| 35 |
|
| 36 |
# HuggingFace Configuration
|
| 37 |
-
HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"
|
| 38 |
-
#
|
|
|
|
| 39 |
|
| 40 |
# Groq Configuration
|
| 41 |
GROQ_MODEL = "openai/gpt-oss-120b"
|
|
|
|
| 34 |
GEMINI_MODEL = "gemini-2.0-flash-exp"
|
| 35 |
|
| 36 |
# HuggingFace Configuration
|
| 37 |
+
HF_MODEL = "openai/gpt-oss-120b:scaleway" # OpenAI's 120B open source model, strong reasoning
|
| 38 |
+
# Previous: "meta-llama/Llama-3.3-70B-Instruct:scaleway" (failed synthesis)
|
| 39 |
+
# Previous: "Qwen/Qwen2.5-72B-Instruct" (weaker at handling transcription errors)
|
| 40 |
|
| 41 |
# Groq Configuration
|
| 42 |
GROQ_MODEL = "openai/gpt-oss-120b"
|
src/tools/youtube.py
CHANGED
|
@@ -48,6 +48,37 @@ CLEANUP_TEMP_FILES = True
|
|
| 48 |
logger = logging.getLogger(__name__)
|
| 49 |
|
| 50 |
|
|
|
| 51 |
# ============================================================================
|
| 52 |
# YouTube URL Parser
|
| 53 |
# ============================================================================
|
|
@@ -129,6 +160,9 @@ def get_youtube_transcript(video_id: str) -> Dict[str, Any]:
|
|
| 129 |
|
| 130 |
logger.info(f"Transcript fetched: {len(text)} characters")
|
| 131 |
|
|
| 132 |
return {
|
| 133 |
"text": text,
|
| 134 |
"video_id": video_id,
|
|
@@ -279,6 +313,9 @@ def transcribe_from_audio(video_url: str) -> Dict[str, Any]:
|
|
| 279 |
logger.warning(f"Failed to cleanup temp file: {e}")
|
| 280 |
|
| 281 |
if result["success"]:
|
|
|
| 282 |
return {
|
| 283 |
"text": result["text"],
|
| 284 |
"video_id": video_id,
|
|
|
|
| 48 |
logger = logging.getLogger(__name__)
|
| 49 |
|
| 50 |
|
| 51 |
+
# ============================================================================
|
| 52 |
+
# Transcript Cache
|
| 53 |
+
# ============================================================================
|
| 54 |
+
|
| 55 |
+
def save_transcript_to_cache(video_id: str, text: str, source: str) -> None:
|
| 56 |
+
"""
|
| 57 |
+
Save transcript to _cache folder for debugging.
|
| 58 |
+
|
| 59 |
+
Args:
|
| 60 |
+
video_id: YouTube video ID
|
| 61 |
+
text: Transcript text
|
| 62 |
+
source: "api" or "whisper"
|
| 63 |
+
"""
|
| 64 |
+
try:
|
| 65 |
+
cache_dir = Path("_cache")
|
| 66 |
+
cache_dir.mkdir(exist_ok=True)
|
| 67 |
+
|
| 68 |
+
cache_file = cache_dir / f"{video_id}_transcript.txt"
|
| 69 |
+
with open(cache_file, "w", encoding="utf-8") as f:
|
| 70 |
+
f.write(f"# YouTube Transcript\n")
|
| 71 |
+
f.write(f"# Video ID: {video_id}\n")
|
| 72 |
+
f.write(f"# Source: {source}\n")
|
| 73 |
+
f.write(f"# Length: {len(text)} characters\n")
|
| 74 |
+
f.write(f"# Generated: {__import__('datetime').datetime.now().isoformat()}\n")
|
| 75 |
+
f.write(f"\n{text}\n")
|
| 76 |
+
|
| 77 |
+
logger.info(f"Transcript saved to cache: {cache_file}")
|
| 78 |
+
except Exception as e:
|
| 79 |
+
logger.warning(f"Failed to save transcript to cache: {e}")
|
| 80 |
+
|
| 81 |
+
|
| 82 |
# ============================================================================
|
| 83 |
# YouTube URL Parser
|
| 84 |
# ============================================================================
|
|
|
|
| 160 |
|
| 161 |
logger.info(f"Transcript fetched: {len(text)} characters")
|
| 162 |
|
| 163 |
+
# Save to cache for debugging
|
| 164 |
+
save_transcript_to_cache(video_id, text, "api")
|
| 165 |
+
|
| 166 |
return {
|
| 167 |
"text": text,
|
| 168 |
"video_id": video_id,
|
|
|
|
| 313 |
logger.warning(f"Failed to cleanup temp file: {e}")
|
| 314 |
|
| 315 |
if result["success"]:
|
| 316 |
+
# Save to cache for debugging
|
| 317 |
+
save_transcript_to_cache(video_id, result["text"], "whisper")
|
| 318 |
+
|
| 319 |
return {
|
| 320 |
"text": result["text"],
|
| 321 |
"video_id": video_id,
|