feat: phase1 planning and video processing research
Browse files- Rewrite PLAN.md: focus on system error fixes (current 10% → 30% target)
- Add brainstorming_phase1_youtube.md: YouTube transcript approach research
- Add _template_original/: static reference for comparison
- Update CHANGELOG.md: course test setup analysis, 20 fixed questions documented
- Research findings:
* Transcript-first approach: 1-3s vs frame extraction 1-10min total
* Frame extraction is fast (5-20s), vision processing is slow
* Whisper open-source on ZeroGPU: 5-10x speedup, free
* Unified Phase 1+2 architecture: single transcribe_audio() function
Co-Authored-By: Claude <noreply@anthropic.com>
- CHANGELOG.md +148 -0
- PLAN.md +237 -597
- README.md +1 -1
- _template_original/README.md +15 -0
- _template_original/app.py +196 -0
- _template_original/requirements.txt +2 -0
- brainstorming_phase1_youtube.md +345 -0
- dev/dev_260102_13_stage2_tool_development.md +23 -16
- dev/dev_260102_14_stage3_core_logic.md +10 -6
- test/README.md +3 -2
CHANGELOG.md
CHANGED
|
@@ -1,5 +1,153 @@
|
|
| 1 |
# Session Changelog
|
| 2 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification
|
| 4 |
|
| 5 |
**Problem:** Confusion about which leaderboard we're submitting to. Mistakenly thought we needed to submit to official GAIA, but we're actually implementing the course assignment API.
|
|
|
|
| 1 |
# Session Changelog
|
| 2 |
|
| 3 |
+
## [2026-01-12] [Analysis] [COMPLETED] Course API Test Setup - Fixed vs Variable
|
| 4 |
+
|
| 5 |
+
**Purpose:** Understand which parts of template are FIXED (course API contract) vs CAN MODIFY (our improvements).
|
| 6 |
+
|
| 7 |
+
**Critical Finding:** Course API has a FIXED test setup - questions are NOT random.
|
| 8 |
+
|
| 9 |
+
### Fixed (Course API Contract - DO NOT CHANGE)
|
| 10 |
+
|
| 11 |
+
| Aspect | Value | Cannot Change |
|
| 12 |
+
|--------|-------|----------------|
|
| 13 |
+
| **API Endpoint** | `agents-course-unit4-scoring.hf.space` | ❌ |
|
| 14 |
+
| **Questions Route** | `GET /questions` | ❌ |
|
| 15 |
+
| **Submit Route** | `POST /submit` | ❌ |
|
| 16 |
+
| **Number of Questions** | **20** (always 20) | ❌ |
|
| 17 |
+
| **Question Source** | GAIA validation set, level 1 | ❌ |
|
| 18 |
+
| **Randomness** | **NO - Fixed set** | ❌ |
|
| 19 |
+
| **Difficulty** | All level 1 (easiest) | ❌ |
|
| 20 |
+
| **Filter Criteria** | By tools/steps complexity | ❌ |
|
| 21 |
+
| **Scoring** | EXACT MATCH | ❌ |
|
| 22 |
+
| **Target Score** | 30% = 6/20 correct | ❌ |
|
| 23 |
+
|
| 24 |
+
### The 20 Questions (ALWAYS the Same)
|
| 25 |
+
|
| 26 |
+
| # | Full Task ID | Description | Tools Required |
|
| 27 |
+
|---|--------------|-------------|----------------|
|
| 28 |
+
| 1 | `2d83110e-a098-4ebb-9987-066c06fa42d0` | Reverse sentence (calculator) | Calculator |
|
| 29 |
+
| 2 | `4fc2f1ae-8625-45b5-ab34-ad4433bc21f8` | Wikipedia dinosaur nomination | Web search |
|
| 30 |
+
| 3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | Video processing |
|
| 31 |
+
| 4 | `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` | Mercedes Sosa albums count | Web search |
|
| 32 |
+
| 5 | `9d191bce-651d-4746-be2d-7ef8ecadb9c2` | YouTube video - Teal'c quote | Video processing |
|
| 33 |
+
| 6 | `6f37996b-2ac7-44b0-8e68-6d28256631b4` | Operation table commutativity | CSV file |
|
| 34 |
+
| 7 | `cca530fc-4052-43b2-b130-b30968d8aa44` | Chess position - winning move | Image analysis |
|
| 35 |
+
| 8 | `3cef3a44-215e-4aed-8e3b-b1e3f08063b7` | Grocery list - vegetables only | Knowledge |
|
| 36 |
+
| 9 | `305ac316-eef6-4446-960a-92d80d542f82` | Polish Ray actor character | Web search |
|
| 37 |
+
| 10 | `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` | Strawberry pie recipe | MP3 audio |
|
| 38 |
+
| 11 | `cabe07ed-9eca-40ea-8ead-410ef5e83f91` | Equine veterinarian surname | Web search |
|
| 39 |
+
| 12 | `f918266a-b3e0-4914-865d-4faa564f1aef` | Python code output | Python execution |
|
| 40 |
+
| 13 | `1f975693-876d-457b-a649-393859e79bf3` | Calculus audio - page numbers | MP3 audio |
|
| 41 |
+
| 14 | `840bfca7-4f7b-481a-8794-c560c340185d` | NASA award number | PDF processing |
|
| 42 |
+
| 15 | `bda648d7-d618-4883-88f4-3466eabd860e` | Vietnamese specimens city | Web search |
|
| 43 |
+
| 16 | `3f57289b-8c60-48be-bd80-01f8099ca449` | Yankee at-bats count | Web search |
|
| 44 |
+
| 17 | `a0c07678-e491-4bbc-8f0b-07405144218f` | Pitcher numbers (before/after) | Web search |
|
| 45 |
+
| 18 | `cf106601-ab4f-4af9-b045-5295fe67b37d` | Olympics least athletes | Web search |
|
| 46 |
+
| 19 | `5a0c1adf-205e-4841-a666-7c3ef95def9d` | Malko Competition recipient | Web search |
|
| 47 |
+
| 20 | `7bd855d8-463d-4ed5-93ca-5fe35145f733` | Excel food sales calculation | Excel file |
|
| 48 |
+
|
| 49 |
+
**NOT random** - same 20 questions every submission!
|
| 50 |
+
|
| 51 |
+
### Template Contract (MUST Preserve)
|
| 52 |
+
|
| 53 |
+
```python
|
| 54 |
+
# REQUIRED - Do NOT change
|
| 55 |
+
questions_url = f"{api_url}/questions" # Fixed route
|
| 56 |
+
submit_url = f"{api_url}/submit" # Fixed route
|
| 57 |
+
|
| 58 |
+
submission_data = {
|
| 59 |
+
"username": username,
|
| 60 |
+
"agent_code": agent_code,
|
| 61 |
+
"answers": answers_payload # Fixed format
|
| 62 |
+
}
|
| 63 |
+
```
|
| 64 |
+
|
| 65 |
+
### Our Additions (SAFE to Modify)
|
| 66 |
+
|
| 67 |
+
| Feature | Purpose | Required? |
|
| 68 |
+
|---------|---------|-----------|
|
| 69 |
+
| Question Limit | Debug: run first N | ✅ Optional |
|
| 70 |
+
| Target Task IDs | Debug: run specific | ✅ Optional |
|
| 71 |
+
| ThreadPoolExecutor | Speed: concurrent | ✅ Optional |
|
| 72 |
+
| System Error Field | UX: error tracking | ✅ Optional |
|
| 73 |
+
| File Download (HF) | Feature: support files | ✅ Optional |
|
| 74 |
+
|
| 75 |
+
### Key Learnings
|
| 76 |
+
|
| 77 |
+
1. **Question set is FIXED** - not random, always same 20
|
| 78 |
+
2. **API routes are FIXED** - cannot change endpoints
|
| 79 |
+
3. **Submission format is FIXED** - must match exactly
|
| 80 |
+
4. **Our additions are OPTIONAL** - debug/features we added
|
| 81 |
+
5. **Original template is 8777 bytes** - ours is 32722 bytes (4x larger)
|
| 82 |
+
|
| 83 |
+
**Reference:** `_template_original/app.py` for original structure
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
+
|
| 87 |
+
## [2026-01-12] [Infrastructure] [COMPLETED] Original Template Reference Added
|
| 88 |
+
|
| 89 |
+
**Purpose:** Compare current work with original template to understand changes and avoid breaking template structure.
|
| 90 |
+
|
| 91 |
+
**Process:**
|
| 92 |
+
1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
|
| 93 |
+
2. Removed git-specific files (`.git/` folder, `.gitattributes`)
|
| 94 |
+
3. Copied to project as `_template_original/` (static reference, no git)
|
| 95 |
+
4. Cleaned up temporary clone from Downloads
|
| 96 |
+
|
| 97 |
+
**Why Static Reference:**
|
| 98 |
+
- No `.git/` folder → won't interfere with project's git
|
| 99 |
+
- No `.gitattributes` → clean file comparison
|
| 100 |
+
- Pure reference material for diff/comparison
|
| 101 |
+
- Can see exactly what changed from original
|
| 102 |
+
|
| 103 |
+
**Template Original Contents:**
|
| 104 |
+
- `app.py` (8777 bytes - original)
|
| 105 |
+
- `README.md` (400 bytes - original)
|
| 106 |
+
- `requirements.txt` (15 bytes - original)
|
| 107 |
+
|
| 108 |
+
**Comparison Commands:**
|
| 109 |
+
```bash
|
| 110 |
+
# Compare file sizes
|
| 111 |
+
ls -lh _template_original/app.py app.py
|
| 112 |
+
|
| 113 |
+
# See differences
|
| 114 |
+
diff _template_original/app.py app.py
|
| 115 |
+
|
| 116 |
+
# Count lines added
|
| 117 |
+
wc -l app.py _template_original/app.py
|
| 118 |
+
```
|
| 119 |
+
|
| 120 |
+
**Created Files:**
|
| 121 |
+
- **_template_original/** (NEW) - Static reference to original template (3 files)
|
| 122 |
+
|
| 123 |
+
---
|
| 124 |
+
|
| 125 |
+
## [2026-01-12] [Infrastructure] [COMPLETED] HuggingFace Space Renamed
|
| 126 |
+
|
| 127 |
+
**Context:** User wanted to compare current work with original template. Needed to rename current Space to free up `Final_Assignment_Template` name.
|
| 128 |
+
|
| 129 |
+
**Actions Taken:**
|
| 130 |
+
1. Renamed HuggingFace Space: `mangubee/Final_Assignment_Template` → `mangubee/agentbee`
|
| 131 |
+
2. Updated local git remote to point to new URL
|
| 132 |
+
3. Committed all today's changes (system error field, calculator fix, target task IDs, docs)
|
| 133 |
+
4. Pulled from remote (sync after rename - already up to date)
|
| 134 |
+
5. Pushed commits to renamed Space: `c86df49..41ac444`
|
| 135 |
+
|
| 136 |
+
**Key Learnings:**
|
| 137 |
+
- Local folder name ≠ git repo identity (can rename locally without affecting remote)
|
| 138 |
+
- Git remote URL determines push destination (updated to `agentbee`)
|
| 139 |
+
- HuggingFace Space name is independent of local folder name
|
| 140 |
+
- All work preserved through rename process
|
| 141 |
+
|
| 142 |
+
**Current State:**
|
| 143 |
+
- Local: `Final_Assignment_Template/` (folder name unchanged for convenience)
|
| 144 |
+
- Remote: `mangubee/agentbee` (renamed on HuggingFace)
|
| 145 |
+
- Sync: ✅ All changes pushed
|
| 146 |
+
- Git: All commits synced
|
| 147 |
+
- Template: `_template_original/` added for comparison
|
| 148 |
+
|
| 149 |
+
---
|
| 150 |
+
|
| 151 |
## [2026-01-12] [Documentation] [COMPLETED] Course vs Official GAIA Clarification
|
| 152 |
|
| 153 |
**Problem:** Confusion about which leaderboard we're submitting to. Mistakenly thought we needed to submit to official GAIA, but we're actually implementing the course assignment API.
|
PLAN.md
CHANGED
|
@@ -1,698 +1,338 @@
|
|
| 1 |
-
# Implementation Plan -
|
| 2 |
|
| 3 |
-
**Date:** 2026-01-
|
| 4 |
-
**Status:**
|
|
|
|
|
|
|
| 5 |
|
| 6 |
## Objective
|
| 7 |
|
| 8 |
-
Fix
|
| 9 |
|
| 10 |
-
## Current
|
| 11 |
|
| 12 |
-
|
| 13 |
-
2. **No HuggingFace vision support** - HF Inference API integration missing multimodal capability
|
| 14 |
-
3. **Inconsistent routing** - Planning/tool selection respect UI, vision doesn't
|
| 15 |
|
| 16 |
-
|
|
|
|
|
|
|
|
|
|
| 17 |
|
| 18 |
-
###
|
| 19 |
|
| 20 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
-
|
| 27 |
-
- Modify `analyze_image()` to check `os.getenv("LLM_PROVIDER")`
|
| 28 |
-
- Route to correct provider: `gemini`, `huggingface`, `groq`, `claude`
|
| 29 |
-
- Respect `ENABLE_LLM_FALLBACK` setting
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
|
|
|
|
|
|
|
|
|
| 38 |
|
| 39 |
-
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
**
|
| 44 |
|
| 45 |
-
**
|
|
|
|
|
|
|
|
|
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
| 50 |
-
- Supports: images, video, text
|
| 51 |
-
- API: HuggingFace Inference API (paid tier)
|
| 52 |
-
- Format: Base64 image + text prompt
|
| 53 |
-
|
| 54 |
-
2. **meta-llama/Llama-3.2-90B-Vision-Instruct**
|
| 55 |
-
|
| 56 |
-
- 90B parameters, multimodal
|
| 57 |
-
- Supports: images + text
|
| 58 |
-
- API: HuggingFace Inference API
|
| 59 |
-
|
| 60 |
-
3. **microsoft/Phi-3.5-vision-instruct**
|
| 61 |
-
- Smaller model (3.8B), efficient
|
| 62 |
-
- Supports: images + text
|
| 63 |
-
- Good for testing/debugging
|
| 64 |
|
| 65 |
**Implementation:**
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
|
| 71 |
**Pros:**
|
| 72 |
-
|
| 73 |
-
- ✅
|
| 74 |
-
- ✅
|
| 75 |
-
- ✅
|
| 76 |
-
- ✅ Consistent with current architecture
|
| 77 |
|
| 78 |
**Cons:**
|
|
|
|
|
|
|
| 79 |
|
| 80 |
-
|
| 81 |
-
- ❌ Need to verify which models work with Inference API
|
| 82 |
-
|
| 83 |
-
#### Option B: Image-to-Text Preprocessing
|
| 84 |
-
|
| 85 |
-
**Approach:** Convert images to text descriptions using separate tool, then feed to text-only LLM
|
| 86 |
-
|
| 87 |
-
**Tools available:**
|
| 88 |
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
- Image captioning model
|
| 92 |
-
- Converts image → text description
|
| 93 |
-
|
| 94 |
-
2. **LLaVA** (llava-hf/llava-1.5-7b-hf)
|
| 95 |
-
|
| 96 |
-
- Vision-language assistant
|
| 97 |
-
- Image → detailed text
|
| 98 |
-
|
| 99 |
-
3. **OpenCV + OCR** (pytesseract)
|
| 100 |
-
- Extract text from images
|
| 101 |
-
- Good for documents/screenshots
|
| 102 |
|
| 103 |
**Implementation:**
|
|
|
|
|
|
|
|
|
|
| 104 |
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
- Two-step process: vision → text → reasoning
|
| 108 |
-
|
| 109 |
-
**Pros:**
|
| 110 |
-
|
| 111 |
-
- ✅ Works with any text-only LLM
|
| 112 |
-
- ✅ Cheaper (can use smaller vision models)
|
| 113 |
-
- ✅ Fallback option if multimodal API unavailable
|
| 114 |
-
|
| 115 |
-
**Cons:**
|
| 116 |
-
|
| 117 |
-
- ❌ Two API calls (slower)
|
| 118 |
-
- ❌ Information loss in image → text conversion
|
| 119 |
-
- ❌ Poor for complex visual reasoning (chess positions, video analysis)
|
| 120 |
-
- ❌ Extra dependency management
|
| 121 |
-
|
| 122 |
-
## Recommended Approach
|
| 123 |
-
|
| 124 |
-
**Use Option A: Direct Multimodal LLM (Qwen2-VL-72B-Instruct)**
|
| 125 |
-
|
| 126 |
-
**Reasoning:**
|
| 127 |
-
|
| 128 |
-
1. User has HuggingFace paid tier access (confirmed)
|
| 129 |
-
2. GAIA questions require complex visual reasoning (chess positions, video analysis)
|
| 130 |
-
3. Simpler architecture - consistent with existing pattern
|
| 131 |
-
4. Better accuracy for benchmark performance
|
| 132 |
-
5. Focus on HF testing first, Groq later
|
| 133 |
-
|
| 134 |
-
**Fallback:** Keep Option B as backup if multimodal API doesn't work
|
| 135 |
-
|
| 136 |
-
## Implementation Steps
|
| 137 |
-
|
| 138 |
-
### Phase 0: API Validation (CRITICAL - DO THIS FIRST)
|
| 139 |
-
|
| 140 |
-
**Goal:** Validate HuggingFace Inference API supports vision BEFORE implementation
|
| 141 |
-
|
| 142 |
-
**Decision Gate 1:** Only proceed to Phase 1 if at least one model works
|
| 143 |
|
| 144 |
-
#### Step
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
- [ ] Simple test: Load apple image, ask "What is this?"
|
| 150 |
-
- [ ] Verify API accepts vision input (base64, URL, or file path)
|
| 151 |
-
- [ ] Document response format and error patterns
|
| 152 |
-
|
| 153 |
-
#### Step 0.2: Test Image Format Support
|
| 154 |
-
|
| 155 |
-
- [ ] Base64 encoding in messages
|
| 156 |
-
- [ ] Direct URL support
|
| 157 |
-
- [ ] Local file path support
|
| 158 |
-
- [ ] Document which format(s) work
|
| 159 |
-
|
| 160 |
-
#### Step 0.3: Document API Behavior
|
| 161 |
|
| 162 |
-
|
| 163 |
-
- [ ] Error patterns (quota, rate limit, invalid input)
|
| 164 |
-
- [ ] Rate limits and quotas
|
| 165 |
-
- [ ] Model selection recommendation
|
| 166 |
|
| 167 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 168 |
|
| 169 |
-
|
| 170 |
-
- [ ] **NO-GO:** 0 models work → Pivot to backup options:
|
| 171 |
-
- **Option C:** HF Spaces deployment (custom endpoint)
|
| 172 |
-
- **Option D:** Local transformers library (no API)
|
| 173 |
-
- **Option E:** Hybrid (HF text + Gemini/Claude vision only)
|
| 174 |
|
| 175 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 176 |
|
| 177 |
-
|
| 178 |
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
| 3 | `Qwen/Qwen3-VL-30B-A3B-Instruct` | Novita | ~14s | Qwen brand, reputable |
|
| 184 |
-
| 4 | `zai-org/GLM-4.6V-Flash` | zai-org | ~16s | Zhipu AI brand |
|
| 185 |
|
| 186 |
-
**
|
| 187 |
-
**Test image:** 2.1MB workspace photo (realistic large image)
|
| 188 |
|
| 189 |
---
|
| 190 |
|
| 191 |
-
### Phase
|
| 192 |
|
| 193 |
-
**Goal:**
|
| 194 |
|
| 195 |
-
**
|
| 196 |
|
| 197 |
-
|
| 198 |
-
- Format: Base64 encoding in messages array
|
| 199 |
-
- Timeout: ~6 seconds for 2.1MB image
|
| 200 |
|
| 201 |
-
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
- [ ] Modify `analyze_image()` to check `os.getenv("LLM_PROVIDER")`
|
| 214 |
-
- [ ] Add routing logic (each provider fails independently):
|
| 215 |
-
|
| 216 |
-
```python
|
| 217 |
-
if provider == "huggingface":
|
| 218 |
-
return analyze_image_hf(image_path, question) # Fail if error
|
| 219 |
-
elif provider == "gemini":
|
| 220 |
-
return analyze_image_gemini(image_path, question) # Fail if error
|
| 221 |
-
elif provider == "claude":
|
| 222 |
-
return analyze_image_claude(image_path, question) # Fail if error
|
| 223 |
-
# NO fallback chains during testing - defeats isolation purpose
|
| 224 |
-
```
|
| 225 |
|
| 226 |
-
|
| 227 |
-
-
|
|
|
|
| 228 |
|
| 229 |
-
|
|
|
|
|
|
|
|
|
|
| 230 |
|
| 231 |
-
|
| 232 |
-
- [ ] Update `src/config/settings.py` with vision model setting
|
| 233 |
-
- [ ] Document alternatives (Qwen/Qwen3-VL-8B-Instruct for small images only)
|
| 234 |
|
| 235 |
---
|
| 236 |
|
| 237 |
-
### Phase
|
| 238 |
-
|
| 239 |
-
**Goal:** Validate basic vision works before complex GAIA questions
|
| 240 |
-
|
| 241 |
-
**Decision Gate 2:** Only proceed to Phase 3 if ≥3/4 smoke tests pass
|
| 242 |
-
|
| 243 |
-
#### Step 2.1: Simple Image Description Test
|
| 244 |
|
| 245 |
-
|
| 246 |
-
- [ ] Question: "Describe this image"
|
| 247 |
-
- [ ] Expected: Basic object recognition works
|
| 248 |
-
- [ ] Export: `output/smoke_test_description.json`
|
| 249 |
|
| 250 |
-
|
| 251 |
|
| 252 |
-
|
| 253 |
-
- [ ] Question: "What text do you see?"
|
| 254 |
-
- [ ] Expected: Text extraction works
|
| 255 |
-
- [ ] Export: `output/smoke_test_ocr.json`
|
| 256 |
|
| 257 |
-
|
| 258 |
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
-
|
| 263 |
|
| 264 |
-
|
| 265 |
|
| 266 |
-
|
| 267 |
-
- [ ] Run with HuggingFace provider
|
| 268 |
-
- [ ] Verify end-to-end integration works
|
| 269 |
-
- [ ] Export: `output/smoke_test_gaia_single.json`
|
| 270 |
-
|
| 271 |
-
#### Step 2.5: Decision Gate - GO/NO-GO
|
| 272 |
-
|
| 273 |
-
- [ ] **GO:** ≥3/4 smoke tests pass → Proceed to Phase 3
|
| 274 |
-
- [ ] **NO-GO:** <3/4 pass → Debug before GAIA evaluation
|
| 275 |
|
| 276 |
---
|
| 277 |
|
| 278 |
-
### Phase
|
| 279 |
-
|
| 280 |
-
**Goal:** Test HuggingFace vision on full GAIA benchmark
|
| 281 |
-
|
| 282 |
-
#### Step 3.1: Run Full GAIA Evaluation (HuggingFace Only)
|
| 283 |
-
|
| 284 |
-
- [ ] Set `LLM_PROVIDER=huggingface` in UI
|
| 285 |
-
- [ ] Run all 20 questions
|
| 286 |
-
- [ ] Export: `output/gaia_results_hf_TIMESTAMP.json` (HF only, no mixing)
|
| 287 |
-
- [ ] Log which questions use vision tool vs other tools
|
| 288 |
|
| 289 |
-
|
| 290 |
|
| 291 |
-
|
| 292 |
-
- [ ] Break down by question type:
|
| 293 |
-
- Vision questions: X/8 correct
|
| 294 |
-
- Non-vision questions: X/12 correct
|
| 295 |
-
- [ ] Identify failure patterns (vision errors, wrong answers, tool selection errors)
|
| 296 |
-
- [ ] Compare to 0% baseline
|
| 297 |
|
| 298 |
-
|
| 299 |
|
| 300 |
-
|
|
|
|
|
|
|
|
|
|
| 301 |
|
| 302 |
-
|
| 303 |
-
| --------------------- | ---------------- | -------- | -------------- |
|
| 304 |
-
| HuggingFace (Phi-3.5) | 8/8 attempted | X% | [observations] |
|
| 305 |
-
| Gemini (baseline) | 8/8 attempted | Y% | [comparison] |
|
| 306 |
|
| 307 |
-
|
| 308 |
-
|
| 309 |
-
- [ ] **If accuracy ≥20%:** Good enough, proceed to Phase 4 (media processing)
|
| 310 |
-
- [ ] **If accuracy <20%:** Analyze failures, try larger HF model (Llama-3.2 or Qwen2-VL)
|
| 311 |
-
- [ ] **If accuracy <5%:** Re-evaluate approach, consider backup options
|
| 312 |
|
| 313 |
---
|
| 314 |
|
| 315 |
-
### Phase
|
| 316 |
-
|
| 317 |
-
**Goal:** Add YouTube and audio support
|
| 318 |
|
| 319 |
-
|
| 320 |
|
| 321 |
-
|
| 322 |
-
- [ ] Use `youtube-transcript-api` library
|
| 323 |
-
- [ ] Extract dialogue/captions as text
|
| 324 |
-
- [ ] Pass transcript to LLM for question answering
|
| 325 |
-
- [ ] Test on GAIA YouTube questions (bird species, Stargate quote)
|
| 326 |
-
- [ ] Export: `output/gaia_results_hf_with_youtube.json`
|
| 327 |
|
| 328 |
-
|
|
|
|
|
|
|
| 329 |
|
| 330 |
-
|
| 331 |
-
|
| 332 |
-
-
|
| 333 |
-
-
|
| 334 |
-
-
|
| 335 |
-
- [ ] Export: `output/gaia_results_hf_with_audio.json`
|
| 336 |
-
|
| 337 |
-
---
|
| 338 |
|
| 339 |
-
|
| 340 |
-
|
| 341 |
-
|
| 342 |
-
|
| 343 |
-
#### Step 5.1: Add Groq Vision Support
|
| 344 |
-
|
| 345 |
-
- [ ] Implement `analyze_image_groq()` using Llama-3.2-90B-Vision
|
| 346 |
-
- [ ] Add to vision tool routing (independent, no fallback)
|
| 347 |
-
- [ ] Test with Groq free tier (30 req/min)
|
| 348 |
-
- [ ] Export: `output/gaia_results_groq_TIMESTAMP.json`
|
| 349 |
-
- [ ] Compare accuracy: HF vs Groq
|
| 350 |
|
| 351 |
---
|
| 352 |
|
| 353 |
-
### Phase 6: Final Verification
|
| 354 |
-
|
| 355 |
-
**Goal:** Document final results and verify all tests pass
|
| 356 |
-
|
| 357 |
-
#### Step 6.1: Final GAIA Evaluation (All Media Types)
|
| 358 |
-
|
| 359 |
-
- [ ] Run all 20 questions with HuggingFace
|
| 360 |
-
- [ ] Verify: images, videos, audio all work
|
| 361 |
-
- [ ] Export: `output/gaia_results_final_TIMESTAMP.json`
|
| 362 |
-
- [ ] Document final accuracy vs 0% baseline
|
| 363 |
-
|
| 364 |
-
#### Step 6.2: Regression Testing
|
| 365 |
-
|
| 366 |
-
- [ ] Run all 99 tests
|
| 367 |
-
- [ ] Verify no regressions introduced
|
| 368 |
-
- [ ] Fix any broken tests
|
| 369 |
-
|
| 370 |
-
#### Step 6.3: Documentation
|
| 371 |
-
|
| 372 |
-
- [ ] Update CHANGELOG.md with final results
|
| 373 |
-
- [ ] Update README.md with HF vision support
|
| 374 |
-
- [ ] Document model selection strategy
|
| 375 |
-
|
| 376 |
-
## Files to Modify
|
| 377 |
-
|
| 378 |
-
### Phase 0-1: Core Vision Integration
|
| 379 |
-
|
| 380 |
-
1. **src/tools/vision.py** (~150 lines added/modified)
|
| 381 |
-
|
| 382 |
-
- Add `analyze_image_hf()` function (Phase 1)
|
| 383 |
-
- Modify `analyze_image()` routing logic - NO FALLBACKS (Phase 1)
|
| 384 |
-
- Add retry logic with exponential backoff
|
| 385 |
-
- Clear error messages for debugging
|
| 386 |
-
|
| 387 |
-
2. **.env** (~3 lines added)
|
| 388 |
-
|
| 389 |
-
- Add `HF_VISION_MODEL=microsoft/Phi-3.5-vision-instruct` (start small)
|
| 390 |
-
- Document alternatives: Llama-3.2-11B-Vision, Qwen2-VL-72B
|
| 391 |
-
|
| 392 |
-
3. **src/config/settings.py** (~5 lines)
|
| 393 |
-
- Add `hf_vision_model` setting
|
| 394 |
-
- Load from environment variable
|
| 395 |
-
|
| 396 |
-
### Phase 2-3: Testing Infrastructure
|
| 397 |
-
|
| 398 |
-
1. **test/test_vision_smoke.py** (NEW - ~100 lines)
|
| 399 |
-
|
| 400 |
-
- Smoke test suite: description, OCR, counting, single GAIA
|
| 401 |
-
- Export individual test results
|
| 402 |
-
|
| 403 |
-
2. **app.py** (optional - ~10 lines)
|
| 404 |
-
- Update export filenames to include provider: `gaia_results_hf_TIMESTAMP.json`
|
| 405 |
-
- Separate results per provider for capability matrix
|
| 406 |
-
|
| 407 |
-
### Phase 4: Media Processing
|
| 408 |
-
|
| 409 |
-
1. **src/tools/youtube.py** (NEW - ~80 lines)
|
| 410 |
-
|
| 411 |
-
- YouTube transcript extraction
|
| 412 |
-
- Use `youtube-transcript-api`
|
| 413 |
-
|
| 414 |
-
2. **src/tools/audio.py** (NEW - ~80 lines)
|
| 415 |
-
|
| 416 |
-
- Audio transcription (Whisper or HF audio models)
|
| 417 |
-
- Convert audio → text
|
| 418 |
-
|
| 419 |
-
3. **src/tools/**init**.py** (~10 lines)
|
| 420 |
-
|
| 421 |
-
- Register new tools: youtube_transcript, audio_transcribe
|
| 422 |
-
|
| 423 |
-
4. **requirements.txt** (~3 lines)
|
| 424 |
-
- Add `youtube-transcript-api`
|
| 425 |
-
- Add `openai-whisper` or HF audio model library
|
| 426 |
-
|
| 427 |
-
### Phase 6: Documentation
|
| 428 |
-
|
| 429 |
-
1. **README.md** (~30 lines modified)
|
| 430 |
-
- Document HF vision support
|
| 431 |
-
- List model options and selection strategy
|
| 432 |
-
- Update architecture diagram with media processing tools
|
| 433 |
-
|
| 434 |
## Success Criteria
|
| 435 |
|
| 436 |
-
### Phase
|
| 437 |
-
|
| 438 |
-
- [ ]
|
| 439 |
-
- [ ]
|
| 440 |
-
- [ ]
|
| 441 |
-
|
| 442 |
-
### Phase
|
| 443 |
-
|
| 444 |
-
- [ ]
|
| 445 |
-
- [ ]
|
| 446 |
-
- [ ]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 447 |
|
| 448 |
-
|
| 449 |
-
|
| 450 |
-
- [ ] ≥3/4 smoke tests pass
|
| 451 |
-
- [ ] Basic vision capabilities validated
|
| 452 |
-
|
| 453 |
-
### Phase 3: GAIA Evaluation
|
| 454 |
-
|
| 455 |
-
- [ ] UI LLM selection propagates to vision tool
|
| 456 |
-
- [ ] HuggingFace-only results exported: `output/gaia_results_hf_TIMESTAMP.json`
|
| 457 |
-
- [ ] Accuracy measured and compared to 0% baseline
|
| 458 |
-
- [ ] Capability matrix built (per-provider comparison)
|
| 459 |
-
|
| 460 |
-
### Phase 4-6: Full Coverage
|
| 461 |
-
|
| 462 |
-
- [ ] YouTube video questions work (transcript extraction)
|
| 463 |
-
- [ ] Audio questions work (transcription)
|
| 464 |
-
- [ ] All 99 tests still passing
|
| 465 |
-
- [ ] Final accuracy ≥20% (minimum acceptable)
|
| 466 |
-
|
| 467 |
-
## Backup Strategy Options
|
| 468 |
-
|
| 469 |
-
If Phase 0 reveals HF Inference API doesn't support vision:
|
| 470 |
-
|
| 471 |
-
### Option C: HuggingFace Spaces Deployment
|
| 472 |
-
|
| 473 |
-
- Deploy custom vision model to HF Spaces
|
| 474 |
-
- Use Inference Endpoints (paid tier)
|
| 475 |
-
- More control, higher cost
|
| 476 |
-
|
| 477 |
-
### Option D: Local Transformers Library
|
| 478 |
-
|
| 479 |
-
- Use `transformers` library directly (no API)
|
| 480 |
-
- Load model locally: `AutoModelForVision2Seq`
|
| 481 |
-
- Slower, requires GPU, but guaranteed to work
|
| 482 |
-
|
| 483 |
-
### Option E: Hybrid Architecture
|
| 484 |
-
|
| 485 |
-
- Keep HuggingFace for text-only LLM
|
| 486 |
-
- Use Gemini/Claude for vision only
|
| 487 |
-
- Compromise: HF testing focus, but vision delegates to working providers
|
| 488 |
-
|
| 489 |
-
## Decision Gates Summary
|
| 490 |
-
|
| 491 |
-
**Gate 1 (Phase 0):** Does HF API support vision?
|
| 492 |
-
|
| 493 |
-
- **GO:** ≥1 model works → Phase 1
|
| 494 |
-
- **NO-GO:** 0 models work → Pivot to Option C/D/E
|
| 495 |
-
|
| 496 |
-
**Gate 2 (Phase 2):** Do smoke tests pass?
|
| 497 |
|
| 498 |
-
|
| 499 |
-
|
|
|
|
|
|
|
| 500 |
|
| 501 |
-
|
|
|
|
|
|
|
|
|
|
| 502 |
|
| 503 |
-
|
| 504 |
-
|
| 505 |
-
|
| 506 |
|
| 507 |
-
##
|
| 508 |
|
| 509 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 510 |
|
| 511 |
-
|
| 512 |
-
- Test Llama-3.2-11B-Vision-Instruct
|
| 513 |
-
- Test Qwen2-VL-72B-Instruct
|
| 514 |
|
| 515 |
-
|
|
|
|
|
|
|
| 516 |
|
| 517 |
-
|
| 518 |
-
|
| 519 |
-
|
| 520 |
|
| 521 |
-
3
|
| 522 |
-
|
| 523 |
-
|
| 524 |
-
- Rate limits and quotas
|
| 525 |
|
| 526 |
## Next Actions
|
| 527 |
|
| 528 |
-
**Phase
|
| 529 |
|
| 530 |
-
1.
|
| 531 |
-
2.
|
| 532 |
-
3.
|
| 533 |
-
4.
|
|
|
|
|
|
|
| 534 |
|
| 535 |
-
|
| 536 |
|
| 537 |
-
|
| 538 |
-
|
| 539 |
-
**Goal:** Enable agent to download and process file attachments from GAIA questions
|
| 540 |
-
|
| 541 |
-
**Problem:**
|
| 542 |
-
- Current code ignores `file_name` field in GAIA questions
|
| 543 |
-
- Files not downloaded from `GET /files/{task_id}` endpoint
|
| 544 |
-
- Vision/file parsing tools fail with placeholder `<provided_image_path>`
|
| 545 |
-
- ~40% of questions (8/20) fail due to missing file handling
|
| 546 |
-
|
| 547 |
-
### Root Cause
|
| 548 |
-
|
| 549 |
-
**GAIA Question Structure:**
|
| 550 |
-
```json
|
| 551 |
-
{
|
| 552 |
-
"task_id": "abc123",
|
| 553 |
-
"question": "What's in this image?",
|
| 554 |
-
"file_name": "chess.png", // NULL if no file
|
| 555 |
-
"file_path": "/files/abc123" // NULL if no file
|
| 556 |
-
}
|
| 557 |
-
```
|
| 558 |
-
|
| 559 |
-
**Current Code (app.py:249-290):**
|
| 560 |
-
```python
|
| 561 |
-
def process_single_question(agent, item, index, total):
|
| 562 |
-
task_id = item.get("task_id")
|
| 563 |
-
question_text = item.get("question")
|
| 564 |
-
# ❌ MISSING: Check file_name
|
| 565 |
-
# ❌ MISSING: Download file
|
| 566 |
-
# ❌ MISSING: Pass file_path to agent
|
| 567 |
-
|
| 568 |
-
submitted_answer = agent(question_text) # No file handling
|
| 569 |
-
```
|
| 570 |
-
|
| 571 |
-
**Result:** LLM generates `vision(image_path="<provided_image_path>")` → File not found error
|
| 572 |
-
|
| 573 |
-
### Solution Architecture
|
| 574 |
-
|
| 575 |
-
**Step 1: Add File Download Function**
|
| 576 |
-
|
| 577 |
-
```python
|
| 578 |
-
def download_task_file(task_id: str, save_dir: str = "input/") -> Optional[str]:
|
| 579 |
-
"""Download file attached to a GAIA question.
|
| 580 |
-
|
| 581 |
-
Args:
|
| 582 |
-
task_id: Question's task_id
|
| 583 |
-
save_dir: Directory to save file
|
| 584 |
-
|
| 585 |
-
Returns:
|
| 586 |
-
File path if downloaded, None if no file
|
| 587 |
-
"""
|
| 588 |
-
api_url = "https://agents-course-unit4-scoring.hf.space"
|
| 589 |
-
file_url = f"{api_url}/files/{task_id}"
|
| 590 |
-
|
| 591 |
-
response = requests.get(file_url, timeout=30)
|
| 592 |
-
response.raise_for_status()
|
| 593 |
-
|
| 594 |
-
# Get extension from Content-Type header
|
| 595 |
-
content_type = response.headers.get('Content-Type', '')
|
| 596 |
-
extension_map = {
|
| 597 |
-
'image/png': '.png',
|
| 598 |
-
'image/jpeg': '.jpg',
|
| 599 |
-
'application/pdf': '.pdf',
|
| 600 |
-
'text/csv': '.csv',
|
| 601 |
-
'application/json': '.json',
|
| 602 |
-
'application/vnd.ms-excel': '.xls',
|
| 603 |
-
'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet': '.xlsx',
|
| 604 |
-
}
|
| 605 |
-
extension = extension_map.get(content_type, '.file')
|
| 606 |
-
|
| 607 |
-
# Save file
|
| 608 |
-
Path(save_dir).mkdir(exist_ok=True)
|
| 609 |
-
file_path = f"{save_dir}{task_id}{extension}"
|
| 610 |
-
with open(file_path, 'wb') as f:
|
| 611 |
-
f.write(response.content)
|
| 612 |
-
|
| 613 |
-
return file_path
|
| 614 |
-
```
|
| 615 |
-
|
| 616 |
-
**Step 2: Modify Question Processing**
|
| 617 |
-
|
| 618 |
-
```python
|
| 619 |
-
def process_single_question(agent, item, index, total):
|
| 620 |
-
task_id = item.get("task_id")
|
| 621 |
-
question_text = item.get("question")
|
| 622 |
-
file_name = item.get("file_name") # ✅ NEW
|
| 623 |
-
|
| 624 |
-
# Download file if exists
|
| 625 |
-
file_path = None
|
| 626 |
-
if file_name:
|
| 627 |
-
file_path = download_task_file(task_id)
|
| 628 |
-
|
| 629 |
-
# Pass file info to agent
|
| 630 |
-
submitted_answer = agent(question_text, file_path=file_path) # ✅ NEW
|
| 631 |
-
```
|
| 632 |
-
|
| 633 |
-
**Step 3: Update LLM Context**
|
| 634 |
-
|
| 635 |
-
When file_path is provided, include it in the planning prompt:
|
| 636 |
-
```python
|
| 637 |
-
if file_path:
|
| 638 |
-
question_context = f"Question: {question_text}\nAttached file: {file_path}"
|
| 639 |
-
else:
|
| 640 |
-
question_context = question_text
|
| 641 |
-
```
|
| 642 |
-
|
| 643 |
-
### Implementation Steps
|
| 644 |
-
|
| 645 |
-
#### Step 7.1: Add File Download Function
|
| 646 |
-
|
| 647 |
-
- [ ] Create `download_task_file()` in app.py
|
| 648 |
-
- [ ] Handle Content-Type to extension mapping
|
| 649 |
-
- [ ] Handle 404 gracefully (no file for this task)
|
| 650 |
-
- [ ] Create input/ directory if not exists
|
| 651 |
-
|
| 652 |
-
#### Step 7.2: Modify Question Processing Loop
|
| 653 |
-
|
| 654 |
-
- [ ] Check `item.get("file_name")` in process_single_question
|
| 655 |
-
- [ ] Call download_task_file() if file_name exists
|
| 656 |
-
- [ ] Pass file_path to agent invocation
|
| 657 |
-
|
| 658 |
-
#### Step 7.3: Update Agent to Handle file_path
|
| 659 |
-
|
| 660 |
-
- [ ] Modify agent to accept optional file_path parameter
|
| 661 |
-
- [ ] Include file info in planning prompt
|
| 662 |
-
- [ ] Update tool selection to use real file paths
|
| 663 |
-
|
| 664 |
-
#### Step 7.4: Test File Handling
|
| 665 |
-
|
| 666 |
-
- [ ] Test with image question (chess position)
|
| 667 |
-
- [ ] Test with document question (Excel file)
|
| 668 |
-
- [ ] Verify no more `<provided_image_path>` errors
|
| 669 |
-
|
| 670 |
-
### Files to Modify
|
| 671 |
-
|
| 672 |
-
1. **app.py** (~80 lines added/modified)
|
| 673 |
-
- Add download_task_file() function
|
| 674 |
-
- Modify process_single_question() to handle files
|
| 675 |
-
- Add input/ directory creation
|
| 676 |
-
|
| 677 |
-
2. **src/agent/graph.py** (~20 lines)
|
| 678 |
-
- Update agent state to include file_path
|
| 679 |
-
- Pass file info to planning prompt
|
| 680 |
-
|
| 681 |
-
3. **.gitignore** (~2 lines)
|
| 682 |
-
- Add input/ to ignore downloaded files
|
| 683 |
-
|
| 684 |
-
### Success Criteria
|
| 685 |
|
| 686 |
-
|
| 687 |
-
- [ ] Document questions: parse_file tool receives real file path
|
| 688 |
-
- [ ] No more `<provided_image_path>` errors
|
| 689 |
-
- [ ] Accuracy improves from 10% toward 30%+
|
| 690 |
|
| 691 |
-
|
|
|
|
|
|
|
| 692 |
|
| 693 |
-
|
| 694 |
-
|
| 695 |
-
|
| 696 |
-
| Vision questions: All fail | Vision questions: Can work |
|
| 697 |
-
| Document questions: All fail | Document questions: Can work |
|
| 698 |
-
| Max accuracy: ~60% | Max accuracy: ~100% potential |
|
|
|
|
| 1 |
+
# Implementation Plan - System Error Fixes for 30% Target
|
| 2 |
|
| 3 |
+
**Date:** 2026-01-13
|
| 4 |
+
**Status:** Active
|
| 5 |
+
**Current Score:** 10% (2/20 correct)
|
| 6 |
+
**Target:** 30% (6/20 correct)
|
| 7 |
|
| 8 |
## Objective
|
| 9 |
|
| 10 |
+
Fix remaining 6 system errors to unlock questions, then address LLM quality issues to reach 30% target (6/20 correct).
|
| 11 |
|
| 12 |
+
## Current Status Analysis
|
| 13 |
|
| 14 |
+
### ✅ Working (2/20 correct - 10%)
|
|
|
|
|
|
|
| 15 |
|
| 16 |
+
| # | Task | Status | Issue |
|
| 17 |
+
|---|------|--------|-------|
|
| 18 |
+
| 9 | Polish Ray actor | ✅ Correct | - |
|
| 19 |
+
| 15 | Vietnamese specimens | ✅ Correct | - |
|
| 20 |
|
| 21 |
+
### ⚠️ System Errors (6/20 - Technical issues blocking)
|
| 22 |
|
| 23 |
+
| # | Task | Error | Type | Priority |
|
| 24 |
+
|---|------|-------|------|----------|
|
| 25 |
+
| **3** | YouTube video (bird species) | Vision tool can't handle URLs | Technical | **HIGH** |
|
| 26 |
+
| **5** | YouTube video (Teal'c) | Vision tool can't handle URLs | Technical | **HIGH** |
|
| 27 |
+
| **6** | CSV table (commutativity) | LLM tries to load `table_data.csv` | LLM Quality | MED |
|
| 28 |
+
| **10** | MP3 audio (pie recipe) | Unsupported file type | Technical | **MED** |
|
| 29 |
+
| **12** | Python code execution | Unsupported file type | Technical | **LOW** |
|
| 30 |
+
| **13** | MP3 audio (calculus) | Unsupported file type | Technical | **MED** |
|
| 31 |
|
| 32 |
+
### ❌ LLM Quality Issues (12/20 - AI can't solve)
|
| 33 |
|
| 34 |
+
| # | Task | Answer | Expected | Type |
|
| 35 |
+
|---|------|--------|----------|------|
|
| 36 |
+
| 1 | Calculator | "Unable to answer" | Right | Reasoning |
|
| 37 |
+
| 2 | Wikipedia dinosaur | "Scott Hartman" | FunkMonk | Knowledge |
|
| 38 |
+
| 4 | Mercedes Sosa albums | "Unable to answer" | 3 | Knowledge |
|
| 39 |
+
| 7 | Chess position | "Unable to answer" | Rd5 | Vision+Reasoning |
|
| 40 |
+
| 8 | Grocery list (botany) | Wrong (includes fruits) | 5 items | Knowledge |
|
| 41 |
+
| 11 | Equine veterinarian | "Unable to answer" | Louvrier | Knowledge |
|
| 42 |
+
| 14 | NASA award | "Unable to answer" | 80GSFC21M0002 | Knowledge |
|
| 43 |
+
| 16 | Yankee at-bats | "Unable to answer" | 519 | Knowledge |
|
| 44 |
+
| 17 | Pitcher numbers | "Unable to answer" | Yoshida, Uehara | Knowledge |
|
| 45 |
+
| 18 | Olympics athletes | "Unable to answer" | CUB | Knowledge |
|
| 46 |
+
| 19 | Malko Competition | "Unable to answer" | Claus | Knowledge |
|
| 47 |
+
| 20 | Excel sales | "12096.00" | "89706.00" | Calculation |
|
| 48 |
|
| 49 |
+
## Strategy
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
+
**Priority 1: Fix System Errors** (unlock 6 questions)
|
| 52 |
+
- YouTube videos (2 questions) - HIGH impact
|
| 53 |
+
- MP3 audio (2 questions) - Medium impact
|
| 54 |
+
- Python execution (1 question) - Low impact
|
| 55 |
+
- CSV table - LLM issue, not technical
|
| 56 |
|
| 57 |
+
**Priority 2: Improve LLM Quality** (address "Unable to answer" cases)
|
| 58 |
+
- Better prompting
|
| 59 |
+
- Tool selection improvements
|
| 60 |
+
- Reasoning enhancements
|
| 61 |
|
| 62 |
+
## Implementation Plan
|
| 63 |
|
| 64 |
+
### Phase 1: YouTube Video Support (HIGH Priority)
|
| 65 |
|
| 66 |
+
**Goal:** Fix questions #3 and #5 (YouTube videos)
|
| 67 |
|
| 68 |
+
**Root Cause:** Vision tool tries to process YouTube URLs directly, but:
|
| 69 |
+
- YouTube videos need to be downloaded first
|
| 70 |
+
- Vision tool expects image files, not video URLs
|
| 71 |
+
- Need to extract frames or use transcript
|
| 72 |
|
| 73 |
+
**Solution Options:**
|
| 74 |
|
| 75 |
+
#### Option A: YouTube Transcript (Recommended)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 76 |
|
| 77 |
**Implementation:**
|
| 78 |
+
```python
|
| 79 |
+
# NEW: src/tools/youtube.py
|
| 80 |
+
import youtube_transcript_api
|
| 81 |
+
|
| 82 |
+
def get_youtube_transcript(video_url: str) -> str:
|
| 83 |
+
"""Extract transcript from YouTube video."""
|
| 84 |
+
try:
|
| 85 |
+
video_id = extract_video_id(video_url)
|
| 86 |
+
transcript = YouTubeTranscriptApi.get_transcript(video_id)
|
| 87 |
+
return format_transcript(transcript)
|
| 88 |
+
except Exception as e:
|
| 89 |
+
return f"ERROR: Could not extract transcript: {e}"
|
| 90 |
+
```
|
| 91 |
|
| 92 |
**Pros:**
|
| 93 |
+
- ✅ Works with current LLM (text-based)
|
| 94 |
+
- ✅ Simple API (youtube-transcript-api library)
|
| 95 |
+
- ✅ Fast, no video download needed
|
| 96 |
+
- ✅ Solves both #3 and #5
|
|
|
|
| 97 |
|
| 98 |
**Cons:**
|
| 99 |
+
- ❌ Won't work for visual-only questions (but our questions are about content)
|
| 100 |
+
- ❌ Might not capture visual details
|
| 101 |
|
| 102 |
+
**Decision:** Use transcript approach since questions ask about content (bird species, dialogue)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 103 |
|
| 104 |
+
#### Option B: Video Frame Extraction
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
|
| 106 |
**Implementation:**
|
| 107 |
+
- Download video (yt-dlp)
|
| 108 |
+
- Extract key frames (OpenCV)
|
| 109 |
+
- Pass frames to vision tool
|
| 110 |
|
| 111 |
+
**Pros:** Visual analysis
|
| 112 |
+
**Cons:** Slow, complex, overkill for content questions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
|
| 114 |
+
#### Step 1.1: Install youtube-transcript-api
|
| 115 |
|
| 116 |
+
```bash
|
| 117 |
+
uv add youtube-transcript-api
|
| 118 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 119 |
|
| 120 |
+
#### Step 1.2: Create YouTube tool
|
|
|
|
|
|
|
|
|
|
| 121 |
|
| 122 |
+
```python
|
| 123 |
+
# src/tools/youtube.py
|
| 124 |
+
def youtube_transcript(video_url: str) -> str:
|
| 125 |
+
"""Extract transcript from YouTube video."""
|
| 126 |
+
```
|
| 127 |
|
| 128 |
+
#### Step 1.3: Register tool
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
+
```python
|
| 131 |
+
# src/tools/__init__.py
|
| 132 |
+
TOOLS = [
|
| 133 |
+
...
|
| 134 |
+
{"name": "youtube_transcript", "func": youtube_transcript,
|
| 135 |
+
"description": "Extract transcript from YouTube video URL. Use when question mentions YouTube video content like dialogue, speech, or visual descriptions."},
|
| 136 |
+
]
|
| 137 |
+
```
|
| 138 |
|
| 139 |
+
#### Step 1.4: Test
|
| 140 |
|
| 141 |
+
```bash
|
| 142 |
+
# Test on question #3
|
| 143 |
+
Target Task ID: a1e91b78-d3d8-4675-bb8d-62741b4b68a6
|
| 144 |
+
```
|
|
|
|
|
|
|
| 145 |
|
| 146 |
+
**Expected impact:** +2 questions (30% → 40% if both work)
|
|
|
|
| 147 |
|
| 148 |
---
|
| 149 |
|
| 150 |
+
### Phase 2: MP3 Audio Support (MEDIUM Priority)
|
| 151 |
|
| 152 |
+
**Goal:** Fix questions #10 and #13 (MP3 audio files)
|
| 153 |
|
| 154 |
+
**Root Cause:** parse_file doesn't support .mp3
|
| 155 |
|
| 156 |
+
**Solution:** Add audio transcription tool
|
|
|
|
|
|
|
| 157 |
|
| 158 |
+
**Implementation:**
|
| 159 |
+
```python
|
| 160 |
+
# NEW: src/tools/audio.py
|
| 161 |
+
import whisper
|
| 162 |
+
|
| 163 |
+
def transcribe_audio(file_path: str) -> str:
|
| 164 |
+
"""Transcribe audio file to text using OpenAI Whisper."""
|
| 165 |
+
model = whisper.load_model("base")
|
| 166 |
+
result = model.transcribe(file_path)
|
| 167 |
+
return result["text"]
|
| 168 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
|
| 170 |
+
**Alternative:** HuggingFace audio models (free)
|
| 171 |
+
- `openai/whisper-base`
|
| 172 |
+
- Use via Inference API
|
| 173 |
|
| 174 |
+
**Step 2.1:** Choose implementation (Whisper vs HF)
|
| 175 |
+
**Step 2.2:** Implement audio tool
|
| 176 |
+
**Step 2.3:** Add to TOOLS registry
|
| 177 |
+
**Step 2.4:** Test on #10 and #13
|
| 178 |
|
| 179 |
+
**Expected impact:** +2 questions (30% → 40% if both work)
|
|
|
|
|
|
|
| 180 |
|
| 181 |
---
|
| 182 |
|
| 183 |
+
### Phase 3: Python Code Execution (LOW Priority)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 184 |
|
| 185 |
+
**Goal:** Fix question #12 (Python code output)
|
|
|
|
|
|
|
|
|
|
| 186 |
|
| 187 |
+
**Root Cause:** parse_file doesn't support .py execution
|
| 188 |
|
| 189 |
+
**Solution:** Add code execution tool (sandboxed)
|
|
|
|
|
|
|
|
|
|
| 190 |
|
| 191 |
+
**Security Concern:** ⚠️ **DANGEROUS** - executing arbitrary Python code
|
| 192 |
|
| 193 |
+
**Options:**
|
| 194 |
+
1. **Restricted execution** - Only allow specific operations
|
| 195 |
+
2. **Docker container** - Isolate execution
|
| 196 |
+
3. **Skip for now** - Defer due to security concerns
|
| 197 |
|
| 198 |
+
**Decision:** Mark as **DEFERRED** due to security complexity
|
| 199 |
|
| 200 |
+
**Expected impact:** +1 question (if implemented)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 201 |
|
| 202 |
---
|
| 203 |
|
| 204 |
+
### Phase 4: CSV Table Issue (LLM Quality)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 205 |
|
| 206 |
+
**Goal:** Fix question #6 (table commutativity)
|
| 207 |
|
| 208 |
+
**Root Cause:** LLM tries to load `table_data.csv` when data is IN the question
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
|
| 210 |
+
**Solution:** This is NOT technical - LLM needs better prompts or tool selection
|
| 211 |
|
| 212 |
+
**Approaches:**
|
| 213 |
+
1. Improve system prompt to recognize data in questions
|
| 214 |
+
2. Add hint in question preprocessing
|
| 215 |
+
3. Special handling for markdown tables in questions
|
| 216 |
|
| 217 |
+
**Current workaround:** System correctly identifies as "no_evidence" and doesn't crash
|
|
|
|
|
|
|
|
|
|
| 218 |
|
| 219 |
+
**Status:** Defer to LLM quality improvements (Phase 5)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 220 |
|
| 221 |
---
|
| 222 |
|
| 223 |
+
### Phase 5: LLM Quality Improvements
|
|
|
|
|
|
|
| 224 |
|
| 225 |
+
**Goal:** Convert "Unable to answer" → correct answers
|
| 226 |
|
| 227 |
+
**Target questions (by category):**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 228 |
|
| 229 |
+
**Knowledge/Research (9 questions):** #2, #4, #11, #14, #16, #17, #18, #19
|
| 230 |
+
**Reasoning/Calculation (2 questions):** #1, #20
|
| 231 |
+
**Vision+Reasoning (1 question):** #7
|
| 232 |
|
| 233 |
+
**Approaches:**
|
| 234 |
+
1. **Better prompts** - Emphasize exact answer format
|
| 235 |
+
2. **Tool selection hints** - Guide LLM to use appropriate tools
|
| 236 |
+
3. **Few-shot examples** - Show LLM expected answer format
|
| 237 |
+
4. **Chain-of-thought** - Encourage step-by-step reasoning
|
|
|
|
|
|
|
|
|
|
| 238 |
|
| 239 |
+
**Implementation:**
|
| 240 |
+
- Update `synthesize_answer()` prompt
|
| 241 |
+
- Add answer format examples to system prompt
|
| 242 |
+
- Improve tool descriptions for better selection
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 243 |
|
| 244 |
---
|
| 245 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 246 |
## Success Criteria
|
| 247 |
|
| 248 |
+
### Phase 1: YouTube Support
|
| 249 |
+
- [ ] YouTube transcript tool implemented
|
| 250 |
+
- [ ] Question #3 answered correctly (bird species = "3")
|
| 251 |
+
- [ ] Question #5 answered correctly (Teal'c quote = "Extremely")
|
| 252 |
+
- [ ] **Score: 10% → 40% (4/20)** ✅ TARGET REACHED
|
| 253 |
+
|
| 254 |
+
### Phase 2: MP3 Support
|
| 255 |
+
- [ ] Audio transcription tool implemented
|
| 256 |
+
- [ ] Question #10 answered correctly (pie ingredients)
|
| 257 |
+
- [ ] Question #13 answered correctly (page numbers)
|
| 258 |
+
- [ ] **Score: 40% → 50% (10/20)** ✅ EXCEEDS TARGET
|
| 259 |
+
|
| 260 |
+
### Phase 3: Python Execution
|
| 261 |
+
- [ ] Code execution tool implemented (sandboxed)
|
| 262 |
+
- [ ] Question #12 answered correctly (output = "0")
|
| 263 |
+
- [ ] **Score: 50% → 55% (11/20)**
|
| 264 |
+
|
| 265 |
+
### Phase 4: CSV Table
|
| 266 |
+
- [ ] LLM recognizes data in question
|
| 267 |
+
- [ ] Question #6 answered correctly ("b, e")
|
| 268 |
+
- [ ] **Score: 55% → 60% (12/20)**
|
| 269 |
+
|
| 270 |
+
### Phase 5: LLM Quality
|
| 271 |
+
- [ ] "Unable to answer" reduced by 50%
|
| 272 |
+
- [ ] At least 3 more knowledge questions correct
|
| 273 |
+
- [ ] **Score: 60% → 75%+ (15/20)**
|
| 274 |
|
| 275 |
+
## Files to Modify
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 276 |
|
| 277 |
+
### Phase 1: YouTube
|
| 278 |
+
1. **requirements.txt** - Add `youtube-transcript-api`
|
| 279 |
+
2. **src/tools/youtube.py** (NEW) - YouTube transcript extraction
|
| 280 |
+
3. **src/tools/__init__.py** - Register youtube_transcript tool
|
| 281 |
|
| 282 |
+
### Phase 2: MP3 Audio
|
| 283 |
+
1. **requirements.txt** - Add `openai-whisper` or HF audio
|
| 284 |
+
2. **src/tools/audio.py** (NEW) - Audio transcription
|
| 285 |
+
3. **src/tools/__init__.py** - Register transcribe_audio tool
|
| 286 |
|
| 287 |
+
### Phase 3-5: LLM Quality
|
| 288 |
+
1. **src/agent/graph.py** - Update prompts
|
| 289 |
+
2. **src/tools/__init__.py** - Improve tool descriptions
|
| 290 |
|
| 291 |
+
## Removed (Not Relevant)
|
| 292 |
|
| 293 |
+
- ~~Phase 0: Vision API validation~~ (already using Gemma 3)
|
| 294 |
+
- ~~Phase 1: HuggingFace vision~~ (not current priority)
|
| 295 |
+
- ~~Phase 2: Smoke tests~~ (already working)
|
| 296 |
+
- ~~Phase 3: GAIA evaluation~~ (running successfully)
|
| 297 |
+
- ~~Phase 5: Groq vision~~ (fallback archived)
|
| 298 |
+
- ~~Phase 6: Final verification~~ (premature)
|
| 299 |
+
- ~~Phase 7: File attachment~~ (already implemented)
|
| 300 |
|
| 301 |
+
## Decision Gates
|
|
|
|
|
|
|
| 302 |
|
| 303 |
+
**Gate 1 (YouTube):** Does transcript solve both video questions?
|
| 304 |
+
- **YES:** 40% score, proceed to Phase 2
|
| 305 |
+
- **NO:** Try frame extraction approach
|
| 306 |
|
| 307 |
+
**Gate 2 (MP3):** Does transcription solve both audio questions?
|
| 308 |
+
- **YES:** 50% score, proceed to Phase 3
|
| 309 |
+
- **NO:** Try different audio model
|
| 310 |
|
| 311 |
+
**Gate 3 (Target):** Have we reached 30% (6/20)?
|
| 312 |
+
- **YES:** ✅ SUCCESS - course target met
|
| 313 |
+
- **NO:** Continue to Phase 4-5
|
|
|
|
| 314 |
|
| 315 |
## Next Actions
|
| 316 |
|
| 317 |
+
**Start with Phase 1 (YouTube):**
|
| 318 |
|
| 319 |
+
1. [ ] Install youtube-transcript-api
|
| 320 |
+
2. [ ] Create src/tools/youtube.py
|
| 321 |
+
3. [ ] Add youtube_transcript to TOOLS
|
| 322 |
+
4. [ ] Test on question #3: `a1e91b78-d3d8-4675-bb8d-62741b4b68a6`
|
| 323 |
+
5. [ ] Run full evaluation
|
| 324 |
+
6. [ ] Verify 40% score (4/20 correct)
|
| 325 |
|
| 326 |
+
**After YouTube:** Proceed to MP3 support (Phase 2)
|
| 327 |
|
| 328 |
+
---
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 329 |
|
| 330 |
+
## Backup Options
|
|
|
|
|
|
|
|
|
|
| 331 |
|
| 332 |
+
If YouTube transcript doesn't work:
|
| 333 |
+
- **Plan B:** Extract video frames, analyze with vision tool
|
| 334 |
+
- **Plan C:** Skip video questions, focus on other fixes
|
| 335 |
|
| 336 |
+
If MP3 transcription doesn't work:
|
| 337 |
+
- **Plan B:** Use HuggingFace audio models
|
| 338 |
+
- **Plan C:** Skip audio questions, focus on LLM quality
|
|
|
|
|
|
|
|
|
README.md
CHANGED
|
@@ -347,7 +347,7 @@ ENABLE_LLM_FALLBACK=false # Disable fallback for debugging single provider
|
|
| 347 |
|
| 348 |
**Test Coverage:** 99 passing tests (~2min 40sec runtime)
|
| 349 |
|
| 350 |
-
> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](docs/gaia_submission_guide.md) for distinction between Course and Official GAIA leaderboards.
|
| 351 |
|
| 352 |
## Workflow
|
| 353 |
|
|
|
|
| 347 |
|
| 348 |
**Test Coverage:** 99 passing tests (~2min 40sec runtime)
|
| 349 |
|
| 350 |
+
> **Note:** This project implements the **Course Leaderboard** (20 questions, 30% target). See [GAIA Submission Guide](../agentbee/docs/gaia_submission_guide.md) for distinction between Course and Official GAIA leaderboards.
|
| 351 |
|
| 352 |
## Workflow
|
| 353 |
|
_template_original/README.md
ADDED
|
@@ -0,0 +1,15 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: Template Final Assignment
|
| 3 |
+
emoji: 🕵🏻♂️
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: indigo
|
| 6 |
+
sdk: gradio
|
| 7 |
+
sdk_version: 5.25.2
|
| 8 |
+
app_file: app.py
|
| 9 |
+
pinned: false
|
| 10 |
+
hf_oauth: true
|
| 11 |
+
# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
|
| 12 |
+
hf_oauth_expiration_minutes: 480
|
| 13 |
+
---
|
| 14 |
+
|
| 15 |
+
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
_template_original/app.py
ADDED
|
@@ -0,0 +1,196 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import gradio as gr
|
| 3 |
+
import requests
|
| 4 |
+
import inspect
|
| 5 |
+
import pandas as pd
|
| 6 |
+
|
| 7 |
+
# (Keep Constants as is)
|
| 8 |
+
# --- Constants ---
|
| 9 |
+
DEFAULT_API_URL = "https://agents-course-unit4-scoring.hf.space"
|
| 10 |
+
|
| 11 |
+
# --- Basic Agent Definition ---
|
| 12 |
+
# ----- THIS IS WERE YOU CAN BUILD WHAT YOU WANT ------
|
| 13 |
+
class BasicAgent:
|
| 14 |
+
def __init__(self):
|
| 15 |
+
print("BasicAgent initialized.")
|
| 16 |
+
def __call__(self, question: str) -> str:
|
| 17 |
+
print(f"Agent received question (first 50 chars): {question[:50]}...")
|
| 18 |
+
fixed_answer = "This is a default answer."
|
| 19 |
+
print(f"Agent returning fixed answer: {fixed_answer}")
|
| 20 |
+
return fixed_answer
|
| 21 |
+
|
| 22 |
+
def run_and_submit_all( profile: gr.OAuthProfile | None):
|
| 23 |
+
"""
|
| 24 |
+
Fetches all questions, runs the BasicAgent on them, submits all answers,
|
| 25 |
+
and displays the results.
|
| 26 |
+
"""
|
| 27 |
+
# --- Determine HF Space Runtime URL and Repo URL ---
|
| 28 |
+
space_id = os.getenv("SPACE_ID") # Get the SPACE_ID for sending link to the code
|
| 29 |
+
|
| 30 |
+
if profile:
|
| 31 |
+
username= f"{profile.username}"
|
| 32 |
+
print(f"User logged in: {username}")
|
| 33 |
+
else:
|
| 34 |
+
print("User not logged in.")
|
| 35 |
+
return "Please Login to Hugging Face with the button.", None
|
| 36 |
+
|
| 37 |
+
api_url = DEFAULT_API_URL
|
| 38 |
+
questions_url = f"{api_url}/questions"
|
| 39 |
+
submit_url = f"{api_url}/submit"
|
| 40 |
+
|
| 41 |
+
# 1. Instantiate Agent ( modify this part to create your agent)
|
| 42 |
+
try:
|
| 43 |
+
agent = BasicAgent()
|
| 44 |
+
except Exception as e:
|
| 45 |
+
print(f"Error instantiating agent: {e}")
|
| 46 |
+
return f"Error initializing agent: {e}", None
|
| 47 |
+
# In the case of an app running as a hugging Face space, this link points toward your codebase ( usefull for others so please keep it public)
|
| 48 |
+
agent_code = f"https://huggingface.co/spaces/{space_id}/tree/main"
|
| 49 |
+
print(agent_code)
|
| 50 |
+
|
| 51 |
+
# 2. Fetch Questions
|
| 52 |
+
print(f"Fetching questions from: {questions_url}")
|
| 53 |
+
try:
|
| 54 |
+
response = requests.get(questions_url, timeout=15)
|
| 55 |
+
response.raise_for_status()
|
| 56 |
+
questions_data = response.json()
|
| 57 |
+
if not questions_data:
|
| 58 |
+
print("Fetched questions list is empty.")
|
| 59 |
+
return "Fetched questions list is empty or invalid format.", None
|
| 60 |
+
print(f"Fetched {len(questions_data)} questions.")
|
| 61 |
+
except requests.exceptions.RequestException as e:
|
| 62 |
+
print(f"Error fetching questions: {e}")
|
| 63 |
+
return f"Error fetching questions: {e}", None
|
| 64 |
+
except requests.exceptions.JSONDecodeError as e:
|
| 65 |
+
print(f"Error decoding JSON response from questions endpoint: {e}")
|
| 66 |
+
print(f"Response text: {response.text[:500]}")
|
| 67 |
+
return f"Error decoding server response for questions: {e}", None
|
| 68 |
+
except Exception as e:
|
| 69 |
+
print(f"An unexpected error occurred fetching questions: {e}")
|
| 70 |
+
return f"An unexpected error occurred fetching questions: {e}", None
|
| 71 |
+
|
| 72 |
+
# 3. Run your Agent
|
| 73 |
+
results_log = []
|
| 74 |
+
answers_payload = []
|
| 75 |
+
print(f"Running agent on {len(questions_data)} questions...")
|
| 76 |
+
for item in questions_data:
|
| 77 |
+
task_id = item.get("task_id")
|
| 78 |
+
question_text = item.get("question")
|
| 79 |
+
if not task_id or question_text is None:
|
| 80 |
+
print(f"Skipping item with missing task_id or question: {item}")
|
| 81 |
+
continue
|
| 82 |
+
try:
|
| 83 |
+
submitted_answer = agent(question_text)
|
| 84 |
+
answers_payload.append({"task_id": task_id, "submitted_answer": submitted_answer})
|
| 85 |
+
results_log.append({"Task ID": task_id, "Question": question_text, "Submitted Answer": submitted_answer})
|
| 86 |
+
except Exception as e:
|
| 87 |
+
print(f"Error running agent on task {task_id}: {e}")
|
| 88 |
+
results_log.append({"Task ID": task_id, "Question": question_text, "Submitted Answer": f"AGENT ERROR: {e}"})
|
| 89 |
+
|
| 90 |
+
if not answers_payload:
|
| 91 |
+
print("Agent did not produce any answers to submit.")
|
| 92 |
+
return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)
|
| 93 |
+
|
| 94 |
+
# 4. Prepare Submission
|
| 95 |
+
submission_data = {"username": username.strip(), "agent_code": agent_code, "answers": answers_payload}
|
| 96 |
+
status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
|
| 97 |
+
print(status_update)
|
| 98 |
+
|
| 99 |
+
# 5. Submit
|
| 100 |
+
print(f"Submitting {len(answers_payload)} answers to: {submit_url}")
|
| 101 |
+
try:
|
| 102 |
+
response = requests.post(submit_url, json=submission_data, timeout=60)
|
| 103 |
+
response.raise_for_status()
|
| 104 |
+
result_data = response.json()
|
| 105 |
+
final_status = (
|
| 106 |
+
f"Submission Successful!\n"
|
| 107 |
+
f"User: {result_data.get('username')}\n"
|
| 108 |
+
f"Overall Score: {result_data.get('score', 'N/A')}% "
|
| 109 |
+
f"({result_data.get('correct_count', '?')}/{result_data.get('total_attempted', '?')} correct)\n"
|
| 110 |
+
f"Message: {result_data.get('message', 'No message received.')}"
|
| 111 |
+
)
|
| 112 |
+
print("Submission successful.")
|
| 113 |
+
results_df = pd.DataFrame(results_log)
|
| 114 |
+
return final_status, results_df
|
| 115 |
+
except requests.exceptions.HTTPError as e:
|
| 116 |
+
error_detail = f"Server responded with status {e.response.status_code}."
|
| 117 |
+
try:
|
| 118 |
+
error_json = e.response.json()
|
| 119 |
+
error_detail += f" Detail: {error_json.get('detail', e.response.text)}"
|
| 120 |
+
except requests.exceptions.JSONDecodeError:
|
| 121 |
+
error_detail += f" Response: {e.response.text[:500]}"
|
| 122 |
+
status_message = f"Submission Failed: {error_detail}"
|
| 123 |
+
print(status_message)
|
| 124 |
+
results_df = pd.DataFrame(results_log)
|
| 125 |
+
return status_message, results_df
|
| 126 |
+
except requests.exceptions.Timeout:
|
| 127 |
+
status_message = "Submission Failed: The request timed out."
|
| 128 |
+
print(status_message)
|
| 129 |
+
results_df = pd.DataFrame(results_log)
|
| 130 |
+
return status_message, results_df
|
| 131 |
+
except requests.exceptions.RequestException as e:
|
| 132 |
+
status_message = f"Submission Failed: Network error - {e}"
|
| 133 |
+
print(status_message)
|
| 134 |
+
results_df = pd.DataFrame(results_log)
|
| 135 |
+
return status_message, results_df
|
| 136 |
+
except Exception as e:
|
| 137 |
+
status_message = f"An unexpected error occurred during submission: {e}"
|
| 138 |
+
print(status_message)
|
| 139 |
+
results_df = pd.DataFrame(results_log)
|
| 140 |
+
return status_message, results_df
|
| 141 |
+
|
| 142 |
+
|
| 143 |
+
# --- Build Gradio Interface using Blocks ---
|
| 144 |
+
with gr.Blocks() as demo:
|
| 145 |
+
gr.Markdown("# Basic Agent Evaluation Runner")
|
| 146 |
+
gr.Markdown(
|
| 147 |
+
"""
|
| 148 |
+
**Instructions:**
|
| 149 |
+
|
| 150 |
+
1. Please clone this space, then modify the code to define your agent's logic, the tools, the necessary packages, etc ...
|
| 151 |
+
2. Log in to your Hugging Face account using the button below. This uses your HF username for submission.
|
| 152 |
+
3. Click 'Run Evaluation & Submit All Answers' to fetch questions, run your agent, submit answers, and see the score.
|
| 153 |
+
|
| 154 |
+
---
|
| 155 |
+
**Disclaimers:**
|
| 156 |
+
Once clicking on the "submit button, it can take quite some time ( this is the time for the agent to go through all the questions).
|
| 157 |
+
This space provides a basic setup and is intentionally sub-optimal to encourage you to develop your own, more robust solution. For instance for the delay process of the submit button, a solution could be to cache the answers and submit in a seperate action or even to answer the questions in async.
|
| 158 |
+
"""
|
| 159 |
+
)
|
| 160 |
+
|
| 161 |
+
gr.LoginButton()
|
| 162 |
+
|
| 163 |
+
run_button = gr.Button("Run Evaluation & Submit All Answers")
|
| 164 |
+
|
| 165 |
+
status_output = gr.Textbox(label="Run Status / Submission Result", lines=5, interactive=False)
|
| 166 |
+
# Removed max_rows=10 from DataFrame constructor
|
| 167 |
+
results_table = gr.DataFrame(label="Questions and Agent Answers", wrap=True)
|
| 168 |
+
|
| 169 |
+
run_button.click(
|
| 170 |
+
fn=run_and_submit_all,
|
| 171 |
+
outputs=[status_output, results_table]
|
| 172 |
+
)
|
| 173 |
+
|
| 174 |
+
if __name__ == "__main__":
|
| 175 |
+
print("\n" + "-"*30 + " App Starting " + "-"*30)
|
| 176 |
+
# Check for SPACE_HOST and SPACE_ID at startup for information
|
| 177 |
+
space_host_startup = os.getenv("SPACE_HOST")
|
| 178 |
+
space_id_startup = os.getenv("SPACE_ID") # Get SPACE_ID at startup
|
| 179 |
+
|
| 180 |
+
if space_host_startup:
|
| 181 |
+
print(f"✅ SPACE_HOST found: {space_host_startup}")
|
| 182 |
+
print(f" Runtime URL should be: https://{space_host_startup}.hf.space")
|
| 183 |
+
else:
|
| 184 |
+
print("ℹ️ SPACE_HOST environment variable not found (running locally?).")
|
| 185 |
+
|
| 186 |
+
if space_id_startup: # Print repo URLs if SPACE_ID is found
|
| 187 |
+
print(f"✅ SPACE_ID found: {space_id_startup}")
|
| 188 |
+
print(f" Repo URL: https://huggingface.co/spaces/{space_id_startup}")
|
| 189 |
+
print(f" Repo Tree URL: https://huggingface.co/spaces/{space_id_startup}/tree/main")
|
| 190 |
+
else:
|
| 191 |
+
print("ℹ️ SPACE_ID environment variable not found (running locally?). Repo URL cannot be determined.")
|
| 192 |
+
|
| 193 |
+
print("-"*(60 + len(" App Starting ")) + "\n")
|
| 194 |
+
|
| 195 |
+
print("Launching Gradio Interface for Basic Agent Evaluation...")
|
| 196 |
+
demo.launch(debug=True, share=False)
|
_template_original/requirements.txt
ADDED
|
@@ -0,0 +1,2 @@
|
|
|
|
|
|
|
|
|
|
| 1 |
+
gradio
|
| 2 |
+
requests
|
brainstorming_phase1_youtube.md
ADDED
|
@@ -0,0 +1,345 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Phase 1 Brainstorming - YouTube Transcript Support
|
| 2 |
+
|
| 3 |
+
**Date:** 2026-01-13
|
| 4 |
+
**Status:** Discussion Phase
|
| 5 |
+
**Goal:** Fix questions #3 and #5 (YouTube videos) → 40% score
|
| 6 |
+
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
## Question Analysis
|
| 10 |
+
|
| 11 |
+
| Question | Task ID | Description | Expected Answer | Type |
|
| 12 |
+
| -------- | -------------------------------------- | ------------------------------- | --------------- | ------------- |
|
| 13 |
+
| #3 | `a1e91b78-d3d8-4675-bb8d-62741b4b68a6` | YouTube video - bird species | "3" | Content-based |
|
| 14 |
+
| #5 | (Teal'c quote) | YouTube video - character quote | "Extremely" | Dialogue |
|
| 15 |
+
|
| 16 |
+
**Conclusion:** Both are content-based questions → transcript approach should work ✅
|
| 17 |
+
|
| 18 |
+
---
|
| 19 |
+
|
| 20 |
+
## Library Options
|
| 21 |
+
|
| 22 |
+
### Option A: youtube-transcript-api ⭐ Recommended
|
| 23 |
+
|
| 24 |
+
- **Pros:** Simple API, actively maintained, no video download needed, fast
|
| 25 |
+
- **Cons:** May fail on videos without captions, regional restrictions
|
| 26 |
+
- **Use case:** Start here for simplicity
|
| 27 |
+
|
| 28 |
+
### Option B: yt-dlp + transcript extraction
|
| 29 |
+
|
| 30 |
+
- **Pros:** More robust, can fall back to auto-generated captions
|
| 31 |
+
- **Cons:** Heavier dependency, slower
|
| 32 |
+
- **Use case:** Backup if Option A has high failure rate
|
| 33 |
+
|
| 34 |
+
### Option C: Direct YouTube API
|
| 35 |
+
|
| 36 |
+
- **Pros:** Most control
|
| 37 |
+
- **Cons:** Requires API key, more complex
|
| 38 |
+
- **Use case:** Probably overkill for this use case
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## Frame Extraction: Corrected Analysis
|
| 43 |
+
|
| 44 |
+
**Key insight:** Frame extraction itself is FAST. The "slow" parts are download + vision API processing.
|
| 45 |
+
|
| 46 |
+
### Actual Timing Breakdown
|
| 47 |
+
|
| 48 |
+
| Step | Time (10-min video) | Notes |
|
| 49 |
+
| -------------------- | ------------------- | -------------------------------------- |
|
| 50 |
+
| **Download** | 30s - 3 min | Network I/O, one-time cost |
|
| 51 |
+
| **Frame extraction** | **5 - 20 sec** | ffmpeg is I/O bound, very efficient ⚡ |
|
| 52 |
+
| **Vision API calls** | 20s - 5 min | Sequential: 600 frames × 2-5s each |
|
| 53 |
+
|
| 54 |
+
**Reality check:** You can extract 600 frames from a local 10-min video in under 15 seconds with ffmpeg. The "slow" part is vision model API calls, not the extraction.
|
| 55 |
+
|
| 56 |
+
**Bottom line:** Frame extraction is cheap compute. Vision processing is expensive compute.
|
| 57 |
+
|
| 58 |
+
### Comparison
|
| 59 |
+
|
| 60 |
+
| Approach | What's Fast | What's Slow | Total Time |
|
| 61 |
+
| -------------------- | ------------------ | ------------------------------------------- | ---------------- |
|
| 62 |
+
| **Transcript** | API call (1-3s) | - | **1-3 seconds** |
|
| 63 |
+
| **Frame Extraction** | Extraction (5-20s) | Download (30s-3min) + Vision API (20s-5min) | **1-10 minutes** |
|
| 64 |
+
|
| 65 |
+
### Do Tools Matter?
|
| 66 |
+
|
| 67 |
+
| Tool | Speed (extraction only) | Verdict |
|
| 68 |
+
| ------- | ----------------------- | --------------- |
|
| 69 |
+
| ffmpeg | ⚡⚡⚡ Fastest (5-10s) | Best choice |
|
| 70 |
+
| OpenCV | ⚡⚡ Fast (10-20s) | Standard choice |
|
| 71 |
+
| moviepy | ⚡ Medium (20-40s) | Python overhead |
|
| 72 |
+
|
| 73 |
+
**For extraction alone:** Tools matter, but all are fast enough.
|
| 74 |
+
|
| 75 |
+
### When Is Frame Extraction Worth It?
|
| 76 |
+
|
| 77 |
+
**Only when:**
|
| 78 |
+
|
| 79 |
+
- Question is purely visual (no audio/transcript available)
|
| 80 |
+
- Visual information is NOT in video thumbnail/title/description
|
| 81 |
+
- You have no other choice
|
| 82 |
+
|
| 83 |
+
**Examples where necessary:**
|
| 84 |
+
|
| 85 |
+
- "What color shirt is the person wearing at 2:35?"
|
| 86 |
+
- "Count the number of cars visible in the video"
|
| 87 |
+
- "Describe the visual style of the opening scene"
|
| 88 |
+
|
| 89 |
+
**For GAIA #3 and #5:**
|
| 90 |
+
|
| 91 |
+
- Both are content-based (species mentioned, dialogue)
|
| 92 |
+
- Transcript is still fastest (1-3s vs 1-10 min total)
|
| 93 |
+
- Frame extraction as fallback is viable (extraction is fast, but vision processing is slow)
|
| 94 |
+
|
| 95 |
+
**Decision:** Transcript-first approach is correct. Frame extraction is viable fallback if transcript unavailable, but total time still 1-10 min due to download + vision API.
|
| 96 |
+
|
| 97 |
+
---
|
| 98 |
+
|
| 99 |
+
## Fallback Strategy
|
| 100 |
+
|
| 101 |
+
**Scenario:** Video has no transcript available
|
| 102 |
+
|
| 103 |
+
**Options:**
|
| 104 |
+
|
| 105 |
+
1. **Return error** → LLM treats as system_error, skips question ✅ Simple
|
| 106 |
+
2. **Download + extract frames** → Use vision tool (heavy, slow)
|
| 107 |
+
3. **Return metadata** (title, description) → LLM infers from context
|
| 108 |
+
4. **Chain approach:** Transcript → Metadata → Frames
|
| 109 |
+
|
| 110 |
+
**Decision:** Start with audio-to-text fallback (Whisper on ZeroGPU) for higher success rate.
|
| 111 |
+
|
| 112 |
+
---
|
| 113 |
+
|
| 114 |
+
## Audio-to-Text Fallback: When No Transcript Available
|
| 115 |
+
|
| 116 |
+
### The Hierarchy
|
| 117 |
+
|
| 118 |
+
```
|
| 119 |
+
YouTube URL
|
| 120 |
+
│
|
| 121 |
+
├─ Has transcript? ✅ → Use youtube-transcript-api (instant, 1-3 sec)
|
| 122 |
+
│
|
| 123 |
+
└─ No transcript? ❌ → Download audio + Whisper (slower, but works)
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
### Whisper Cost Analysis
|
| 127 |
+
|
| 128 |
+
| Option | Cost | Speed | Verdict |
|
| 129 |
+
| --------------- | ---------- | -------------- | ------------------ |
|
| 130 |
+
| OpenAI API | $0.006/min | ⚡⚡⚡ Fastest | If budget OK |
|
| 131 |
+
| **Open Source** | **FREE** | ⚡⚡ Fast | ⭐ **Recommended** |
|
| 132 |
+
| HuggingFace | FREE | ⚡⚡ Fast | Good alternative |
|
| 133 |
+
|
| 134 |
+
**Decision:** Open-source Whisper (free, no API limits, works offline)
|
| 135 |
+
|
| 136 |
+
---
|
| 137 |
+
|
| 138 |
+
### HF Hardware: ZeroGPU ✅
|
| 139 |
+
|
| 140 |
+
| Resource | Available | Whisper Requirements | Verdict |
|
| 141 |
+
| ---------- | ----------- | ------------------------- | --------------------------------- |
|
| 142 |
+
| **CPU** | 4 vCPUs | 1+ cores | ✅ Plenty |
|
| 143 |
+
| **Memory** | 16 GB RAM | 1-10 GB (model-dependent) | ✅ Comfortable |
|
| 144 |
+
| **Disk** | 20 GB | ~150 MB - 1.5 GB | ✅ More than enough |
|
| 145 |
+
| **GPU** | **ZeroGPU** | Optional (faster) | ✅ **Available via subscription** |
|
| 146 |
+
|
| 147 |
+
**ZeroGPU Benefits:**
|
| 148 |
+
|
| 149 |
+
- ✅ Dynamic GPU allocation (5-10x faster than CPU)
|
| 150 |
+
- ✅ Can use larger models (`small`, `medium`) for better accuracy
|
| 151 |
+
- ✅ Still free (subscription benefit)
|
| 152 |
+
|
| 153 |
+
### Performance: CPU vs ZeroGPU
|
| 154 |
+
|
| 155 |
+
| Model | On CPU | On ZeroGPU | Speedup |
|
| 156 |
+
| -------- | --------- | ------------- | ------- |
|
| 157 |
+
| `base` | 30-60 sec | **5-10 sec** | 5-10x |
|
| 158 |
+
| `small` | 1-2 min | **10-20 sec** | 5-10x |
|
| 159 |
+
| `medium` | 3-5 min | **20-40 sec** | 5-10x |
|
| 160 |
+
|
| 161 |
+
**For 5-minute YouTube video on ZeroGPU:**
|
| 162 |
+
|
| 163 |
+
- `base` model: ~5-10 seconds ⚡⚡⚡
|
| 164 |
+
- `small` model: ~10-20 seconds ⚡⚡
|
| 165 |
+
|
| 166 |
+
### Recommended Model for ZeroGPU
|
| 167 |
+
|
| 168 |
+
| Model | Size | Accuracy | Speed (ZeroGPU) | Recommendation |
|
| 169 |
+
| -------- | ------ | --------- | --------------- | ---------------------- |
|
| 170 |
+
| `tiny` | 39 MB | Lower | ~5 sec | Fastest, less accurate |
|
| 171 |
+
| `base` | 74 MB | Good | ~10 sec | Good balance |
|
| 172 |
+
| `small` | 244 MB | Better | ~20 sec | ⭐ **Recommended** |
|
| 173 |
+
| `medium` | 769 MB | Very good | ~40 sec | If accuracy critical |
|
| 174 |
+
|
| 175 |
+
**Choice:** `small` model - best accuracy/speed balance on ZeroGPU
|
| 176 |
+
|
| 177 |
+
### Implementation: Audio-to-Text Fallback
|
| 178 |
+
|
| 179 |
+
```python
|
| 180 |
+
import whisper
|
| 181 |
+
|
| 182 |
+
_MODEL = None # Cache model globally
|
| 183 |
+
|
| 184 |
+
def transcribe_audio(file_path: str) -> str:
|
| 185 |
+
"""Transcribe audio file using Whisper (ZeroGPU)."""
|
| 186 |
+
global _MODEL
|
| 187 |
+
try:
|
| 188 |
+
if _MODEL is None:
|
| 189 |
+
# ZeroGPU auto-detects GPU, no manual device specification
|
| 190 |
+
_MODEL = whisper.load_model("small")
|
| 191 |
+
|
| 192 |
+
result = _MODEL.transcribe(file_path)
|
| 193 |
+
return result["text"]
|
| 194 |
+
except Exception as e:
|
| 195 |
+
return f"ERROR: Transcription failed: {e}"
|
| 196 |
+
```
|
| 197 |
+
|
| 198 |
+
---
|
| 199 |
+
|
| 200 |
+
### Unified Architecture: Phase 1 + Phase 2
|
| 201 |
+
|
| 202 |
+
```
|
| 203 |
+
┌─────────────────────────────────────────────────────────┐
|
| 204 |
+
│ Audio Transcription │
|
| 205 |
+
│ (transcribe_audio function) │
|
| 206 |
+
│ Uses Whisper │
|
| 207 |
+
│ on ZeroGPU │
|
| 208 |
+
└─────────────────────────────────────────────────────────┘
|
| 209 |
+
▲
|
| 210 |
+
│
|
| 211 |
+
┌───────────────────┴───────────────────┐
|
| 212 |
+
│ │
|
| 213 |
+
Phase 1 Phase 2
|
| 214 |
+
YouTube URLs MP3 Files
|
| 215 |
+
│ │
|
| 216 |
+
│ 1. Try youtube-transcript-api │
|
| 217 |
+
│ 2. Fallback: download audio only │
|
| 218 |
+
│ 3. Call transcribe_audio() │
|
| 219 |
+
│ │
|
| 220 |
+
└───────────────────┬───────────────────┘
|
| 221 |
+
│
|
| 222 |
+
Clean transcript
|
| 223 |
+
│
|
| 224 |
+
▼
|
| 225 |
+
LLM analyzes
|
| 226 |
+
```
|
| 227 |
+
|
| 228 |
+
**Benefits:**
|
| 229 |
+
|
| 230 |
+
- Single audio processing codebase
|
| 231 |
+
- `transcribe_audio()` works for both phases
|
| 232 |
+
- Tested on HF ZeroGPU hardware
|
| 233 |
+
- Higher success rate than skip-only approach
|
| 234 |
+
|
| 235 |
+
---
|
| 236 |
+
|
| 237 |
+
## Tool Design - LLM Integration
|
| 238 |
+
|
| 239 |
+
**Current problem:** Vision tool tries to process YouTube URL → fails
|
| 240 |
+
|
| 241 |
+
**Proposed tool description:**
|
| 242 |
+
|
| 243 |
+
```
|
| 244 |
+
"Extract transcript from YouTube video URL. Use when question asks about
|
| 245 |
+
YouTube video content like: dialogue, speech, bird species identification,
|
| 246 |
+
character quotes, or any content discussed in the video. Input: YouTube URL.
|
| 247 |
+
Returns: Full transcript text or error message if transcript unavailable."
|
| 248 |
+
```
|
| 249 |
+
|
| 250 |
+
**Alternative: Special URL handling in `parse_file()`**
|
| 251 |
+
|
| 252 |
+
- Detect YouTube URLs
|
| 253 |
+
- Return tool suggestion: "This is a YouTube URL. Consider using youtube_transcript tool."
|
| 254 |
+
|
| 255 |
+
---
|
| 256 |
+
|
| 257 |
+
## Implementation Considerations
|
| 258 |
+
|
| 259 |
+
### A. Video ID Extraction
|
| 260 |
+
|
| 261 |
+
Handle various YouTube URL formats:
|
| 262 |
+
|
| 263 |
+
- `youtube.com/watch?v=VIDEO_ID`
|
| 264 |
+
- `youtu.be/VIDEO_ID`
|
| 265 |
+
- `youtube.com/shorts/VIDEO_ID`
|
| 266 |
+
|
| 267 |
+
### B. Language Handling
|
| 268 |
+
|
| 269 |
+
- GAIA questions are English → likely English transcripts
|
| 270 |
+
- Question: Should we auto-translate or let LLM handle?
|
| 271 |
+
|
| 272 |
+
### C. Transcript Format
|
| 273 |
+
|
| 274 |
+
- Raw JSON with timestamps vs clean text
|
| 275 |
+
- LLM prefers clean text without timestamps
|
| 276 |
+
- Question: Preserve timestamps for context?
|
| 277 |
+
|
| 278 |
+
### D. Error Types
|
| 279 |
+
|
| 280 |
+
- No transcript available
|
| 281 |
+
- Video private/deleted
|
| 282 |
+
- Rate limiting
|
| 283 |
+
- Regional restriction
|
| 284 |
+
|
| 285 |
+
---
|
| 286 |
+
|
| 287 |
+
## Testing Strategy
|
| 288 |
+
|
| 289 |
+
**Before full evaluation:**
|
| 290 |
+
|
| 291 |
+
1. **Unit test** - Test on actual GAIA YouTube URLs
|
| 292 |
+
2. **Manual test** - Run single question (#3) to verify LLM uses tool correctly
|
| 293 |
+
3. **Integration test** - Verify transcript → answer pipeline
|
| 294 |
+
|
| 295 |
+
**Question:** Do we have access to actual YouTube URLs for pre-testing?
|
| 296 |
+
|
| 297 |
+
---
|
| 298 |
+
|
| 299 |
+
## Edge Cases
|
| 300 |
+
|
| 301 |
+
| Scenario | Handling |
|
| 302 |
+
| --------------------------------- | --------------------------------- |
|
| 303 |
+
| Multiple transcript languages | Pick English or first available |
|
| 304 |
+
| Auto-generated transcript | Accept (less accurate but usable) |
|
| 305 |
+
| YouTube Shorts format | Extract VIDEO_ID from shorts URL |
|
| 306 |
+
| Segmented transcript (by speaker) | Clean to plain text |
|
| 307 |
+
|
| 308 |
+
---
|
| 309 |
+
|
| 310 |
+
## Recommendations
|
| 311 |
+
|
| 312 |
+
1. **Start simple:** youtube-transcript-api with clear error messages
|
| 313 |
+
2. **Fail gracefully:** If no transcript, return structured error → system_error=yes
|
| 314 |
+
3. **Tool description:** Emphasize "YouTube video content" for LLM selection
|
| 315 |
+
4. **Manual test first:** Verify on question #3 before full evaluation
|
| 316 |
+
5. **Success metric:** Both questions correct → 40% score ✅ TARGET REACHED
|
| 317 |
+
|
| 318 |
+
---
|
| 319 |
+
|
| 320 |
+
## Open Questions
|
| 321 |
+
|
| 322 |
+
- [ ] Implement fallback to frame extraction if transcript fails?
|
| 323 |
+
- [ ] Add special YouTube URL detection in `parse_file()`?
|
| 324 |
+
- [ ] Access to actual YouTube URLs for pre-testing?
|
| 325 |
+
- [ ] Simple first vs comprehensive solution?
|
| 326 |
+
|
| 327 |
+
---
|
| 328 |
+
|
| 329 |
+
## Files to Create
|
| 330 |
+
|
| 331 |
+
- `src/tools/youtube.py` - YouTube transcript extraction
|
| 332 |
+
- Update `src/tools/__init__.py` - Register youtube_transcript tool
|
| 333 |
+
- Update `requirements.txt` - Add youtube-transcript-api
|
| 334 |
+
|
| 335 |
+
---
|
| 336 |
+
|
| 337 |
+
## Next Steps (Discussion → Implementation)
|
| 338 |
+
|
| 339 |
+
1. [ ] Confirm approach based on video processing research
|
| 340 |
+
2. [ ] Install youtube-transcript-api
|
| 341 |
+
3. [ ] Create youtube.py with error handling
|
| 342 |
+
4. [ ] Add tool to TOOLS registry
|
| 343 |
+
5. [ ] Manual test on question #3
|
| 344 |
+
6. [ ] Full evaluation
|
| 345 |
+
7. [ ] Verify 40% score (4/20 correct)
|
dev/dev_260102_13_stage2_tool_development.md
CHANGED
|
@@ -140,40 +140,47 @@ Successfully implemented 4 production-ready tools with comprehensive error handl
|
|
| 140 |
|
| 141 |
**Deliverables:**
|
| 142 |
|
| 143 |
-
1. **Web Search Tool** ([src/tools/web_search.py](
|
|
|
|
| 144 |
- Tavily API integration (primary, free tier)
|
| 145 |
- Exa API integration (fallback, paid)
|
| 146 |
-
- Automatic fallback if primary fails
|
| 147 |
-
- 10 passing tests in [test/test_web_search.py](
|
|
|
|
|
|
|
|
|
|
| 148 |
|
| 149 |
-
2. **File Parser Tool** ([src/tools/file_parser.py](../src/tools/file_parser.py))
|
| 150 |
- PDF parsing (PyPDF2)
|
| 151 |
- Excel parsing (openpyxl)
|
| 152 |
- Word parsing (python-docx)
|
| 153 |
-
- Text/CSV parsing (built-in open)
|
| 154 |
- Generic `parse_file()` dispatcher
|
| 155 |
-
- 19 passing tests in [test/test_file_parser.py
|
|
|
|
|
|
|
| 156 |
|
| 157 |
-
3. **Calculator Tool** ([src/tools/calculator.py](../src/tools/calculator.py))
|
| 158 |
- Safe AST-based expression evaluation
|
| 159 |
-
- Whitelisted operations only (no code execution
|
| 160 |
- Mathematical functions (sin, cos, sqrt, factorial, etc.)
|
| 161 |
-
- Security hardened (timeout,
|
| 162 |
-
- 41 passing tests in [test/test_calculator.py](
|
| 163 |
|
| 164 |
-
4. **Vision Tool** ([src/tools/vision.py](
|
| 165 |
-
|
|
|
|
| 166 |
- Gemini 2.0 Flash (primary, free)
|
| 167 |
-
- Claude Sonnet 4.5 (fallback, paid)
|
| 168 |
- Image loading and base64 encoding
|
| 169 |
-
- 15 passing tests in [test/test_vision.py](
|
|
|
|
|
|
|
|
|
|
| 170 |
|
| 171 |
-
5. **Tool Registry** ([src/tools/__init__.py](../src/tools/__init__.py))
|
| 172 |
- Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`
|
| 173 |
- TOOLS dict with metadata (description, parameters, category)
|
| 174 |
- Ready for Stage 3 dynamic tool selection
|
| 175 |
|
| 176 |
-
6. **StateGraph Integration** ([src/agent/graph.py](
|
| 177 |
- Updated `execute_node` to load tool registry
|
| 178 |
- Stage 2: Reports tool availability
|
| 179 |
- Stage 3: Will add dynamic tool selection and execution
|
|
|
|
| 140 |
|
| 141 |
**Deliverables:**
|
| 142 |
|
| 143 |
+
1. **Web Search Tool** ([src/tools/web_search.py](./../src/tools/web_search.pyeb_search.py))
|
| 144 |
+
|
| 145 |
- Tavily API integration (primary, free tier)
|
| 146 |
- Exa API integration (fallback, paid)
|
| 147 |
+
- Automatic fallback if primary fails./../test/test_web_search.py
|
| 148 |
+
- 10 passing tests in [test/test_web_search.py](../../agentbee/test/test_web_search.py)
|
| 149 |
+
./../src/tools/file_parser.py
|
| 150 |
+
|
| 151 |
+
2. **File Parser Tool** ([src/tools/file_parser.py](../../agentbee/src/tools/file_parser.py))
|
| 152 |
|
|
|
|
| 153 |
- PDF parsing (PyPDF2)
|
| 154 |
- Excel parsing (openpyxl)
|
| 155 |
- Word parsing (python-docx)
|
| 156 |
+
- Text/CSV parsing (built-in open)./../test/test_file_parser.py
|
| 157 |
- Generic `parse_file()` dispatcher
|
| 158 |
+
- 19 passing tests in [test/test_file_parser.py./../src/tools/calculator.py_file_parser.py)
|
| 159 |
+
|
| 160 |
+
3. **Calculator Tool** ([src/tools/calculator.py](../../agentbee/src/tools/calculator.py))
|
| 161 |
|
|
|
|
| 162 |
- Safe AST-based expression evaluation
|
| 163 |
+
- Whitelisted operations only (no code execution./../test/test_calculator.py
|
| 164 |
- Mathematical functions (sin, cos, sqrt, factorial, etc.)
|
| 165 |
+
- Security hardened (timeout, complexit./../src/tools/vision.py
|
| 166 |
+
- 41 passing tests in [test/test_calculator.py](../../agentbee/test/test_calculator.py)
|
| 167 |
|
| 168 |
+
4. **Vision Tool** ([src/tools/vision.py](../../agentbee/src/tools/vision.py))
|
| 169 |
+
|
| 170 |
+
- Multimodal image analysis using LLMs./../test/test_vision.py
|
| 171 |
- Gemini 2.0 Flash (primary, free)
|
| 172 |
+
- Claude Sonnet 4.5 (fallback, paid)./../src/tools/**init**.py
|
| 173 |
- Image loading and base64 encoding
|
| 174 |
+
- 15 passing tests in [test/test_vision.py](../../agentbee/test/test_vision.py)
|
| 175 |
+
|
| 176 |
+
5. **Tool Registry** ([src/tools/**init**.py](../../agentbee/src/tools/__init__.py))
|
| 177 |
+
./../src/agent/graph.py
|
| 178 |
|
|
|
|
| 179 |
- Exports all 4 main tools: `search`, `parse_file`, `safe_eval`, `analyze_image`
|
| 180 |
- TOOLS dict with metadata (description, parameters, category)
|
| 181 |
- Ready for Stage 3 dynamic tool selection
|
| 182 |
|
| 183 |
+
6. **StateGraph Integration** ([src/agent/graph.py](../../agentbee/src/agent/graph.py))
|
| 184 |
- Updated `execute_node` to load tool registry
|
| 185 |
- Stage 2: Reports tool availability
|
| 186 |
- Stage 3: Will add dynamic tool selection and execution
|
dev/dev_260102_14_stage3_core_logic.md
CHANGED
|
@@ -64,24 +64,28 @@ Successfully implemented Stage 3 with multi-provider LLM support. Agent now perf
|
|
| 64 |
|
| 65 |
**Deliverables:**
|
| 66 |
|
| 67 |
-
1. **LLM Client Module** ([src/agent/llm_client.py](
|
|
|
|
| 68 |
- Gemini implementation: 3 functions (planning, tool selection, answer synthesis)
|
| 69 |
- Claude implementation: 3 functions (same)
|
| 70 |
- Unified API with automatic fallback
|
| 71 |
- 624 lines of code
|
|
|
|
|
|
|
|
|
|
| 72 |
|
| 73 |
-
2. **Updated Agent Graph** ([src/agent/graph.py](../src/agent/graph.py))
|
| 74 |
- plan_node: Calls `plan_question()` for LLM-based planning
|
| 75 |
- execute_node: Calls `select_tools_with_function_calling()` + executes tools + collects evidence
|
| 76 |
- answer_node: Calls `synthesize_answer()` for factoid generation
|
| 77 |
-
- Updated AgentState with new fields
|
|
|
|
|
|
|
| 78 |
|
| 79 |
-
3. **LLM Integration Tests** ([test/test_llm_integration.py](../test/test_llm_integration.py))
|
| 80 |
- 8 tests covering all 3 LLM functions
|
| 81 |
-
- Tests use mocked LLM responses (provider-
|
| 82 |
- Full workflow test: planning → tool selection → answer synthesis
|
| 83 |
|
| 84 |
-
4. **E2E Test Script** ([test/test_stage3_e2e.py](
|
| 85 |
- Manual test script for real API testing
|
| 86 |
- Requires ANTHROPIC_API_KEY or GOOGLE_API_KEY
|
| 87 |
- Tests simple math and factual questions
|
|
|
|
| 64 |
|
| 65 |
**Deliverables:**
|
| 66 |
|
| 67 |
+
1. **LLM Client Module** ([src/agent/llm_client.py](./../src/agent/llm_client.pylm_client.py))
|
| 68 |
+
|
| 69 |
- Gemini implementation: 3 functions (planning, tool selection, answer synthesis)
|
| 70 |
- Claude implementation: 3 functions (same)
|
| 71 |
- Unified API with automatic fallback
|
| 72 |
- 624 lines of code
|
| 73 |
+
./../src/agent/graph.py
|
| 74 |
+
|
| 75 |
+
2. **Updated Agent Graph** ([src/agent/graph.py](../../agentbee/src/agent/graph.py))
|
| 76 |
|
|
|
|
| 77 |
- plan_node: Calls `plan_question()` for LLM-based planning
|
| 78 |
- execute_node: Calls `select_tools_with_function_calling()` + executes tools + collects evidence
|
| 79 |
- answer_node: Calls `synthesize_answer()` for factoid generation
|
| 80 |
+
- Updated AgentState with new fields./../test/test_llm_integration.py
|
| 81 |
+
|
| 82 |
+
3. **LLM Integration Tests** ([test/test_llm_integration.py](../../agentbee/test/test_llm_integration.py))
|
| 83 |
|
|
|
|
| 84 |
- 8 tests covering all 3 LLM functions
|
| 85 |
+
- Tests use mocked LLM responses (provider-agno./../test/test_stage3_e2e.py
|
| 86 |
- Full workflow test: planning → tool selection → answer synthesis
|
| 87 |
|
| 88 |
+
4. **E2E Test Script** ([test/test_stage3_e2e.py](../../agentbee/test/test_stage3_e2e.py))
|
| 89 |
- Manual test script for real API testing
|
| 90 |
- Requires ANTHROPIC_API_KEY or GOOGLE_API_KEY
|
| 91 |
- Tests simple math and factual questions
|
test/README.md
CHANGED
|
@@ -2,13 +2,14 @@
|
|
| 2 |
|
| 3 |
**Test Files:**
|
| 4 |
|
| 5 |
-
- [test_agent_basic.py](test_agent_basic.py) - Unit tests for Stage 1 foundation
|
|
|
|
| 6 |
- Agent initialization
|
| 7 |
- Settings loading
|
| 8 |
- Basic question processing
|
| 9 |
- StateGraph structure validation
|
| 10 |
|
| 11 |
-
- [test_stage1.py](test_stage1.py) - Stage 1 integration verification
|
| 12 |
- Configuration validation
|
| 13 |
- Agent initialization
|
| 14 |
- End-to-end question processing
|
|
|
|
| 2 |
|
| 3 |
**Test Files:**
|
| 4 |
|
| 5 |
+
- [test_agent_basic.py](../../agentbee/test/test_agent_basic.py) - Unit tests for Stage 1 foundation
|
| 6 |
+
|
| 7 |
- Agent initialization
|
| 8 |
- Settings loading
|
| 9 |
- Basic question processing
|
| 10 |
- StateGraph structure validation
|
| 11 |
|
| 12 |
+
- [test_stage1.py](../../agentbee/test/test_stage1.py) - Stage 1 integration verification
|
| 13 |
- Configuration validation
|
| 14 |
- Agent initialization
|
| 15 |
- End-to-end question processing
|