chore: folder rename and changelog formatting
- Rename _template_original/ to project_template_original/ for clarity
- CHANGELOG.md formatting adjustments
Co-Authored-By: Claude <noreply@anthropic.com>
CHANGELOG.md
CHANGED
**Implementation:**

1. **Added session log management** (`llm_client.py`)
   - Module-level `_SESSION_LOG_FILE` variable
   - `get_session_log_file()` - Creates/reuses session log file
   - `reset_session_log()` - For testing/new runs

2. **Changed log file naming**
   - Old: `log/llm_context_YYYYMMDD_HHMMSS.txt` (per question)
   - New: `log/llm_session_YYYYMMDD_HHMMSS.txt` (per evaluation run)
   - All questions append to same file

**Modified Files:**

- **src/agent/llm_client.py** (~50 lines modified)
  - Added session log management functions
  - Updated `synthesize_answer_hf()` to use session log
**Score:** 30% (6/20 correct) - **First time hitting the course target! 🎉**

**Phase 1 Impact - YouTube + Audio Support:**

- **Before:** 10% (2/20 correct)
- **After:** 30% (6/20 correct)
- **Improvement:** +20 percentage points (+4 questions fixed)

**Questions Fixed by Phase 1:**

1. a1e91b78: YouTube bird species (3) ✅ - youtube_transcript + Whisper
2. 9d191bce: YouTube Teal'c quote (Extremely) ✅ - youtube_transcript + Whisper
3. 99c9cc74: Strawberry pie MP3 (ingredients) ✅ - transcribe_audio (Whisper)
4. 1f975693: Calculus MP3 (page numbers) ✅ - transcribe_audio (Whisper)

**Remaining Issues:**

- 3 system errors (vision NoneType, .py execution, calculator)
- 10 "Unable to answer" (search evidence extraction issues)

**Next Priority:**

- Fix system errors (vision tool, Python execution)
- Improve search answer extraction
- Consider Phase 2.5 improvements
**Solution:** Implemented a Chain of Thought (CoT) format - the LLM now provides reasoning before its final answer.

**Response Format:**

```
REASONING: [Step-by-step thought process]
- What information is in the evidence?
...
FINAL ANSWER: [Factoid answer]
```

**Implementation:**

1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
   - Request a two-part response: REASONING + FINAL ANSWER
   - Clear examples showing the expected format
   - Instructions for handling insufficient evidence

2. **Increased max_tokens** from 256 → 1024
   - Accommodates longer reasoning text
   - Leaves room for both reasoning and answer

3. **Added parsing logic** to extract FINAL ANSWER
   - Split the response on the "FINAL ANSWER:" delimiter
   - Return only the answer to the agent (short, for the UI)
   - Save the full response (with reasoning) to the log file
- Clear separation markers

**Modified Files:**

- **src/agent/llm_client.py** (~50 lines modified)
  - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
  - Updated `synthesize_answer_groq()` - Same changes
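The FINAL ANSWER parsing step can be sketched like this (a minimal sketch; the helper name `parse_final_answer` is an assumption, not necessarily what `llm_client.py` calls it):

```python
def parse_final_answer(response: str) -> tuple[str, str]:
    """Split a CoT response into (reasoning, answer) on the FINAL ANSWER: delimiter."""
    delimiter = "FINAL ANSWER:"
    if delimiter in response:
        reasoning, _, answer = response.partition(delimiter)
        return reasoning.strip(), answer.strip()
    # No delimiter found: treat the whole response as the answer
    return "", response.strip()
```

The short answer goes back to the agent for the UI, while the full response (including the reasoning half) is what gets written to the session log.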
**Solution:** Separated console output (status workflow) from detailed logs (file-based).

**Console Output (Compressed):**

- Status updates: `[plan] → 660 chars`, `[execute] 1 tool(s) selected`, `[answer] → 3`
- Progress indicators: `[1/1] Processing a1e91b78`, `[1/20]` for batch runs
- Success/failure: `✅` for success, `❌` for failure
- File exports: `Context saved to: log/llm_context_*.txt`

**Log Files (log/ folder):**

- `llm_context_TIMESTAMP.txt` - Full LLM prompts, evidence, answers
- `{video_id}_transcript.txt` - Raw transcripts from YouTube/Whisper
- Purpose: Post-run analysis, context preservation, debugging

**Modified Files:**

- **app.py** (~4 lines) - Suppress httpx, urllib3, huggingface_hub, gradio logs to WARNING
- **src/agent/graph.py** (~50 lines → ~15 lines) - Compressed node logs, removed separators
- **src/agent/llm_client.py** (~20 lines) - Save LLM context to log/ folder
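The app.py suppression described above amounts to a few lines with the standard `logging` module (a sketch; the exact logger names raised to WARNING are taken from the list above):

```python
import logging

# Silence chatty third-party loggers so the console shows only agent status lines
for noisy in ("httpx", "urllib3", "huggingface_hub", "gradio"):
    logging.getLogger(noisy).setLevel(logging.WARNING)
```

Anything at INFO or DEBUG from those libraries is dropped, while the agent's own status updates still print normally.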
- **.gitignore** (+3 lines) - Exclude log/ folder

**Global Rule Update (~/.claude/CLAUDE.md):**

- Added `log/` to standard project structure (archive/, input/, output/, log/, test/, dev/)
- Removed "logs/" from prohibited folders list
- Updated folder purposes table with log/ entry

**Result:** 16k tokens → ~6.7k tokens (58% reduction)

**Standard Structure:**

```
##_ProjectName/
├── archive/          # Previous solutions, references
...
```
**Finding:** Models WITHOUT a `:provider` suffix work via HF auto-routing, but this is unreliable.

**Test Result:**

```python
# Without provider - WORKS, but uses HF default routing
HF_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # ✅ Works, but...
...
HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway"  # ✅ Reliable
```

**Why Auto-Routing is Bad Practice:**

| Issue                         | Impact                                                          |
| ----------------------------- | --------------------------------------------------------------- |
| **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together)   |
| **Inconsistent latency**      | 2s one run, 20s the next (different provider selected)          |
| **No cost control**           | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
| **Debugging nightmare**       | Can't reproduce issues when the provider is unknown             |
| **Silent failures**           | Provider might be down; HF retries with a different one         |

**Best Practice: ALWAYS specify provider**

```python
HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
```

**Available Providers for Text Models:**

- `:scaleway` - Fast, reliable (recommended for Llama)
- `:cerebras` - Very fast (recommended for Qwen)
- `:novita` - Fast, reputable
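A cheap guard can enforce the "always pin a provider" rule at startup (a hypothetical helper, not part of the project's current code):

```python
def has_pinned_provider(model: str) -> bool:
    """True if an HF model string carries an explicit :provider suffix."""
    _, sep, provider = model.rpartition(":")
    # rpartition returns an empty separator when ":" is absent; a "/" after
    # the colon would mean we split inside the org/model path, not a suffix.
    return bool(sep) and bool(provider) and "/" not in provider
```

Asserting `has_pinned_provider(HF_MODEL)` when the client is constructed would fail fast instead of silently falling back to auto-routing.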
## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration

**Model Changes:**

1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
2. Llama 3.3 70B (Scaleway) → Failed synthesis
3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing

**openai/gpt-oss-120b:**

- OpenAI's 120B-parameter open-source model
- Strong reasoning capability
- Optimized for function calling and tool use
4. **Our additions are OPTIONAL** - debug features we added
5. **Original template is 8777 bytes** - ours is 32722 bytes (4x larger)

**Reference:** `project_template_original/app.py` for the original structure

---

1. Cloned the original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
2. Removed git-specific files (`.git/` folder, `.gitattributes`)
3. Copied to the project as `project_template_original/` (static reference, no git)
4. Cleaned up the temporary clone from Downloads

**Why Static Reference:**

```bash
# Compare file sizes
ls -lh project_template_original/app.py app.py

# See differences
diff project_template_original/app.py app.py

# Count lines added
wc -l app.py project_template_original/app.py
```

**Created Files:**

- **project_template_original/** (NEW) - Static reference to the original template (3 files)

---

- Remote: `mangubee/agentbee` (renamed on HuggingFace)
- Sync: ✅ All changes pushed
- Git: All commits synced
- Template: `project_template_original/` added for comparison

---
PLAN.md
CHANGED

Fix remaining 6 system errors to unlock questions, then address LLM quality issues.
### ✅ Working (2/20 correct - 10%)

| #   | Task                 | Status     | Issue |
| --- | -------------------- | ---------- | ----- |
| 9   | Polish Ray actor     | ✅ Correct | -     |
| 15  | Vietnamese specimens | ✅ Correct | -     |

### ⚠️ System Errors (6/20 - Technical issues blocking)

| #      | Task                         | Error                              | Type        | Priority |
| ------ | ---------------------------- | ---------------------------------- | ----------- | -------- |
| **3**  | YouTube video (bird species) | Vision tool can't handle URLs      | Technical   | **HIGH** |
| **5**  | YouTube video (Teal'c)       | Vision tool can't handle URLs      | Technical   | **HIGH** |
| **6**  | CSV table (commutativity)    | LLM tries to load `table_data.csv` | LLM Quality | MED      |
| **10** | MP3 audio (pie recipe)       | Unsupported file type              | Technical   | **MED**  |
| **12** | Python code execution        | Unsupported file type              | Technical   | **LOW**  |
| **13** | MP3 audio (calculus)         | Unsupported file type              | Technical   | **MED**  |

### ❌ LLM Quality Issues (12/20 - AI can't solve)

| #   | Task                  | Answer                  | Expected        | Type             |
| --- | --------------------- | ----------------------- | --------------- | ---------------- |
| 1   | Calculator            | "Unable to answer"      | Right           | Reasoning        |
| 2   | Wikipedia dinosaur    | "Scott Hartman"         | FunkMonk        | Knowledge        |
| 4   | Mercedes Sosa albums  | "Unable to answer"      | 3               | Knowledge        |
| 7   | Chess position        | "Unable to answer"      | Rd5             | Vision+Reasoning |
| 8   | Grocery list (botany) | Wrong (includes fruits) | 5 items         | Knowledge        |
| 11  | Equine veterinarian   | "Unable to answer"      | Louvrier        | Knowledge        |
| 14  | NASA award            | "Unable to answer"      | 80GSFC21M0002   | Knowledge        |
| 16  | Yankee at-bats        | "Unable to answer"      | 519             | Knowledge        |
| 17  | Pitcher numbers       | "Unable to answer"      | Yoshida, Uehara | Knowledge        |
| 18  | Olympics athletes     | "Unable to answer"      | CUB             | Knowledge        |
| 19  | Malko Competition     | "Unable to answer"      | Claus           | Knowledge        |
| 20  | Excel sales           | "12096.00"              | "89706.00"      | Calculation      |
## Strategy

**Priority 1: Fix System Errors** (unlock 6 questions)

- YouTube videos (2 questions) - HIGH impact
- MP3 audio (2 questions) - Medium impact
- Python execution (1 question) - Low impact
- CSV table - LLM issue, not technical

**Priority 2: Improve LLM Quality** (address "Unable to answer" cases)

- Better prompting
- Tool selection improvements
- Reasoning enhancements

**Goal:** Fix questions #3 and #5 (YouTube videos)

**Root Cause:** The vision tool tries to process YouTube URLs directly, but:

- YouTube videos need to be downloaded first
- The vision tool expects image files, not video URLs
- Need to extract frames or use the transcript
#### Option A: YouTube Transcript (Recommended)

**Implementation:**

```python
# NEW: src/tools/youtube.py
import youtube_transcript_api

def get_youtube_transcript(video_url: str) -> str:
    ...
```

**Pros:**

- ✅ Works with the current LLM (text-based)
- ✅ Simple API (youtube-transcript-api library)
- ✅ Fast, no video download needed
- ✅ Solves both #3 and #5

**Cons:**

- ❌ Won't work for visual-only questions (but our questions are about content)
- ❌ Might not capture visual details
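A fuller sketch of Option A might look like this, assuming the classic `YouTubeTranscriptApi.get_transcript` class method of youtube-transcript-api (newer 1.x releases moved to an instance `fetch` API); `extract_video_id` is an illustrative helper, not part of the plan above:

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(video_url: str) -> str:
    """Pull the video id out of watch-page and youtu.be short URLs."""
    parsed = urlparse(video_url)
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query).get("v", [""])[0]

def get_youtube_transcript(video_url: str) -> str:
    """Fetch the caption track and join it into plain text for the LLM."""
    # Deferred import so the tool registry loads even without the library
    from youtube_transcript_api import YouTubeTranscriptApi
    video_id = extract_video_id(video_url)
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(seg["text"] for seg in segments)
```

The joined text is then passed to the synthesis prompt as evidence, the same path other text tools use.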
#### Option B: Video Frame Extraction

**Implementation:**

- Download video (yt-dlp)
- Extract key frames (OpenCV)
- Pass frames to vision tool
**Solution:** Add an audio transcription tool

**Implementation:**

```python
# NEW: src/tools/audio.py
import whisper

def transcribe_audio(file_path: str) -> str:
    ...
```

**Alternative:** HuggingFace audio models (free)

- `openai/whisper-base`
- Use via the Inference API
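Filled in, the local-Whisper variant could look like this (a sketch under assumptions: the "base" model size and the `is_supported_audio` routing helper are choices of this example, not specified by the plan):

```python
import os

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

def is_supported_audio(file_path: str) -> bool:
    """Route only known audio extensions to the transcriber."""
    return os.path.splitext(file_path)[1].lower() in AUDIO_EXTENSIONS

def transcribe_audio(file_path: str) -> str:
    """Transcribe a local audio file with Whisper and return plain text."""
    # Deferred import: loading the model is heavy and needs openai-whisper installed
    import whisper
    model = whisper.load_model("base")  # "base" trades accuracy for speed
    result = model.transcribe(file_path)
    return result["text"].strip()
```

`whisper.load_model(...).transcribe(path)["text"]` is the standard openai-whisper call; swapping to the HF Inference API alternative would replace only the body of `transcribe_audio`.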
**Security Concern:** ⚠️ **DANGEROUS** - executing arbitrary Python code

**Options:**

1. **Restricted execution** - Only allow specific operations
2. **Docker container** - Isolate execution
3. **Skip for now** - Defer due to security concerns
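For option 1, a minimal sketch is to run the file in a fresh interpreter with a timeout and capture stdout. This is NOT a real sandbox - hostile code would still need a container (option 2) or an OS-level profile - so treat it as a stopgap for the trusted benchmark files only:

```python
import subprocess
import sys

def run_python_file(path: str, timeout: int = 10) -> str:
    """Execute a .py file in a separate interpreter and return its stdout."""
    proc = subprocess.run(
        [sys.executable, path],
        capture_output=True,
        text=True,
        timeout=timeout,  # kill runaway scripts
    )
    if proc.returncode != 0:
        return f"ERROR: {proc.stderr.strip()}"
    return proc.stdout.strip()
```

For question #12 the expected behavior is simply capturing the script's printed output so the LLM can quote it as the final answer.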
**Solution:** This is NOT technical - the LLM needs better prompts or tool selection

**Approaches:**

1. Improve the system prompt to recognize data in questions
2. Add a hint in question preprocessing
3. Special handling for markdown tables in questions
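Approach 3 needs a detector for inline tables during preprocessing; one hypothetical sketch (the heuristic - two consecutive pipe-delimited lines - is this example's choice):

```python
import re

# A line "looks like" a markdown table row if it starts and ends with a pipe
_ROW = re.compile(r"^\s*\|.*\|\s*$")

def question_has_markdown_table(question: str) -> bool:
    """True if at least two consecutive lines look like table rows."""
    lines = question.splitlines()
    return any(_ROW.match(a) and _ROW.match(b) for a, b in zip(lines, lines[1:]))
```

When it fires, the preprocessor could prepend a hint such as "the data you need is embedded in the question" instead of letting the LLM hunt for a nonexistent `table_data.csv`.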
**Vision+Reasoning (1 question):** #7

**Approaches:**

1. **Better prompts** - Emphasize exact answer format
2. **Tool selection hints** - Guide the LLM to appropriate tools
3. **Few-shot examples** - Show the LLM the expected answer format
4. **Chain-of-thought** - Encourage step-by-step reasoning

**Implementation:**

- Update the `synthesize_answer()` prompt
- Add answer-format examples to the system prompt
- Improve tool descriptions for better selection
## Success Criteria

### Phase 1: YouTube Support

- [ ] YouTube transcript tool implemented
- [ ] Question #3 answered correctly (bird species = "3")
- [ ] Question #5 answered correctly (Teal'c quote = "Extremely")
- [ ] **Score: 10% → 40% (4/20)** ✅ TARGET REACHED

### Phase 2: MP3 Support

- [ ] Audio transcription tool implemented
- [ ] Question #10 answered correctly (pie ingredients)
- [ ] Question #13 answered correctly (page numbers)
- [ ] **Score: 40% → 50% (10/20)** ✅ EXCEEDS TARGET

### Phase 3: Python Execution

- [ ] Code execution tool implemented (sandboxed)
- [ ] Question #12 answered correctly (output = "0")
- [ ] **Score: 50% → 55% (11/20)**

### Phase 4: CSV Table

- [ ] LLM recognizes data in the question
- [ ] Question #6 answered correctly ("b, e")
- [ ] **Score: 55% → 60% (12/20)**

### Phase 5: LLM Quality

- [ ] "Unable to answer" reduced by 50%
- [ ] At least 3 more knowledge questions correct
- [ ] **Score: 60% → 75%+ (15/20)**
## Files to Modify

### Phase 1: YouTube

1. **requirements.txt** - Add `youtube-transcript-api`
2. **src/tools/youtube.py** (NEW) - YouTube transcript extraction
3. **src/tools/__init__.py** - Register the youtube_transcript tool

### Phase 2: MP3 Audio

1. **requirements.txt** - Add `openai-whisper` or an HF audio model
2. **src/tools/audio.py** (NEW) - Audio transcription
3. **src/tools/__init__.py** - Register the transcribe_audio tool

### Phase 3-5: LLM Quality

1. **src/agent/graph.py** - Update prompts
2. **src/tools/__init__.py** - Improve tool descriptions

## Removed (Not Relevant)
## Decision Gates

**Gate 1 (YouTube):** Does the transcript solve both video questions?

- **YES:** 40% score, proceed to Phase 2
- **NO:** Try the frame-extraction approach

**Gate 2 (MP3):** Does transcription solve both audio questions?

- **YES:** 50% score, proceed to Phase 3
- **NO:** Try a different audio model

**Gate 3 (Target):** Have we reached 30% (6/20)?

- **YES:** ✅ SUCCESS - course target met
- **NO:** Continue to Phase 4-5
## Backup Options

If the YouTube transcript doesn't work:

- **Plan B:** Extract video frames, analyze with the vision tool
- **Plan C:** Skip video questions, focus on other fixes

If MP3 transcription doesn't work:

- **Plan B:** Use HuggingFace audio models
- **Plan C:** Skip audio questions, focus on LLM quality
{_template_original → project_template_original}/README.md
RENAMED - File without changes

{_template_original → project_template_original}/app.py
RENAMED - File without changes

{_template_original → project_template_original}/requirements.txt
RENAMED - File without changes