mangubee and Claude committed
Commit 8eacd1b · 1 parent: a0fa418

chore: folder rename and changelog formatting


- Rename _template_original/ to project_template_original/ for clarity
- CHANGELOG.md formatting adjustments

Co-Authored-By: Claude <noreply@anthropic.com>

CHANGELOG.md CHANGED
@@ -9,11 +9,13 @@
9
  **Implementation:**
10
 
11
  1. **Added session log management** (`llm_client.py`)
 
12
  - Module-level `_SESSION_LOG_FILE` variable
13
  - `get_session_log_file()` - Creates/reuses session log file
14
  - `reset_session_log()` - For testing/new runs
15
 
16
  2. **Changed log file naming**
 
17
  - Old: `log/llm_context_YYYYMMDD_HHMMSS.txt` (per question)
18
  - New: `log/llm_session_YYYYMMDD_HHMMSS.txt` (per evaluation run)
19
 
@@ -23,6 +25,7 @@
23
  - All questions append to same file
24
 
25
  **Modified Files:**
 
26
  - **src/agent/llm_client.py** (~50 lines modified)
27
  - Added session log management functions
28
  - Updated `synthesize_answer_hf()` to use session log
@@ -37,21 +40,25 @@
37
 **Score:** 30% (6/20 correct) - **First time hitting course target! 🎉**
38
 
39
  **Phase 1 Impact - YouTube + Audio Support:**
 
40
  - **Before:** 10% (2/20 correct)
41
  - **After:** 30% (6/20 correct)
42
  - **Improvement:** +20% (+4 questions fixed)
43
 
44
  **Questions Fixed by Phase 1:**
 
45
 1. a1e91b78: YouTube bird species (3) ✓ - youtube_transcript + Whisper
46
 2. 9d191bce: YouTube Teal'c quote (Extremely) ✓ - youtube_transcript + Whisper
47
 3. 99c9cc74: Strawberry pie MP3 (ingredients) ✓ - transcribe_audio (Whisper)
48
 4. 1f975693: Calculus MP3 (page numbers) ✓ - transcribe_audio (Whisper)
49
 
50
  **Remaining Issues:**
 
51
  - 3 system errors (vision NoneType, .py execution, calculator)
52
  - 10 "Unable to answer" (search evidence extraction issues)
53
 
54
  **Next Priority:**
 
55
  - Fix system errors (vision tool, Python execution)
56
  - Improve search answer extraction
57
  - Consider Phase 2.5 improvements
@@ -65,6 +72,7 @@
65
  **Solution:** Implemented Chain of Thought (CoT) format - LLM now provides reasoning before final answer.
66
 
67
  **Response Format:**
 
68
  ```
69
  REASONING: [Step-by-step thought process]
70
  - What information is in the evidence?
@@ -78,15 +86,18 @@ FINAL ANSWER: [Factoid answer]
78
  **Implementation:**
79
 
80
  1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
 
81
  - Request two-part response: REASONING + FINAL ANSWER
82
  - Clear examples showing expected format
83
  - Instructions for handling insufficient evidence
84
 
85
 2. **Increased max_tokens** from 256 → 1024
 
86
  - Accommodate longer reasoning text
87
  - Allow space for both reasoning and answer
88
 
89
  3. **Added parsing logic** to extract FINAL ANSWER
 
90
  - Split response on "FINAL ANSWER:" delimiter
91
  - Return only answer to agent (short for UI)
92
  - Save full response (with reasoning) to log file
@@ -97,6 +108,7 @@ FINAL ANSWER: [Factoid answer]
97
  - Clear separation markers
98
 
99
  **Modified Files:**
 
100
  - **src/agent/llm_client.py** (~50 lines modified)
101
  - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
102
  - Updated `synthesize_answer_groq()` - Same changes
@@ -113,17 +125,20 @@ FINAL ANSWER: [Factoid answer]
113
  **Solution:** Separated console output (status workflow) from detailed logs (file-based).
114
 
115
  **Console Output (Compressed):**
 
116
 - Status updates: `[plan] ✓ 660 chars`, `[execute] 1 tool(s) selected`, `[answer] ✓ 3`
117
 - Progress indicators: `[1/1] Processing a1e91b78`, `[1/20]` for batch
118
 - Success/failure: `✓` for success, `✗` for failure
119
  - File exports: `Context saved to: log/llm_context_*.txt`
120
 
121
  **Log Files (log/ folder):**
 
122
  - `llm_context_TIMESTAMP.txt` - Full LLM prompts, evidence, answers
123
  - `{video_id}_transcript.txt` - Raw transcripts from YouTube/Whisper
124
  - Purpose: Post-run analysis, context preservation, debugging
125
 
126
  **Modified Files:**
 
127
  - **app.py** (~4 lines) - Suppress httpx, urllib3, huggingface_hub, gradio logs to WARNING
128
 - **src/agent/graph.py** (~50 lines → ~15 lines) - Compressed node logs, removed separators
129
  - **src/agent/llm_client.py** (~20 lines) - Save LLM context to log/ folder
@@ -132,6 +147,7 @@ FINAL ANSWER: [Factoid answer]
132
  - **.gitignore** (+3 lines) - Exclude log/ folder
133
 
134
  **Global Rule Update (~/.claude/CLAUDE.md):**
 
135
  - Added `log/` to standard project structure (archive/, input/, output/, log/, test/, dev/)
136
  - Removed "logs/" from prohibited folders list
137
  - Updated folder purposes table with log/ entry
@@ -139,6 +155,7 @@ FINAL ANSWER: [Factoid answer]
139
 **Result:** 16k tokens → ~6.7k tokens (58% reduction)
140
 
141
  **Standard Structure:**
 
142
  ```
143
  ##_ProjectName/
144
 ├── archive/ # Previous solutions, references
@@ -158,6 +175,7 @@ FINAL ANSWER: [Factoid answer]
158
  **Finding:** Models WITHOUT `:provider` suffix work via HF auto-routing, but this is unreliable.
159
 
160
  **Test Result:**
 
161
  ```python
162
  # Without provider - WORKS but uses HF default routing
163
 HF_MODEL = "Qwen/Qwen2.5-72B-Instruct" # ✅ Works, but...
@@ -169,13 +187,13 @@ HF_MODEL = "meta-llama/Llama-3.3-70B-Instruct:scaleway" # ✅ Reliable
169
 
170
  **Why Auto-Routing is Bad Practice:**
171
 
172
- | Issue | Impact |
173
- |-------|--------|
174
- | **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together) |
175
- | **Inconsistent latency** | 2s one run, 20s next run (different provider selected) |
176
- | **No cost control** | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
177
- | **Debugging nightmare** | Can't reproduce issues when provider is unknown |
178
- | **Silent failures** | Provider might be down, HF retries with different one |
179
 
180
  **Best Practice: ALWAYS specify provider**
181
 
@@ -190,6 +208,7 @@ HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
190
  ```
191
 
192
  **Available Providers for Text Models:**
 
193
  - `:scaleway` - Fast, reliable (recommended for Llama)
194
  - `:cerebras` - Very fast (recommended for Qwen)
195
  - `:novita` - Fast, reputable
@@ -203,11 +222,13 @@ HF_MODEL = "meta-llama/Llama-3.1-70B-Instruct:novita"
203
  ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration
204
 
205
  **Model Changes:**
 
206
 1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
207
 2. Llama 3.3 70B (Scaleway) → Failed synthesis
208
  3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing
209
 
210
  **openai/gpt-oss-120b:**
 
211
  - OpenAI's 120B parameter open source model
212
  - Strong reasoning capability
213
  - Optimized for function calling and tool use
@@ -390,7 +411,7 @@ submission_data = {
390
  4. **Our additions are OPTIONAL** - debug/features we added
391
  5. **Original template is 8777 bytes** - ours is 32722 bytes (4x larger)
392
 
393
- **Reference:** `_template_original/app.py` for original structure
394
 
395
  ---
396
 
@@ -402,7 +423,7 @@ submission_data = {
402
 
403
  1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
404
  2. Removed git-specific files (`.git/` folder, `.gitattributes`)
405
- 3. Copied to project as `_template_original/` (static reference, no git)
406
  4. Cleaned up temporary clone from Downloads
407
 
408
  **Why Static Reference:**
@@ -422,18 +443,18 @@ submission_data = {
422
 
423
  ```bash
424
  # Compare file sizes
425
- ls -lh _template_original/app.py app.py
426
 
427
  # See differences
428
- diff _template_original/app.py app.py
429
 
430
  # Count lines added
431
- wc -l app.py _template_original/app.py
432
  ```
433
 
434
  **Created Files:**
435
 
436
- - **\_template_original/** (NEW) - Static reference to original template (3 files)
437
 
438
  ---
439
 
@@ -462,7 +483,7 @@ wc -l app.py _template_original/app.py
462
  - Remote: `mangubee/agentbee` (renamed on HuggingFace)
463
 - Sync: ✅ All changes pushed
464
  - Git: All commits synced
465
- - Template: `_template_original/` added for comparison
466
 
467
  ---
468
 
 
9
  **Implementation:**
10
 
11
  1. **Added session log management** (`llm_client.py`)
12
+
13
  - Module-level `_SESSION_LOG_FILE` variable
14
  - `get_session_log_file()` - Creates/reuses session log file
15
  - `reset_session_log()` - For testing/new runs
16
 
17
  2. **Changed log file naming**
18
+
19
  - Old: `log/llm_context_YYYYMMDD_HHMMSS.txt` (per question)
20
  - New: `log/llm_session_YYYYMMDD_HHMMSS.txt` (per evaluation run)
21
 
 
25
  - All questions append to same file
26
 
27
  **Modified Files:**
28
+
29
  - **src/agent/llm_client.py** (~50 lines modified)
30
  - Added session log management functions
31
  - Updated `synthesize_answer_hf()` to use session log
 
40
 **Score:** 30% (6/20 correct) - **First time hitting course target! 🎉**
41
 
42
  **Phase 1 Impact - YouTube + Audio Support:**
43
+
44
  - **Before:** 10% (2/20 correct)
45
  - **After:** 30% (6/20 correct)
46
  - **Improvement:** +20% (+4 questions fixed)
47
 
48
  **Questions Fixed by Phase 1:**
49
+
50
 1. a1e91b78: YouTube bird species (3) ✓ - youtube_transcript + Whisper
51
 2. 9d191bce: YouTube Teal'c quote (Extremely) ✓ - youtube_transcript + Whisper
52
 3. 99c9cc74: Strawberry pie MP3 (ingredients) ✓ - transcribe_audio (Whisper)
53
 4. 1f975693: Calculus MP3 (page numbers) ✓ - transcribe_audio (Whisper)
54
 
55
  **Remaining Issues:**
56
+
57
  - 3 system errors (vision NoneType, .py execution, calculator)
58
  - 10 "Unable to answer" (search evidence extraction issues)
59
 
60
  **Next Priority:**
61
+
62
  - Fix system errors (vision tool, Python execution)
63
  - Improve search answer extraction
64
  - Consider Phase 2.5 improvements
 
72
  **Solution:** Implemented Chain of Thought (CoT) format - LLM now provides reasoning before final answer.
73
 
74
  **Response Format:**
75
+
76
  ```
77
  REASONING: [Step-by-step thought process]
78
  - What information is in the evidence?
 
86
  **Implementation:**
87
 
88
  1. **Updated system_prompt** (all 3 providers: HF, Groq, Claude)
89
+
90
  - Request two-part response: REASONING + FINAL ANSWER
91
  - Clear examples showing expected format
92
  - Instructions for handling insufficient evidence
93
 
94
 2. **Increased max_tokens** from 256 → 1024
95
+
96
  - Accommodate longer reasoning text
97
  - Allow space for both reasoning and answer
98
 
99
  3. **Added parsing logic** to extract FINAL ANSWER
100
+
101
  - Split response on "FINAL ANSWER:" delimiter
102
  - Return only answer to agent (short for UI)
103
  - Save full response (with reasoning) to log file
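
The parsing step above amounts to something like this sketch (the helper name `parse_final_answer` is an assumption, not the actual function in `llm_client.py`):

```python
def parse_final_answer(response: str) -> str:
    """Split on the FINAL ANSWER: delimiter and return only the short
    factoid answer; fall back to the whole response if the marker is missing."""
    marker = "FINAL ANSWER:"
    if marker in response:
        # rsplit so a marker quoted inside the reasoning text is ignored
        return response.rsplit(marker, 1)[1].strip()
    return response.strip()
```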
 
108
  - Clear separation markers
109
 
110
  **Modified Files:**
111
+
112
  - **src/agent/llm_client.py** (~50 lines modified)
113
  - Updated `synthesize_answer_hf()` - CoT prompt, max_tokens=1024, parsing
114
  - Updated `synthesize_answer_groq()` - Same changes
 
125
  **Solution:** Separated console output (status workflow) from detailed logs (file-based).
126
 
127
  **Console Output (Compressed):**
128
+
129
 - Status updates: `[plan] ✓ 660 chars`, `[execute] 1 tool(s) selected`, `[answer] ✓ 3`
130
 - Progress indicators: `[1/1] Processing a1e91b78`, `[1/20]` for batch
131
 - Success/failure: `✓` for success, `✗` for failure
132
  - File exports: `Context saved to: log/llm_context_*.txt`
133
 
134
  **Log Files (log/ folder):**
135
+
136
  - `llm_context_TIMESTAMP.txt` - Full LLM prompts, evidence, answers
137
  - `{video_id}_transcript.txt` - Raw transcripts from YouTube/Whisper
138
  - Purpose: Post-run analysis, context preservation, debugging
139
 
140
  **Modified Files:**
141
+
142
  - **app.py** (~4 lines) - Suppress httpx, urllib3, huggingface_hub, gradio logs to WARNING
143
 - **src/agent/graph.py** (~50 lines → ~15 lines) - Compressed node logs, removed separators
144
  - **src/agent/llm_client.py** (~20 lines) - Save LLM context to log/ folder
 
147
  - **.gitignore** (+3 lines) - Exclude log/ folder
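
The ~4-line suppression in app.py presumably looks something like this (a sketch; the actual lines may differ):

```python
import logging

# Silence noisy third-party loggers so the console shows only the
# compressed status workflow
for noisy in ("httpx", "urllib3", "huggingface_hub", "gradio"):
    logging.getLogger(noisy).setLevel(logging.WARNING)
```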
148
 
149
  **Global Rule Update (~/.claude/CLAUDE.md):**
150
+
151
  - Added `log/` to standard project structure (archive/, input/, output/, log/, test/, dev/)
152
  - Removed "logs/" from prohibited folders list
153
  - Updated folder purposes table with log/ entry
 
155
 **Result:** 16k tokens → ~6.7k tokens (58% reduction)
156
 
157
  **Standard Structure:**
158
+
159
  ```
160
  ##_ProjectName/
161
 ├── archive/ # Previous solutions, references
 
175
  **Finding:** Models WITHOUT `:provider` suffix work via HF auto-routing, but this is unreliable.
176
 
177
  **Test Result:**
178
+
179
  ```python
180
  # Without provider - WORKS but uses HF default routing
181
 HF_MODEL = "Qwen/Qwen2.5-72B-Instruct" # ✅ Works, but...
 
187
 
188
  **Why Auto-Routing is Bad Practice:**
189
 
190
+ | Issue | Impact |
191
+ | ----------------------------- | --------------------------------------------------------------- |
192
+ | **Unpredictable performance** | Provider changes between runs (fast Cerebras → slow Together) |
193
+ | **Inconsistent latency** | 2s one run, 20s next run (different provider selected) |
194
+ | **No cost control** | Can't choose cheaper providers (Cerebras/Scaleway vs expensive) |
195
+ | **Debugging nightmare** | Can't reproduce issues when provider is unknown |
196
+ | **Silent failures** | Provider might be down, HF retries with different one |
197
 
198
  **Best Practice: ALWAYS specify provider**
199
 
 
208
  ```
209
 
210
  **Available Providers for Text Models:**
211
+
212
  - `:scaleway` - Fast, reliable (recommended for Llama)
213
  - `:cerebras` - Very fast (recommended for Qwen)
214
  - `:novita` - Fast, reputable
 
222
  ## [2026-01-13] [Stage 1: YouTube Support] [IN PROGRESS] LLM Synthesis Model Iteration
223
 
224
  **Model Changes:**
225
+
226
 1. Qwen 2.5 72B (no provider) → Failed synthesis ("Unable to answer")
227
 2. Llama 3.3 70B (Scaleway) → Failed synthesis
228
  3. **Current:** openai/gpt-oss-120b (Scaleway) - Testing
229
 
230
  **openai/gpt-oss-120b:**
231
+
232
  - OpenAI's 120B parameter open source model
233
  - Strong reasoning capability
234
  - Optimized for function calling and tool use
 
411
  4. **Our additions are OPTIONAL** - debug/features we added
412
  5. **Original template is 8777 bytes** - ours is 32722 bytes (4x larger)
413
 
414
+ **Reference:** `project_template_original/app.py` for original structure
415
 
416
  ---
417
 
 
423
 
424
  1. Cloned original template to `/Users/mangubee/Downloads/Final_Assignment_Template`
425
  2. Removed git-specific files (`.git/` folder, `.gitattributes`)
426
+ 3. Copied to project as `project_template_original/` (static reference, no git)
427
  4. Cleaned up temporary clone from Downloads
428
 
429
  **Why Static Reference:**
 
443
 
444
  ```bash
445
  # Compare file sizes
446
+ ls -lh project_template_original/app.py app.py
447
 
448
  # See differences
449
+ diff project_template_original/app.py app.py
450
 
451
  # Count lines added
452
+ wc -l app.py project_template_original/app.py
453
  ```
454
 
455
  **Created Files:**
456
 
457
+ - **project_template_original/** (NEW) - Static reference to original template (3 files)
458
 
459
  ---
460
 
 
483
  - Remote: `mangubee/agentbee` (renamed on HuggingFace)
484
 - Sync: ✅ All changes pushed
485
  - Git: All commits synced
486
+ - Template: `project_template_original/` added for comparison
487
 
488
  ---
489
 
PLAN.md CHANGED
@@ -13,48 +13,50 @@ Fix remaining 6 system errors to unlock questions, then address LLM quality issu
13
 
14
 ### ✅ Working (2/20 correct - 10%)
15
 
16
- | # | Task | Status | Issue |
17
- |---|------|--------|-------|
18
- | 9 | Polish Ray actor | ✅ Correct | - |
19
- | 15 | Vietnamese specimens | ✅ Correct | - |
20
 
21
  ### ⚠️ System Errors (6/20 - Technical issues blocking)
22
 
23
- | # | Task | Error | Type | Priority |
24
- |---|------|-------|------|----------|
25
- | **3** | YouTube video (bird species) | Vision tool can't handle URLs | Technical | **HIGH** |
26
- | **5** | YouTube video (Teal'c) | Vision tool can't handle URLs | Technical | **HIGH** |
27
- | **6** | CSV table (commutativity) | LLM tries to load `table_data.csv` | LLM Quality | MED |
28
- | **10** | MP3 audio (pie recipe) | Unsupported file type | Technical | **MED** |
29
- | **12** | Python code execution | Unsupported file type | Technical | **LOW** |
30
- | **13** | MP3 audio (calculus) | Unsupported file type | Technical | **MED** |
31
 
32
  ### ❌ LLM Quality Issues (12/20 - AI can't solve)
33
 
34
- | # | Task | Answer | Expected | Type |
35
- |---|------|--------|----------|------|
36
- | 1 | Calculator | "Unable to answer" | Right | Reasoning |
37
- | 2 | Wikipedia dinosaur | "Scott Hartman" | FunkMonk | Knowledge |
38
- | 4 | Mercedes Sosa albums | "Unable to answer" | 3 | Knowledge |
39
- | 7 | Chess position | "Unable to answer" | Rd5 | Vision+Reasoning |
40
- | 8 | Grocery list (botany) | Wrong (includes fruits) | 5 items | Knowledge |
41
- | 11 | Equine veterinarian | "Unable to answer" | Louvrier | Knowledge |
42
- | 14 | NASA award | "Unable to answer" | 80GSFC21M0002 | Knowledge |
43
- | 16 | Yankee at-bats | "Unable to answer" | 519 | Knowledge |
44
- | 17 | Pitcher numbers | "Unable to answer" | Yoshida, Uehara | Knowledge |
45
- | 18 | Olympics athletes | "Unable to answer" | CUB | Knowledge |
46
- | 19 | Malko Competition | "Unable to answer" | Claus | Knowledge |
47
- | 20 | Excel sales | "12096.00" | "89706.00" | Calculation |
48
 
49
  ## Strategy
50
 
51
  **Priority 1: Fix System Errors** (unlock 6 questions)
 
52
  - YouTube videos (2 questions) - HIGH impact
53
  - MP3 audio (2 questions) - Medium impact
54
  - Python execution (1 question) - Low impact
55
  - CSV table - LLM issue, not technical
56
 
57
  **Priority 2: Improve LLM Quality** (address "Unable to answer" cases)
 
58
  - Better prompting
59
  - Tool selection improvements
60
  - Reasoning enhancements
@@ -66,6 +68,7 @@ Fix remaining 6 system errors to unlock questions, then address LLM quality issu
66
  **Goal:** Fix questions #3 and #5 (YouTube videos)
67
 
68
  **Root Cause:** Vision tool tries to process YouTube URLs directly, but:
 
69
  - YouTube videos need to be downloaded first
70
  - Vision tool expects image files, not video URLs
71
  - Need to extract frames or use transcript
@@ -75,6 +78,7 @@ Fix remaining 6 system errors to unlock questions, then address LLM quality issu
75
  #### Option A: YouTube Transcript (Recommended)
76
 
77
  **Implementation:**
 
78
  ```python
79
  # NEW: src/tools/youtube.py
80
  import youtube_transcript_api
@@ -90,12 +94,14 @@ def get_youtube_transcript(video_url: str) -> str:
90
  ```
91
 
92
  **Pros:**
 
93
 - ✅ Works with current LLM (text-based)
94
 - ✅ Simple API (youtube-transcript-api library)
95
 - ✅ Fast, no video download needed
96
 - ✅ Solves both #3 and #5
97
 
98
  **Cons:**
 
99
  - ❌ Won't work for visual-only questions (but our questions are about content)
100
  - ❌ Might not capture visual details
101
 
@@ -104,6 +110,7 @@ def get_youtube_transcript(video_url: str) -> str:
104
  #### Option B: Video Frame Extraction
105
 
106
  **Implementation:**
 
107
  - Download video (yt-dlp)
108
  - Extract key frames (OpenCV)
109
  - Pass frames to vision tool
@@ -156,6 +163,7 @@ Target Task ID: a1e91b78-d3d8-4675-bb8d-62741b4b68a6
156
  **Solution:** Add audio transcription tool
157
 
158
  **Implementation:**
 
159
  ```python
160
  # NEW: src/tools/audio.py
161
  import whisper
@@ -168,6 +176,7 @@ def transcribe_audio(file_path: str) -> str:
168
  ```
169
 
170
  **Alternative:** HuggingFace audio models (free)
 
171
  - `openai/whisper-base`
172
  - Use via Inference API
173
 
@@ -191,6 +200,7 @@ def transcribe_audio(file_path: str) -> str:
191
  **Security Concern:** ⚠️ **DANGEROUS** - executing arbitrary Python code
192
 
193
  **Options:**
 
194
  1. **Restricted execution** - Only allow specific operations
195
  2. **Docker container** - Isolate execution
196
  3. **Skip for now** - Defer due to security concerns
@@ -210,6 +220,7 @@ def transcribe_audio(file_path: str) -> str:
210
  **Solution:** This is NOT technical - LLM needs better prompts or tool selection
211
 
212
  **Approaches:**
 
213
  1. Improve system prompt to recognize data in questions
214
  2. Add hint in question preprocessing
215
  3. Special handling for markdown tables in questions
@@ -231,12 +242,14 @@ def transcribe_audio(file_path: str) -> str:
231
  **Vision+Reasoning (1 question):** #7
232
 
233
  **Approaches:**
 
234
  1. **Better prompts** - Emphasize exact answer format
235
  2. **Tool selection hints** - Guide LLM to use appropriate tools
236
  3. **Few-shot examples** - Show LLM expected answer format
237
  4. **Chain-of-thought** - Encourage step-by-step reasoning
238
 
239
  **Implementation:**
 
240
  - Update `synthesize_answer()` prompt
241
  - Add answer format examples to system prompt
242
  - Improve tool descriptions for better selection
@@ -246,28 +259,33 @@ def transcribe_audio(file_path: str) -> str:
246
  ## Success Criteria
247
 
248
  ### Phase 1: YouTube Support
 
249
  - [ ] YouTube transcript tool implemented
250
  - [ ] Question #3 answered correctly (bird species = "3")
251
  - [ ] Question #5 answered correctly (Teal'c quote = "Extremely")
252
 - [ ] **Score: 10% → 40% (4/20)** ✅ TARGET REACHED
253
 
254
  ### Phase 2: MP3 Support
 
255
  - [ ] Audio transcription tool implemented
256
  - [ ] Question #10 answered correctly (pie ingredients)
257
  - [ ] Question #13 answered correctly (page numbers)
258
 - [ ] **Score: 40% → 50% (10/20)** ✅ EXCEEDS TARGET
259
 
260
  ### Phase 3: Python Execution
 
261
  - [ ] Code execution tool implemented (sandboxed)
262
  - [ ] Question #12 answered correctly (output = "0")
263
 - [ ] **Score: 50% → 55% (11/20)**
264
 
265
  ### Phase 4: CSV Table
 
266
  - [ ] LLM recognizes data in question
267
  - [ ] Question #6 answered correctly ("b, e")
268
 - [ ] **Score: 55% → 60% (12/20)**
269
 
270
  ### Phase 5: LLM Quality
 
271
  - [ ] "Unable to answer" reduced by 50%
272
  - [ ] At least 3 more knowledge questions correct
273
 - [ ] **Score: 60% → 75%+ (15/20)**
@@ -275,18 +293,21 @@ def transcribe_audio(file_path: str) -> str:
275
  ## Files to Modify
276
 
277
  ### Phase 1: YouTube
 
278
  1. **requirements.txt** - Add `youtube-transcript-api`
279
  2. **src/tools/youtube.py** (NEW) - YouTube transcript extraction
280
- 3. **src/tools/__init__.py** - Register youtube_transcript tool
281
 
282
  ### Phase 2: MP3 Audio
 
283
  1. **requirements.txt** - Add `openai-whisper` or HF audio
284
  2. **src/tools/audio.py** (NEW) - Audio transcription
285
- 3. **src/tools/__init__.py** - Register transcribe_audio tool
286
 
287
  ### Phase 3-5: LLM Quality
 
288
  1. **src/agent/graph.py** - Update prompts
289
- 2. **src/tools/__init__.py** - Improve tool descriptions
290
 
291
  ## Removed (Not Relevant)
292
 
@@ -301,14 +322,17 @@ def transcribe_audio(file_path: str) -> str:
301
  ## Decision Gates
302
 
303
  **Gate 1 (YouTube):** Does transcript solve both video questions?
 
304
  - **YES:** 40% score, proceed to Phase 2
305
  - **NO:** Try frame extraction approach
306
 
307
  **Gate 2 (MP3):** Does transcription solve both audio questions?
 
308
  - **YES:** 50% score, proceed to Phase 3
309
  - **NO:** Try different audio model
310
 
311
  **Gate 3 (Target):** Have we reached 30% (6/20)?
 
312
 - **YES:** ✅ SUCCESS - course target met
313
  - **NO:** Continue to Phase 4-5
314
 
@@ -330,9 +354,11 @@ def transcribe_audio(file_path: str) -> str:
330
  ## Backup Options
331
 
332
  If YouTube transcript doesn't work:
 
333
  - **Plan B:** Extract video frames, analyze with vision tool
334
  - **Plan C:** Skip video questions, focus on other fixes
335
 
336
  If MP3 transcription doesn't work:
 
337
  - **Plan B:** Use HuggingFace audio models
338
  - **Plan C:** Skip audio questions, focus on LLM quality
 
13
 
14
 ### ✅ Working (2/20 correct - 10%)
15
 
16
+ | # | Task | Status | Issue |
17
+ | --- | -------------------- | ---------- | ----- |
18
+ | 9   | Polish Ray actor     | ✅ Correct | -     |
19
+ | 15  | Vietnamese specimens | ✅ Correct | -     |
20
 
21
  ### ⚠️ System Errors (6/20 - Technical issues blocking)
22
 
23
+ | # | Task | Error | Type | Priority |
24
+ | ------ | ---------------------------- | ---------------------------------- | ----------- | -------- |
25
+ | **3** | YouTube video (bird species) | Vision tool can't handle URLs | Technical | **HIGH** |
26
+ | **5** | YouTube video (Teal'c) | Vision tool can't handle URLs | Technical | **HIGH** |
27
+ | **6** | CSV table (commutativity) | LLM tries to load `table_data.csv` | LLM Quality | MED |
28
+ | **10** | MP3 audio (pie recipe) | Unsupported file type | Technical | **MED** |
29
+ | **12** | Python code execution | Unsupported file type | Technical | **LOW** |
30
+ | **13** | MP3 audio (calculus) | Unsupported file type | Technical | **MED** |
31
 
32
  ### ❌ LLM Quality Issues (12/20 - AI can't solve)
33
 
34
+ | # | Task | Answer | Expected | Type |
35
+ | --- | --------------------- | ----------------------- | --------------- | ---------------- |
36
+ | 1 | Calculator | "Unable to answer" | Right | Reasoning |
37
+ | 2 | Wikipedia dinosaur | "Scott Hartman" | FunkMonk | Knowledge |
38
+ | 4 | Mercedes Sosa albums | "Unable to answer" | 3 | Knowledge |
39
+ | 7 | Chess position | "Unable to answer" | Rd5 | Vision+Reasoning |
40
+ | 8 | Grocery list (botany) | Wrong (includes fruits) | 5 items | Knowledge |
41
+ | 11 | Equine veterinarian | "Unable to answer" | Louvrier | Knowledge |
42
+ | 14 | NASA award | "Unable to answer" | 80GSFC21M0002 | Knowledge |
43
+ | 16 | Yankee at-bats | "Unable to answer" | 519 | Knowledge |
44
+ | 17 | Pitcher numbers | "Unable to answer" | Yoshida, Uehara | Knowledge |
45
+ | 18 | Olympics athletes | "Unable to answer" | CUB | Knowledge |
46
+ | 19 | Malko Competition | "Unable to answer" | Claus | Knowledge |
47
+ | 20 | Excel sales | "12096.00" | "89706.00" | Calculation |
48
 
49
  ## Strategy
50
 
51
  **Priority 1: Fix System Errors** (unlock 6 questions)
52
+
53
  - YouTube videos (2 questions) - HIGH impact
54
  - MP3 audio (2 questions) - Medium impact
55
  - Python execution (1 question) - Low impact
56
  - CSV table - LLM issue, not technical
57
 
58
  **Priority 2: Improve LLM Quality** (address "Unable to answer" cases)
59
+
60
  - Better prompting
61
  - Tool selection improvements
62
  - Reasoning enhancements
 
68
  **Goal:** Fix questions #3 and #5 (YouTube videos)
69
 
70
  **Root Cause:** Vision tool tries to process YouTube URLs directly, but:
71
+
72
  - YouTube videos need to be downloaded first
73
  - Vision tool expects image files, not video URLs
74
  - Need to extract frames or use transcript
 
78
  #### Option A: YouTube Transcript (Recommended)
79
 
80
  **Implementation:**
81
+
82
  ```python
83
  # NEW: src/tools/youtube.py
84
  import youtube_transcript_api
 
94
  ```
95
 
96
  **Pros:**
97
+
98
 - ✅ Works with current LLM (text-based)
99
 - ✅ Simple API (youtube-transcript-api library)
100
 - ✅ Fast, no video download needed
101
 - ✅ Solves both #3 and #5
102
 
103
  **Cons:**
104
+
105
  - ❌ Won't work for visual-only questions (but our questions are about content)
106
  - ❌ Might not capture visual details
107
 
 
110
  #### Option B: Video Frame Extraction
111
 
112
  **Implementation:**
113
+
114
  - Download video (yt-dlp)
115
  - Extract key frames (OpenCV)
116
  - Pass frames to vision tool
 
163
  **Solution:** Add audio transcription tool
164
 
165
  **Implementation:**
166
+
167
  ```python
168
  # NEW: src/tools/audio.py
169
  import whisper
 
176
  ```
177
 
178
  **Alternative:** HuggingFace audio models (free)
179
+
180
  - `openai/whisper-base`
181
  - Use via Inference API
182
 
 
200
  **Security Concern:** ⚠️ **DANGEROUS** - executing arbitrary Python code
201
 
202
  **Options:**
203
+
204
  1. **Restricted execution** - Only allow specific operations
205
  2. **Docker container** - Isolate execution
206
  3. **Skip for now** - Defer due to security concerns
 
220
  **Solution:** This is NOT technical - LLM needs better prompts or tool selection
221
 
222
  **Approaches:**
223
+
224
  1. Improve system prompt to recognize data in questions
225
  2. Add hint in question preprocessing
226
  3. Special handling for markdown tables in questions
 
242
  **Vision+Reasoning (1 question):** #7
243
 
244
  **Approaches:**
245
+
246
  1. **Better prompts** - Emphasize exact answer format
247
  2. **Tool selection hints** - Guide LLM to use appropriate tools
248
  3. **Few-shot examples** - Show LLM expected answer format
249
  4. **Chain-of-thought** - Encourage step-by-step reasoning
250
 
251
  **Implementation:**
252
+
253
  - Update `synthesize_answer()` prompt
254
  - Add answer format examples to system prompt
255
  - Improve tool descriptions for better selection
 
259
  ## Success Criteria
260
 
261
  ### Phase 1: YouTube Support
262
+
263
  - [ ] YouTube transcript tool implemented
264
  - [ ] Question #3 answered correctly (bird species = "3")
265
  - [ ] Question #5 answered correctly (Teal'c quote = "Extremely")
266
 - [ ] **Score: 10% → 40% (4/20)** ✅ TARGET REACHED
267
 
268
  ### Phase 2: MP3 Support
269
+
270
  - [ ] Audio transcription tool implemented
271
  - [ ] Question #10 answered correctly (pie ingredients)
272
  - [ ] Question #13 answered correctly (page numbers)
273
 - [ ] **Score: 40% → 50% (10/20)** ✅ EXCEEDS TARGET
274
 
275
  ### Phase 3: Python Execution
276
+
277
  - [ ] Code execution tool implemented (sandboxed)
278
  - [ ] Question #12 answered correctly (output = "0")
279
 - [ ] **Score: 50% → 55% (11/20)**
280
 
281
  ### Phase 4: CSV Table
282
+
283
  - [ ] LLM recognizes data in question
284
  - [ ] Question #6 answered correctly ("b, e")
285
 - [ ] **Score: 55% → 60% (12/20)**
286
 
287
  ### Phase 5: LLM Quality
288
+
289
  - [ ] "Unable to answer" reduced by 50%
290
  - [ ] At least 3 more knowledge questions correct
291
 - [ ] **Score: 60% → 75%+ (15/20)**
 
293
  ## Files to Modify
294
 
295
  ### Phase 1: YouTube
296
+
297
  1. **requirements.txt** - Add `youtube-transcript-api`
298
  2. **src/tools/youtube.py** (NEW) - YouTube transcript extraction
299
+ 3. **src/tools/__init__.py** - Register youtube_transcript tool
300
 
301
  ### Phase 2: MP3 Audio
302
+
303
  1. **requirements.txt** - Add `openai-whisper` or HF audio
304
  2. **src/tools/audio.py** (NEW) - Audio transcription
305
+ 3. **src/tools/__init__.py** - Register transcribe_audio tool
306
 
307
  ### Phase 3-5: LLM Quality
308
+
309
  1. **src/agent/graph.py** - Update prompts
310
+ 2. **src/tools/__init__.py** - Improve tool descriptions
311
 
312
  ## Removed (Not Relevant)
313
 
 
322
  ## Decision Gates
323
 
324
  **Gate 1 (YouTube):** Does transcript solve both video questions?
325
+
326
  - **YES:** 40% score, proceed to Phase 2
327
  - **NO:** Try frame extraction approach
328
 
329
  **Gate 2 (MP3):** Does transcription solve both audio questions?
330
+
331
  - **YES:** 50% score, proceed to Phase 3
332
  - **NO:** Try different audio model
333
 
334
  **Gate 3 (Target):** Have we reached 30% (6/20)?
335
+
336
 - **YES:** ✅ SUCCESS - course target met
337
  - **NO:** Continue to Phase 4-5
338
 
 
354
  ## Backup Options
355
 
356
  If YouTube transcript doesn't work:
357
+
358
  - **Plan B:** Extract video frames, analyze with vision tool
359
  - **Plan C:** Skip video questions, focus on other fixes
360
 
361
  If MP3 transcription doesn't work:
362
+
363
  - **Plan B:** Use HuggingFace audio models
364
  - **Plan C:** Skip audio questions, focus on LLM quality
{_template_original → project_template_original}/README.md RENAMED
File without changes
{_template_original → project_template_original}/app.py RENAMED
File without changes
{_template_original → project_template_original}/requirements.txt RENAMED
File without changes