narcolepticchicken committed · Commit 57a8c02 · verified · 1 Parent(s): 522d111

Upload reports/real_llm_debug_log.md

Files changed (1)
  1. reports/real_llm_debug_log.md +45 -0
reports/real_llm_debug_log.md ADDED
@@ -0,0 +1,45 @@
+ # Real LLM Benchmark Debugging Log
+
+ ## Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B
+
+ We ran eight versions of the real LLM code benchmark. Here is what we learned from each.
+
+ ### V1–V3: Naive approach
+ - Prompt: the raw HumanEval prompt, with the model generating a completion
+ - Extraction: take the generated tokens after the prompt
+ - Result: **0/20 pass**
+ - Reason: The model outputs a complete function (including `def <entry_point>`), but we prepended the HumanEval prompt, which also contains `def <entry_point>`. The result was duplicate definitions and a syntax error.
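The duplicate-definition failure described above can be avoided by checking whether the completion already defines the entry point before prepending the prompt. The following is a minimal sketch, not the code used in the benchmark: `build_solution` is a hypothetical helper, and it assumes the entry point is defined with a plain `def` at the start of a line.

```python
import re


def build_solution(prompt: str, completion: str, entry_point: str) -> str:
    """Combine prompt and completion without duplicating `def <entry_point>`.

    Hypothetical helper for illustration; not part of HumanEval or evalplus.
    """
    # Matches a line that starts the definition of the target function.
    def_re = re.compile(
        rf"^[ \t]*def\s+{re.escape(entry_point)}\s*\(", re.MULTILINE
    )
    if def_re.search(completion):
        # The model wrote the whole function. Keep only what precedes the
        # prompt's own definition (imports, helper functions) and drop the
        # prompt's signature and docstring.
        match = def_re.search(prompt)
        preamble = prompt[: match.start()] if match else ""
        return preamble + completion
    # The model wrote only the body, so the prompt supplies the signature.
    return prompt + completion
```

With this check, a full-function completion yields exactly one `def <entry_point>`, while a body-only completion is still attached to the prompt's signature.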
+
+ ### V4: Chat template + body extraction attempt
+ - Added a chat template (system/assistant)
+ - Attempted to extract only the function body after the `def` line
+ - Result: **0/20 pass**
+ - Reason: Markdown fences were still present, AST parsing was too strict, and body extraction failed on docstrings
+
+ ### V5–V6: Multiple extraction strategies + debug
+ - Added multiple candidates: body-only, stripped, raw
+ - Validated candidates with `ast.parse()`
+ - Result: **0/20 pass**
+ - Reason: We were still prepending the prompt even when the model output contained the full function, and Markdown fences were not fully stripped.
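One way to combine several extraction candidates with `ast.parse()` validation is to accept the first candidate that both parses and defines the expected function at module level. This is a sketch under that assumption; `defines_function` and `pick_candidate` are hypothetical names, not the benchmark's actual code.

```python
import ast
from typing import Iterable, Optional


def defines_function(source: str, name: str) -> bool:
    """Return True if `source` parses and defines `name` at module level."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Leftover Markdown fences or truncated output typically end up here.
        return False
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == name
        for node in tree.body
    )


def pick_candidate(candidates: Iterable[str], entry_point: str) -> Optional[str]:
    """Return the first candidate that is valid Python defining `entry_point`."""
    for source in candidates:
        if defines_function(source, entry_point):
            return source
    return None
```

Checking for the function name, and not only for syntactic validity, rejects candidates that parse but do not contain the target function, such as a stray expression or a body-only fragment that happens to be valid on its own.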
+
+ ### V7: Regex-based extraction + larger model
+ - Regex `` ```\n(.*?)\n``` `` for code block extraction
+ - Larger model: Qwen 1.5B (vs 0.5B)
+ - 512 tokens
+ - Result: **0/20 pass** (from partial logs)
+ - Critical finding: the error changed to `TypeError: check() missing 1 required positional argument: 'candidate'`
+ - Root cause: **the evalplus tests ALREADY contain `check(candidate)` calls**, and we were appending `check()` without arguments.
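The `TypeError` above is what Python raises when `check` is called with no arguments. A harness can guard against both a missing and a duplicated call by inspecting the test code before appending anything. The sketch below is illustrative only: `calls_check` and `build_test_program` are hypothetical helpers, and it assumes the test code is valid Python that defines `check(candidate)`. It does not rely on any evalplus API.

```python
import ast


def calls_check(test_code: str) -> bool:
    """Return True if `test_code` already calls `check(...)` at module level."""
    tree = ast.parse(test_code)
    for node in tree.body:
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "check"
        ):
            return True
    return False


def build_test_program(solution: str, test_code: str, entry_point: str) -> str:
    """Join solution and tests, appending `check(<entry_point>)` only if needed."""
    program = solution.rstrip() + "\n\n" + test_code.rstrip() + "\n"
    if not calls_check(test_code):
        # Pass the function under test explicitly; a bare `check()` raises
        # the TypeError seen in V7.
        program += f"\ncheck({entry_point})\n"
    return program
```

Because the decision is made per test file, the same harness works whether or not the test code already ends with its own call.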
+
+ ### V8: THE FIX
+ - **Do NOT append `check()`**: the evalplus test code already has it
+ - Changed the system prompt to: "Write ONLY the function definition. No markdown, no extra text."
+ - Still using regex-based Markdown extraction
+ - Model: Qwen 1.5B
+ - **Status: Submitted on a10g-small, awaiting results**
+
+ ## Key Lessons
+
+ 1. **Always inspect the test file format.** evalplus has `check(candidate)` built in, unlike standard HumanEval.
+ 2. **Markdown extraction must be robust.** Models wrap code in fences such as `` \n```python\n...\n``` ``. Use a regex, not simple string splitting.
+ 3. **Model size matters.** 0.5B may not reliably generate valid Python; 1.5B+ is recommended.
+ 4. **Debug by printing the ACTUAL test file.** We should have done this in V1.
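For the second lesson, note that a pattern requiring a newline directly after the opening backticks will not match a fence that carries a language tag such as `python`. The sketch below shows one more tolerant pattern. It assumes an optional language tag, `.` matching newlines via `re.DOTALL`, and a possibly missing closing fence (for example when generation stops at the token limit). `FENCE_RE` and `extract_code` are hypothetical names.

```python
import re

# Opening fence, optional language tag, newline, then the body up to the
# closing fence or the end of the text if the fence was never closed.
FENCE_RE = re.compile(
    r"```[ \t]*([\w+-]*)[ \t]*\r?\n(.*?)(?:```|\Z)", re.DOTALL
)


def extract_code(text: str) -> str:
    """Return the code inside the first Python-tagged fence, else the first
    fence, else the text itself with surrounding blank lines removed."""
    blocks = FENCE_RE.findall(text)
    if not blocks:
        # No fences at all: keep indentation, drop blank lines at the edges.
        return text.strip("\n").rstrip() + "\n"
    for lang, body in blocks:
        if lang.lower() in ("python", "py", "python3"):
            return body.rstrip() + "\n"
    return blocks[0][1].rstrip() + "\n"
```

Leading indentation is deliberately preserved so that body-only completions remain usable after extraction.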