Upload reports/real_llm_debug_log.md
reports/real_llm_debug_log.md
ADDED
@@ -0,0 +1,45 @@
# Real LLM Benchmark Debugging Log

## Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B

We ran 8 versions of the real LLM code benchmark. Here's what we learned.

### V1–V3: Naive approach
- Prompt: HumanEval prompt + generate completion
- Extraction: take generated tokens after prompt
- Result: **0/20 pass**
- Reason: The model outputs a complete function (including `def <entry_point>`), but we prepended the HumanEval prompt, which also contains `def <entry_point>` → duplicate definitions → syntax error

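
The duplicate-definition problem above can be avoided by checking whether the completion already defines the entry point before prepending the prompt. The following is a minimal sketch under that assumption; `build_solution` and its parameters are hypothetical names for illustration, not code from the benchmark.

```python
import re


def build_solution(prompt: str, completion: str, entry_point: str) -> str:
    """Return candidate source code for one task.

    If the completion already defines `entry_point`, use it as-is.
    Otherwise treat it as a function body and prepend the prompt.
    """
    # Hypothetical helper: look for "def <entry_point>(" at the start of a line.
    defines_entry_point = re.search(
        rf"^\s*def\s+{re.escape(entry_point)}\s*\(", completion, flags=re.MULTILINE
    )
    if defines_entry_point:
        return completion
    return prompt + completion
```

One trade-off of this sketch: when the completion is used as-is, any imports that appear only in the prompt are dropped, so they may need to be carried over separately.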
### V4: Chat template + body extraction attempt
- Added chat template (system/assistant)
- Attempted to extract just the function body after the `def` line
- Result: **0/20 pass**
- Reason: Markdown fences still present; AST parsing too strict; body extraction failed on docstrings

### V5–V6: Multiple extraction strategies + debug
- Added multiple candidates: body-only, stripped, raw
- Used `ast.parse()` validation
- Result: **0/20 pass**
- Reason: Still prepending the prompt even when the model output contains the full function. Markdown fences not fully stripped.

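
For reference, the "multiple candidates + `ast.parse()`" idea can be sketched as follows. This is an illustrative version, not the V5–V6 code; `first_parsable` is a hypothetical name.

```python
import ast
from typing import Iterable, Optional


def first_parsable(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that parses as Python, or None if none do."""
    for source in candidates:
        try:
            ast.parse(source)
        except SyntaxError:
            # Leftover Markdown fences or a broken body end up here.
            continue
        return source
    return None
```

Note that `ast.parse()` only checks syntax, so a candidate can pass this filter and still fail the tests.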
### V7: Regex-based extraction + larger model
- Regex `` ```\n(.*?)\n``` `` for code block extraction
- Larger model: Qwen 1.5B (vs 0.5B)
- 512 tokens
- Result: **0/20 pass** (from partial logs)
- Critical finding: The error changed to `TypeError: check() missing 1 required positional argument: 'candidate'`
- Root cause: **evalplus tests ALREADY contain `check(candidate)` calls**; we were appending `check()` without args!

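
A minimal sketch of regex-based fence extraction is shown below. The pattern is an assumption, not the one used in V7: it additionally allows a language tag such as `python` after the opening fence, and it relies on `re.DOTALL` so that `.` can span multiple lines. `extract_code` is a hypothetical helper name.

```python
import re

# Built this way so the example itself contains no literal fence.
FENCE = "`" * 3

# Assumed pattern: opening fence, optional language tag, newline,
# lazily captured contents, closing fence.
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)


def extract_code(text: str) -> str:
    """Return the contents of the first fenced block, or the text unchanged."""
    match = FENCE_RE.search(text)
    if match:
        return match.group(1)
    return text
```

Returning the text unchanged when no fence is found keeps the indentation of a body-only completion intact.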
### V8: THE FIX
- **Do NOT append `check()`**: the evalplus test code already has it
- Changed system prompt to: "Write ONLY the function definition. No markdown, no extra text."
- Still using regex-based markdown extraction
- Model: Qwen 1.5B
- **Status: Submitted on a10g-small, awaiting results**

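
One defensive way to assemble the test program is to add a `check(...)` call only when the test code does not already make one at module level. The sketch below is an assumption about how that could look, not the V8 implementation; both function names are hypothetical.

```python
import ast


def calls_check_at_top_level(test_code: str) -> bool:
    """True if the test source has a module-level call to a function named `check`."""
    for node in ast.parse(test_code).body:
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "check"
        ):
            return True
    return False


def build_test_program(solution: str, test_code: str, entry_point: str) -> str:
    """Join solution and tests, adding `check(<entry_point>)` only if it is missing."""
    program = solution.rstrip() + "\n\n" + test_code.rstrip() + "\n"
    if not calls_check_at_top_level(test_code):
        # Always pass the candidate function; a bare check() raises the V7 TypeError.
        program += f"\ncheck({entry_point})\n"
    return program
```

With this shape, the same harness works whether or not the test field ends with its own call.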
## Key Lessons

1. **Always inspect the test file format.** evalplus has `check(candidate)` built in, which differs from standard HumanEval.
2. **Markdown extraction must be robust.** Models output `` \n```python\n...\n``` ``. Use regex, not simple string splitting.
3. **Model size matters.** 0.5B may not reliably generate valid Python; 1.5B+ recommended.
4. **Debug by printing the ACTUAL test file.** We should have done this in V1.