| # Real LLM Benchmark Debugging Log |
|
|
| ## Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B |
|
|
| We ran 8 versions of the real LLM code benchmark. Here's what we learned. |
|
|
| ### V1–V3: Naive approach |
| - Prompt: HumanEval prompt + generate completion |
| - Extraction: take generated tokens after prompt |
| - Result: **0/20 pass** |
| - Reason: Model outputs complete function (including `def <entry_point>`), but we prepended HumanEval prompt which also has `def <entry_point>` → duplicate definitions → syntax error |
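
One way to avoid the duplicate definition is to prepend the prompt only when the completion does not define the function itself. The sketch below is illustrative, not the code used in these runs; `assemble_program` and the sample `prompt`, `full`, and `body` strings are hypothetical.

```python
import re


def assemble_program(prompt: str, completion: str, entry_point: str) -> str:
    """Return the source to execute, prepending the prompt only when needed."""
    # Look for a top-level `def <entry_point>(` in the model output.
    defines_entry = re.search(
        rf"^def\s+{re.escape(entry_point)}\s*\(", completion, flags=re.MULTILINE
    )
    if defines_entry:
        # The model wrote the whole function: use it alone.
        return completion
    # Body-only continuation: the prompt supplies the signature and docstring.
    return prompt + completion


# Hypothetical HumanEval-style inputs for illustration.
prompt = 'def add(a, b):\n    """Return a + b."""\n'
full = "def add(a, b):\n    return a + b\n"   # model repeated the signature
body = "    return a + b\n"                   # model continued the body only
```

HumanEval prompts can also carry imports such as `from typing import List`, so a completion-only program may need those import lines copied over from the prompt.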
|
|
| ### V4: Chat template + body extraction attempt |
| - Added chat template (system/assistant) |
| - Attempted to extract just function body after `def` line |
| - Result: **0/20 pass** |
| - Reason: Markdown fences still present; AST parsing too strict; body extraction failed on docstrings |
|
|
| ### V5–V6: Multiple extraction strategies + debug |
| - Added multiple candidates: body-only, stripped, raw |
| - Used `ast.parse()` validation |
| - Result: **0/20 pass** |
| - Reason: Still prepending prompt even when model output contains full function. Markdown fences not fully stripped. |
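
The candidate-plus-`ast.parse()` idea can be sketched as below. This is an assumption about how such a filter might look, not the V5–V6 code; `first_valid_candidate` is a hypothetical helper. It also requires the parsed candidate to define the entry point at top level, which plain parsing alone would not check.

```python
import ast


def first_valid_candidate(candidates, entry_point):
    """Return the first candidate that parses and defines entry_point, else None."""
    for source in candidates:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Covers leftover markdown fences and bad indentation
            # (IndentationError is a subclass of SyntaxError).
            continue
        top_level_defs = {
            node.name for node in tree.body if isinstance(node, ast.FunctionDef)
        }
        if entry_point in top_level_defs:
            return source
    return None


# Hypothetical candidates: raw output with fences, body-only, and stripped.
raw = "```python\ndef add(a, b):\n    return a + b\n```"
body_only = "    return a + b\n"
stripped = "def add(a, b):\n    return a + b\n"
chosen = first_valid_candidate([raw, body_only, stripped], "add")
```

Here `raw` and `body_only` both fail to parse, so `chosen` is the stripped candidate.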
|
|
| ### V7: Regex-based extraction + larger model |
| - Regex `` ```\n(.*?)\n``` `` for code block extraction |
| - Larger model: Qwen 1.5B (vs 0.5B) |
| - 512 tokens |
| - Result: **0/20 pass** (from partial logs) |
| - Critical finding: Error changed! `TypeError: check() missing 1 required positional argument: 'candidate'` |
| - Root cause: **evalplus tests ALREADY contain `check(candidate)` calls**; we were appending `check()` without args! |
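
The error can be reproduced in isolation. The `test_code` string below is a hypothetical stand-in shaped like the log describes (it defines `check` and already calls it); it is not copied from evalplus, and `run` is an illustrative helper.

```python
solution = "def add(a, b):\n    return a + b\n"

# Hypothetical test snippet: defines check() and already invokes it.
test_code = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "\n"
    "check(add)\n"
)


def run(program: str):
    """Execute program in a fresh namespace; return 'ErrType: message' or None."""
    try:
        exec(program, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


v7_result = run(solution + test_code + "check()\n")  # V7: extra bare call appended
v8_result = run(solution + test_code)                # V8: nothing appended
```

With the extra bare call, `v7_result` holds a `TypeError` about the missing `candidate` argument. Without it, `v8_result` is `None`.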
|
|
| ### V8: THE FIX |
| - **Do NOT append `check()`**: evalplus test code already has it |
| - Changed system prompt to: "Write ONLY the function definition. No markdown, no extra text." |
| - Still using regex-based markdown extraction |
| - Model: Qwen 1.5B |
| - **Status: Submitted on a10g-small, awaiting results** |
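
A minimal sketch of a runner that follows the V8 rule: concatenate the extracted solution and the test code, and append nothing after it. `run_case`, the file name, and the timeout value are assumptions for illustration, not the harness used on a10g-small.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(solution: str, test_code: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run solution + test_code in a child interpreter; return (passed, stderr)."""
    # Nothing is appended after the test code: it already calls check().
    program = solution + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "case.py"
        path.write_text(program, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    # A non-zero exit code means an assertion or other exception fired.
    return proc.returncode == 0, proc.stderr
```

A child process keeps a crashing or looping completion from taking down the benchmark loop, and the captured stderr shows the actual exception for each failed case.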
|
|
| ## Key Lessons |
|
|
| 1. **Always inspect the test file format.** evalplus has `check(candidate)` built in, which differs from standard HumanEval. |
| 2. **Markdown extraction must be robust.** Models output `` \n```python\n...\n``` ``. Use regex, not simple string splitting. |
| 3. **Model size matters.** 0.5B may not reliably generate valid Python; 1.5B+ recommended. |
| 4. **Debug by printing the ACTUAL test file.** We should have done this in V1. |
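
For lesson 2, note that a pattern requiring a newline directly after the opening backticks will not match a fence that carries a language tag such as `python`. The sketch below accepts an optional tag and also tolerates a fence left unclosed when generation is cut off. `FENCE_RE` and `extract_code` are illustrative names, not the benchmark's actual code.

```python
import re

# Opening fence with an optional language tag, then the code, then either a
# closing fence or end of text (for output truncated by the token limit).
FENCE_RE = re.compile(r"```[^\n`]*\n(.*?)(?:```|\Z)", re.DOTALL)


def extract_code(text: str) -> str:
    """Return the first fenced block's contents, or the text itself if unfenced."""
    match = FENCE_RE.search(text)
    code = match.group(1) if match else text
    # Trim blank lines at both ends but keep the first line's indentation.
    return code.strip("\n") + "\n"


tagged = "Here is the code:\n```python\ndef add(a, b):\n    return a + b\n```\nDone."
bare = "```\ndef add(a, b):\n    return a + b\n```"
unclosed = "```python\ndef add(a, b):\n    return a + b"
plain = "def add(a, b):\n    return a + b"
```

All four sample inputs reduce to the same two-line function, so the result can go straight to `ast.parse()` or the test runner.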
|
|