Real LLM Benchmark Debugging Log
Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B
We ran 8 versions of the real LLM code benchmark. Here's what we learned.
V1–V3: Naive approach
- Prompt: HumanEval prompt + generate completion
- Extraction: take generated tokens after prompt
- Result: 0/20 pass
- Reason: Model outputs complete function (including `def <entry_point>`), but we prepended the HumanEval prompt, which also has `def <entry_point>` → duplicate definitions → syntax error
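One way to avoid the duplicate definition is to check whether the completion already defines the entry point before prepending the prompt. The following is a minimal sketch, not the benchmark's actual code: `build_solution` is a hypothetical helper, and `prompt`, `completion`, and `entry_point` are assumed to be plain strings.

```python
import re


def build_solution(prompt: str, completion: str, entry_point: str) -> str:
    """Return the candidate solution source (hypothetical helper)."""
    # Does the completion define the target function itself?
    defines_entry = re.search(
        rf"^\s*def\s+{re.escape(entry_point)}\s*\(",
        completion,
        flags=re.MULTILINE,
    )
    if defines_entry:
        # Full function: use it as-is and do not prepend the prompt.
        return completion
    # Body only: the prompt supplies the signature and docstring.
    return prompt + completion


# Made-up example strings, standing in for a HumanEval prompt and two outputs.
prompt = 'def add(a, b):\n    """Return a + b."""\n'
full = "def add(a, b):\n    return a + b\n"
body = "    return a + b\n"

print(build_solution(prompt, full, "add") == full)           # True
print(build_solution(prompt, body, "add") == prompt + body)  # True
```

A completion that relies on imports declared in the prompt (for example `from typing import List`) would still need those lines copied over when the prompt is dropped.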
V4: Chat template + body extraction attempt
- Added chat template (system/assistant)
- Attempted to extract just the function body after the `def` line
- Result: 0/20 pass
- Reason: Markdown fences still present; AST parsing too strict; body extraction failed on docstrings
V5–V6: Multiple extraction strategies + debug
- Added multiple candidates: body-only, stripped, raw
- Used `ast.parse()` validation
- Result: 0/20 pass
- Reason: Still prepending prompt even when model output contains full function. Markdown fences not fully stripped.
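The candidate-selection idea can be sketched as follows. This is an illustration under assumptions rather than the code from V5–V6: `first_valid_candidate` is a hypothetical name, and the check goes one step beyond plain `ast.parse()` by also requiring that the candidate defines the entry point at the top level.

```python
import ast
from typing import Iterable, Optional


def defines_function(source: str, name: str) -> bool:
    """True if `source` parses and has a top-level `def name(...)`."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in tree.body
    )


def first_valid_candidate(
    candidates: Iterable[str], entry_point: str
) -> Optional[str]:
    """Return the first candidate that defines `entry_point`, else None."""
    for source in candidates:
        if defines_function(source, entry_point):
            return source
    return None


# Made-up candidates: the raw output still carries Markdown fences.
raw = "```python\ndef f(x):\n    return x + 1\n```"
stripped = "def f(x):\n    return x + 1\n"

print(first_valid_candidate([raw, stripped], "f") == stripped)  # True
```

Ordering the candidates from least to most processed means the least-modified version that parses is the one that gets used.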
V7: Regex-based extraction + larger model
- Regex `\n(.*?)\n` (matched between the Markdown code fences) for code block extraction
- Larger model: Qwen 1.5B (vs 0.5B)
- 512 tokens
- Result: 0/20 pass (from partial logs)
- Critical finding: Error changed!
  `TypeError: check() missing 1 required positional argument: 'candidate'`
- Root cause: evalplus tests ALREADY contain `check(candidate)` calls; we were appending `check()` without args!
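The error is easy to reproduce in isolation. In the sketch below, `test_code` is an invented stand-in for an evalplus-style test (it is not copied from the real dataset) that defines `check` and already calls it with the candidate. Appending a bare `check()` then fails.

```python
# Invented stand-in for an evalplus-style test: defines check() and calls it.
test_code = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "\n"
    "check(add)\n"
)
solution = "def add(a, b):\n    return a + b\n"

# The mistake: appending a second, argument-less call.
program = solution + "\n" + test_code + "\ncheck()\n"

try:
    exec(program, {})
except TypeError as err:
    print(err)
```

Because the built-in `check(add)` call runs first and passes, the failure comes only from the appended line, which makes a correct solution look like a failed one.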
V8: THE FIX
- Do NOT append `check()`: evalplus test code already has it
- Changed system prompt to: "Write ONLY the function definition. No markdown, no extra text."
- Still using regex-based markdown extraction
- Model: Qwen 1.5B
- Status: Submitted on a10g-small, awaiting results
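For reference, regex-based fence extraction along these lines might look like the sketch below. The pattern is an assumption, not the exact regex used in V7/V8: it accepts an optional language tag after the opening fence and also tolerates a missing closing fence, which can happen when generation stops at the token limit.

```python
import re

FENCE = "`" * 3  # three backticks

# Assumed pattern: opening fence, optional language tag, then the shortest
# run of text up to the closing fence or the end of the string.
FENCE_RE = re.compile(
    FENCE + r"[\w+-]*[ \t]*\n(.*?)(?:" + FENCE + r"|\Z)",
    re.DOTALL,
)


def extract_code(text: str) -> str:
    """Return the first fenced block, or the text itself if unfenced."""
    match = FENCE_RE.search(text)
    code = match.group(1) if match else text
    return code.strip("\n") + "\n"


# Made-up model replies.
fenced = "Here you go:\n" + FENCE + "python\ndef f(x):\n    return x + 1\n" + FENCE
truncated = FENCE + "python\ndef f(x):\n    return x"
plain = "def f(x):\n    return x + 1\n"

print(extract_code(fenced) == plain)                       # True
print(extract_code(truncated) == "def f(x):\n    return x\n")  # True
print(extract_code(plain) == plain)                        # True
```

Only newlines are stripped from the ends, so a body-only completion keeps its leading indentation.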
Key Lessons
- Always inspect the test file format. evalplus has `check(candidate)` built in, which differs from standard HumanEval.
- Markdown extraction must be robust. Models output `` ```python\n...\n``` ``. Use regex, not simple string splitting.
- Model size matters. 0.5B may not reliably generate valid Python; 1.5B+ recommended.
- Debug by printing the ACTUAL test file. We should have done this in V1.
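Inspecting the test format can also be automated. The sketch below uses `ast` to detect whether a test string already invokes `check(...)` at the top level, and appends the call only when it is missing. The helper names and the sample test strings are hypothetical and not taken from evalplus or HumanEval.

```python
import ast


def calls_check_at_top_level(test_source: str, name: str = "check") -> bool:
    """True if the test module itself invokes `name(...)` at top level."""
    tree = ast.parse(test_source)
    for node in tree.body:
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == name
        ):
            return True
    return False


def build_program(solution: str, test_source: str, entry_point: str) -> str:
    """Join solution and tests, adding the check call only if needed."""
    program = solution + "\n\n" + test_source
    if not calls_check_at_top_level(test_source):
        program += f"\n\ncheck({entry_point})\n"
    return program


# Invented test strings: one already calls check, one only defines it.
solution = "def inc(x):\n    return x + 1\n"
define_only = "def check(candidate):\n    assert candidate(1) == 2\n"
with_call = define_only + "\ncheck(inc)\n"

print(calls_check_at_top_level(with_call))    # True
print(calls_check_at_top_level(define_only))  # False
```

Printing `build_program(...)` for the first task before running the full benchmark shows exactly what gets executed.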