# Real LLM Benchmark Debugging Log
## Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B
We ran 8 versions of the real LLM code benchmark. Here's what we learned.
### V1–V3: Naive approach
- Prompt: HumanEval prompt + generate completion
- Extraction: take generated tokens after prompt
- Result: **0/20 pass**
- Reason: The model outputs a complete function (including `def <entry_point>`), but we prepended the HumanEval prompt, which also contains `def <entry_point>` → duplicate definitions → syntax error
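One way to avoid the duplicate definition is to prepend the prompt only when the completion does not already define the entry point. This is a minimal sketch, not the code we ran; `assemble_solution` and the toy `add` task are hypothetical names used for illustration.

```python
import re


def assemble_solution(prompt: str, completion: str, entry_point: str) -> str:
    """Return candidate source without a duplicated `def <entry_point>`."""
    # Does the completion define the target function itself?
    defines_entry = re.search(
        rf"^\s*def\s+{re.escape(entry_point)}\s*\(",
        completion,
        flags=re.MULTILINE,
    )
    if defines_entry:
        # Full function: use it as-is and do not prepend the prompt.
        return completion
    # Body only: the prompt supplies the signature and docstring.
    return prompt + completion


# Toy task (hypothetical, not a real HumanEval problem).
prompt = 'def add(a, b):\n    """Return a + b."""\n'
full = "def add(a, b):\n    return a + b\n"
body = "    return a + b\n"

full_solution = assemble_solution(prompt, full, "add")
body_solution = assemble_solution(prompt, body, "add")
```

One limitation: when the completion is used alone, any imports that appear only in the prompt (for example `from typing import List`) are dropped, so a fuller version would probably carry the prompt's import lines over.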
### V4: Chat template + body extraction attempt
- Added chat template (system/assistant)
- Attempted to extract just function body after `def` line
- Result: **0/20 pass**
- Reason: Markdown fences still present; AST parsing too strict; body extraction failed on docstrings
### V5–V6: Multiple extraction strategies + debug
- Added multiple candidates: body-only, stripped, raw
- Used `ast.parse()` validation
- Result: **0/20 pass**
- Reason: We were still prepending the prompt even when the model output contained the full function, and Markdown fences were not fully stripped.
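The candidate selection with `ast.parse()` can be sketched roughly as follows. This is an illustration under assumptions, not the V5–V6 code: `first_valid_candidate` is a hypothetical helper, and the extra check that the parsed module defines the entry point is an addition of this sketch, since `ast.parse()` alone also accepts an empty string.

```python
import ast
from typing import Iterable, Optional


def first_valid_candidate(candidates: Iterable[str], entry_point: str) -> Optional[str]:
    """Return the first candidate that parses and defines `entry_point`."""
    for source in candidates:
        try:
            tree = ast.parse(source)
        except (SyntaxError, ValueError):
            # SyntaxError covers IndentationError; ValueError covers
            # sources containing null bytes on some Python versions.
            continue
        # Syntactically valid is not enough: require a top-level def
        # with the expected name.
        if any(
            isinstance(node, ast.FunctionDef) and node.name == entry_point
            for node in tree.body
        ):
            return source
    return None


# A fenced candidate fails to parse; the stripped one is accepted.
fenced = "```python\ndef f():\n    return 1\n```"
clean = "def f():\n    return 1\n"
chosen = first_valid_candidate([fenced, clean], "f")
```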
### V7: Regex-based extraction + larger model
- Regex `` ```\n(.*?)\n``` `` for code block extraction
- Larger model: Qwen 1.5B (vs 0.5B)
- 512 tokens
- Result: **0/20 pass** (from partial logs)
- Critical finding: Error changed! `TypeError: check() missing 1 required positional argument: 'candidate'`
- Root cause: **evalplus tests ALREADY contain `check(candidate)` calls** — we were appending `check()` without args!
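For the regex-based extraction used in V7, a slightly more forgiving sketch is shown below. The patterns and the `extract_code` helper are assumptions for illustration rather than the exact regex from our run: they accept an optional language tag and fall back to an unclosed fence, which can happen when generation stops at the token limit.

```python
import re

# Closed fence with an optional language tag; DOTALL lets `.` span newlines.
CLOSED_FENCE = re.compile(r"```[a-zA-Z0-9]*[ \t]*\n(.*?)\n?```", re.DOTALL)
# Opening fence with no closing fence, e.g. output cut off mid-block.
OPEN_FENCE = re.compile(r"```[a-zA-Z0-9]*[ \t]*\n(.*)", re.DOTALL)


def extract_code(text: str) -> str:
    """Return the first fenced block, or the raw text if no fence is found."""
    match = CLOSED_FENCE.search(text) or OPEN_FENCE.search(text)
    code = match.group(1) if match else text
    # Trim blank lines at the ends but keep indentation on the first line.
    return code.strip("\n") + "\n"


reply = "Here is the function:\n```python\ndef f():\n    return 1\n```\nHope this helps."
truncated = "```python\ndef f():\n    return 1"
plain = "def f():\n    return 1\n"

extracted = [extract_code(reply), extract_code(truncated), extract_code(plain)]
```

All three inputs reduce to the same source text, so the downstream steps do not need to know whether the model used fences.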
### V8: THE FIX
- **Do NOT append `check()`** — evalplus test code already has it
- Changed system prompt to: "Write ONLY the function definition. No markdown, no extra text."
- Still using regex-based markdown extraction
- Model: Qwen 1.5B
- **Status: Submitted on a10g-small, awaiting results**
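Rather than hard-coding whether to append a `check(...)` call, the harness could inspect the test code and append the call only when none is present. The sketch below assumes the test code is a plain Python string that defines `check`; the helper names and the toy test strings are hypothetical and are not real evalplus data.

```python
import ast


def calls_check_at_top_level(test_code: str) -> bool:
    """True if the test code already invokes `check(...)` at module level."""
    tree = ast.parse(test_code)
    for node in tree.body:
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "check"
        ):
            return True
    return False


def build_program(solution: str, test_code: str, entry_point: str) -> str:
    """Join solution and tests, adding `check(<entry_point>)` only if missing."""
    parts = [solution, test_code]
    if not calls_check_at_top_level(test_code):
        # Always pass the function under test; a bare `check()` raises TypeError.
        parts.append(f"check({entry_point})")
    return "\n\n".join(parts) + "\n"


# Toy inputs (made up for illustration).
solution = "def add(a, b):\n    return a + b\n"
test_without_call = "def check(candidate):\n    assert candidate(1, 2) == 3\n"
test_with_call = test_without_call + "\ncheck(add)\n"

program_a = build_program(solution, test_without_call, "add")
program_b = build_program(solution, test_with_call, "add")
```

Both programs end up with exactly one `check(add)` call, so the same harness works whichever format the test field turns out to have.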
## Key Lessons
1. **Always inspect the test file format.** evalplus has `check(candidate)` built in — different from standard HumanEval.
2. **Markdown extraction must be robust.** Models output `` \n```python\n...\n``` ``. Use regex, not simple string splitting.
3. **Model size matters.** 0.5B may not reliably generate valid Python; 1.5B+ recommended.
4. **Debug by printing the ACTUAL test file.** We should have done this in V1.