
Real LLM Benchmark Debugging Log

Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B

We ran 8 versions of the real LLM code benchmark. Here's what we learned.

V1–V3: Naive approach

  • Prompt: HumanEval prompt + generate completion
  • Extraction: take generated tokens after prompt
  • Result: 0/20 pass
  • Reason: The model outputs a complete function (including def <entry_point>), but we prepended the HumanEval prompt, which also contains def <entry_point> → duplicate definitions → syntax error
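
One way to avoid the duplicate definition is to prepend the prompt only when the completion does not already define the entry point. The sketch below is a minimal illustration, not the code used in these runs; `build_program` and its parameters are hypothetical names.

```python
import re


def build_program(prompt: str, completion: str, entry_point: str) -> str:
    """Return the source to execute without defining entry_point twice."""
    # Look for a line that starts (after optional indentation) with
    # `def <entry_point>(` anywhere in the completion.
    defines_entry = re.search(
        rf"^\s*def\s+{re.escape(entry_point)}\s*\(",
        completion,
        flags=re.MULTILINE,
    )
    if defines_entry:
        # The model wrote the whole function: use it as-is.
        return completion
    # Body-only completion: the prompt supplies the signature and docstring.
    return prompt + completion
```

A caveat of this simplification: when the completion is used on its own, any imports that appear only in the prompt (for example `from typing import List`) are dropped and may need to be carried over separately.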

V4: Chat template + body extraction attempt

  • Added chat template (system/assistant)
  • Attempted to extract just function body after def line
  • Result: 0/20 pass
  • Reason: Markdown fences still present; AST parsing too strict; body extraction failed on docstrings

V5–V6: Multiple extraction strategies + debug

  • Added multiple candidates: body-only, stripped, raw
  • Used ast.parse() validation
  • Result: 0/20 pass
  • Reason: Still prepending prompt even when model output contains full function. Markdown fences not fully stripped.
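
The candidate-plus-validation idea can be sketched as follows: try each candidate in order and keep the first one that both parses and defines the entry point at module level. This is an assumed reconstruction of the approach, with hypothetical helper names, not the benchmark's actual code.

```python
import ast
from typing import Iterable, Optional


def defines_function(source: str, name: str) -> bool:
    """True if source parses and has a top-level `def name`."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        # Leftover markdown fences or truncated code end up here.
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == name
        for node in tree.body
    )


def first_valid_candidate(
    candidates: Iterable[str], entry_point: str
) -> Optional[str]:
    """Return the first candidate that defines entry_point, else None."""
    for candidate in candidates:
        if defines_function(candidate, entry_point):
            return candidate
    return None
```

Checking for the function name as well as syntactic validity guards against accepting a candidate that parses but never defines the function under test.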

V7: Regex-based extraction + larger model

  • Regex ```python\n(.*?)\n``` for code block extraction
  • Larger model: Qwen 1.5B (vs 0.5B)
  • 512 tokens
  • Result: 0/20 pass (from partial logs)
  • Critical finding: Error changed! TypeError: check() missing 1 required positional argument: 'candidate'
  • Root cause: evalplus tests ALREADY contain check(candidate) calls — we were appending check() without args!
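
The TypeError above is what Python raises whenever `check` is called with no arguments. The snippet below reproduces it with a made-up solution and test (both hypothetical, not taken from the dataset) and shows that passing the function under test makes the same program succeed.

```python
# Hypothetical test code that defines check(candidate).
TEST_SRC = '''
def check(candidate):
    assert candidate(2, 3) == 5
'''

# Hypothetical model output.
SOLUTION_SRC = '''
def add(a, b):
    return a + b
'''


def run(program: str) -> str:
    """Execute program text; return 'ok' or the exception as a string."""
    try:
        exec(program, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return "ok"


# Calling check() with no arguments triggers the TypeError.
bad = run(SOLUTION_SRC + TEST_SRC + "\ncheck()\n")
# Passing the function under test lets the assertions run.
good = run(SOLUTION_SRC + TEST_SRC + "\ncheck(add)\n")
print(bad)
print(good)
```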

V8: THE FIX

  • Do NOT append check() — evalplus test code already has it
  • Changed system prompt to: "Write ONLY the function definition. No markdown, no extra text."
  • Still using regex-based markdown extraction
  • Model: Qwen 1.5B
  • Status: Submitted on a10g-small, awaiting results
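
Rather than hard-coding whether to append a call, a harness could inspect the test source and add `check(<entry_point>)` only when no top-level call to `check` is already present. The following is a sketch under that assumption; the function names are hypothetical and the V8 script may be structured differently.

```python
import ast


def calls_check_at_top_level(test_src: str) -> bool:
    """True if the test source contains a module-level `check(...)` call."""
    tree = ast.parse(test_src)
    for node in tree.body:
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "check"
        ):
            return True
    return False


def assemble(solution_src: str, test_src: str, entry_point: str) -> str:
    """Join solution and tests, appending a check call only if it is missing."""
    program = solution_src + "\n\n" + test_src
    if not calls_check_at_top_level(test_src):
        program += f"\n\ncheck({entry_point})\n"
    return program
```

This keeps the same harness usable with test files in either shape, with or without a built-in call.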

Key Lessons

  1. Always inspect the test file format. evalplus has check(candidate) built in — different from standard HumanEval.
  2. Markdown extraction must be robust. Models output ```python\n...\n```. Use regex, not simple string splitting.
  3. Model size matters. 0.5B may not reliably generate valid Python; 1.5B+ recommended.
  4. Debug by printing the ACTUAL test file. We should have done this in V1.
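
For lesson 2, a fence extractor that tolerates a missing language tag, surrounding prose, and an unclosed fence might look like the sketch below. The patterns and the `extract_code` name are illustrative assumptions, not the regex used in V7 or V8.

```python
import re

# A complete fenced block: opening fence with optional language tag,
# the code, then a closing fence.
FENCE_RE = re.compile(r"```[\w+-]*[ \t]*\r?\n(.*?)\r?\n?```", re.DOTALL)
# An opening fence only, for output cut off by the token limit.
OPEN_RE = re.compile(r"```[\w+-]*[ \t]*\r?\n")


def extract_code(text: str) -> str:
    """Return the code inside markdown fences, or the text itself if unfenced."""
    blocks = FENCE_RE.findall(text)
    if blocks:
        # With several blocks, assume the longest one holds the solution.
        return max(blocks, key=len).strip("\n")
    opening = OPEN_RE.search(text)
    if opening:
        # Unclosed fence: keep everything after the opening line.
        return text[opening.end():].rstrip()
    # No fences at all: keep indentation, drop surrounding blank lines.
    return text.strip("\n")
```

Choosing the longest block is a heuristic; a stricter variant could prefer the block that defines the entry point.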