	# Real LLM Benchmark Debugging Log

	## Problem: 0% Pass Rate on HumanEval with Qwen 0.5B/1.5B

	We ran 8 versions of the real LLM code benchmark. Here's what we learned.

	### V1–V3: Naive approach
	- Prompt: HumanEval prompt + generate completion
	- Extraction: take generated tokens after prompt
	- Result: 0/20 pass
	- Reason: The model outputs a complete function (including `def <entry_point>`), but we prepended the HumanEval prompt, which also contains `def <entry_point>` → duplicate definitions → syntax error
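
One way to avoid the duplicate definition is to prepend the prompt only when the completion does not already define the entry point. The sketch below is illustrative, not the code from these runs: `build_solution` and the sample `add` task are hypothetical names. It also assumes that dropping the prompt is safe, which fails if the prompt carries imports the solution needs.

```python
import re


def build_solution(prompt: str, completion: str, entry_point: str) -> str:
    """Return candidate source, prepending the prompt only when needed."""
    # Look for a line that starts a definition of the entry point.
    defines_entry = re.search(
        rf"^\s*def\s+{re.escape(entry_point)}\s*\(",
        completion,
        flags=re.MULTILINE,
    )
    if defines_entry:
        # Completion is a full function: use it on its own.
        return completion
    # Completion is a body continuation: it needs the prompt's signature.
    return prompt + completion


# Hypothetical task used only for illustration.
prompt = 'def add(a, b):\n    """Return a + b."""\n'
full_function = "def add(a, b):\n    return a + b\n"
body_only = "    return a + b\n"

print(build_solution(prompt, full_function, "add"))
print(build_solution(prompt, body_only, "add"))
```

If prompts contain imports (for example `from typing import List`), those lines would need to be carried over separately when the prompt is dropped.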

	### V4: Chat template + body extraction attempt
	- Added chat template (system/assistant)
	- Attempted to extract just function body after `def` line
	- Result: 0/20 pass
	- Reason: Markdown fences still present; AST parsing too strict; body extraction failed on docstrings

	### V5–V6: Multiple extraction strategies + debug
	- Added multiple candidates: body-only, stripped, raw
	- Used `ast.parse()` validation
	- Result: 0/20 pass
	- Reason: Still prepending prompt even when model output contains full function. Markdown fences not fully stripped.
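
The candidate-plus-validation idea from V5–V6 can be sketched as follows. This is a minimal illustration under assumptions, not the benchmark's code: `first_parsable` and the sample `inc` task are hypothetical. Note that `ast.parse()` only checks syntax; it says nothing about whether the entry point is defined or correct.

```python
import ast
from typing import Iterable, Optional


def first_parsable(candidates: Iterable[str]) -> Optional[str]:
    """Return the first non-empty candidate that parses as Python, else None."""
    for source in candidates:
        if not source.strip():
            # An empty string parses fine, so skip it explicitly.
            continue
        try:
            ast.parse(source)
        except (SyntaxError, ValueError):
            # SyntaxError covers IndentationError; ValueError covers null bytes.
            continue
        return source
    return None


# Hypothetical model output wrapped in a markdown fence.
prompt = 'def inc(x):\n    """Return x + 1."""\n'
raw = "```python\ndef inc(x):\n    return x + 1\n```"
stripped = raw.replace("```python", "").replace("```", "").strip()

# Order matters: the first candidate that parses wins.
chosen = first_parsable([prompt + raw, raw, stripped])
print(chosen)
```

Here the two candidates that still contain fences fail to parse, so only the stripped version is selected.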

	### V7: Regex-based extraction + larger model
	- Regex ```` ```\n(.*?)\n``` ```` for code block extraction
	- Larger model: Qwen 1.5B (vs 0.5B)
	- 512 tokens
	- Result: 0/20 pass (from partial logs)
	- Critical finding: Error changed! `TypeError: check() missing 1 required positional argument: 'candidate'`
	- Root cause: evalplus tests ALREADY contain `check(candidate)` calls — we were appending `check()` without args!
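
The V7 error can be reproduced in isolation. In the sketch below, `test_code` is a hypothetical stand-in for a test that already calls `check(...)`, not the actual evalplus format; `run` is an illustrative helper.

```python
def run(program: str) -> str:
    """Execute a program string and report the outcome as text."""
    namespace = {}
    try:
        exec(program, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return "passed"


solution = "def add(a, b):\n    return a + b\n"

# Hypothetical test code that already invokes check() itself.
test_code = (
    "def check(candidate):\n"
    "    assert candidate(1, 2) == 3\n"
    "\n"
    "check(add)\n"
)

# Test code alone is sufficient.
print(run(solution + "\n" + test_code))
# prints: passed

# Appending a bare check() reproduces the V7 error.
print(run(solution + "\n" + test_code + "\ncheck()\n"))
# prints: TypeError: check() missing 1 required positional argument: 'candidate'
```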

	### V8: THE FIX
	- Do NOT append `check()` — evalplus test code already has it
	- Changed system prompt to: "Write ONLY the function definition. No markdown, no extra text."
	- Still using regex-based markdown extraction
	- Model: Qwen 1.5B
	- Status: Submitted on a10g-small, awaiting results
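
Rather than hard-coding whether to append a call, the harness could inspect the test code and add `check(<entry_point>)` only when no top-level call exists. This guard goes beyond what V8 does (V8 simply stops appending) and is a sketch under assumptions: `has_toplevel_check_call`, `assemble`, and the sample tests are hypothetical.

```python
import ast


def has_toplevel_check_call(test_code: str) -> bool:
    """True if the test code calls check(...) as a top-level statement."""
    for node in ast.parse(test_code).body:
        if (
            isinstance(node, ast.Expr)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id == "check"
        ):
            return True
    return False


def assemble(solution: str, test_code: str, entry_point: str) -> str:
    """Join solution and tests, adding a check call only if it is missing."""
    program = solution.rstrip() + "\n\n" + test_code.rstrip() + "\n"
    if not has_toplevel_check_call(test_code):
        program += f"\ncheck({entry_point})\n"
    return program


solution = "def add(a, b):\n    return a + b\n"
# Hypothetical test formats: one with the call built in, one without.
test_with_call = "def check(candidate):\n    assert candidate(1, 2) == 3\n\ncheck(add)\n"
test_without_call = "def check(candidate):\n    assert candidate(1, 2) == 3\n"

print(assemble(solution, test_with_call, "add"))
print(assemble(solution, test_without_call, "add"))
```

Both assembled programs end up with exactly one `check(add)` call, so the same harness would work for either test format.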

	## Key Lessons

	1. Always inspect the test file format. evalplus has `check(candidate)` built in — different from standard HumanEval.
	2. Markdown extraction must be robust. Models wrap code in fences such as ```` ```python\n...\n``` ````. Use regex, not simple string splitting.
	3. Model size matters. 0.5B may not reliably generate valid Python; 1.5B+ recommended.
	4. Debug by printing the ACTUAL test file. We should have done this in V1.
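
For lesson 2, a more tolerant extractor might look like the sketch below. It is an assumption-laden illustration, not the regex used in V7/V8: `FENCE_RE` and `extract_code` are hypothetical names. The pattern accepts an optional language tag after the opening fence and tolerates a missing closing fence, which can happen when generation is cut off at the token limit.

```python
import re

# Opening fence, optional language tag, newline, then lazily captured code
# up to the closing fence or the end of the text.
FENCE_RE = re.compile(
    r"```[ \t]*[A-Za-z0-9_+-]*[ \t]*\r?\n(.*?)(?:```|\Z)",
    re.DOTALL,
)


def extract_code(text: str) -> str:
    """Return the first fenced block's contents, or the text itself if unfenced."""
    match = FENCE_RE.search(text)
    code = match.group(1) if match else text
    # Trim only newlines so that leading indentation is preserved.
    return code.strip("\r\n")


fenced = "Here you go:\n```python\ndef f():\n    return 1\n```\nDone."
bare_fence = "```\ndef f():\n    return 1\n```"
truncated = "```python\ndef f():\n    return 1"
unfenced = "def f():\n    return 1\n"

for sample in (fenced, bare_fence, truncated, unfenced):
    print(repr(extract_code(sample)))
```

All four samples reduce to the same function source, which can then be passed to syntax validation before running the tests.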