
Stack 2.9 Evaluation Datasets

Files

File                     Count   Description
humaneval_50.jsonl       50      HumanEval subset with difficulty ratings
mbpp_100.jsonl           100     MBPP-style programming problems
tool_scenarios_50.jsonl  50      Multi-step tool calling scenarios

Format

HumanEval

{"task_id": "humaneval_1", "difficulty": "medium", "prompt": "def solution(x):\n", "test": "assert solution(5) == 5"}

MBPP

{"task_id": "mbpp_1", "difficulty": "easy", "prompt": "def task(arr):\n", "test": "assert task([1,2,3]) == 6"}

Tool Scenarios

{"task_id": "tool_scenario_1", "difficulty": "hard", "prompt": "Task: Read file and count errors", "tools_needed": ["FileRead", "Grep"], "expected_steps": 3}
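All three files use standard JSONL (one JSON object per line), so a plain loader suffices. This is a minimal sketch of such a loader; `load_jsonl` is an illustrative helper, not necessarily how the repo's `load_benchmark` is implemented:

```python
import json

def load_jsonl(path):
    """Load a JSONL benchmark file into a list of task dicts."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                tasks.append(json.loads(line))
    return tasks
```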

Usage

from evaluate_model import load_benchmark

benchmarks = load_benchmark("evaluation/humaneval_50.jsonl")
# Run evaluation...
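For the code benchmarks, scoring a single record amounts to concatenating the `prompt`, a model completion, and the `test` string, then executing the result. A hedged sketch under that assumption; `run_task` is a hypothetical helper and is not part of `evaluate_model`:

```python
def run_task(task, completion):
    """Execute a candidate completion against a task's test assertion.

    Assumes the HumanEval/MBPP-style fields shown above ('prompt', 'test').
    Returns True if the assembled program runs without error.
    Note: exec'ing untrusted model output should be sandboxed in practice.
    """
    program = task["prompt"] + completion + "\n" + task["test"]
    namespace = {}
    try:
        exec(program, namespace)
        return True
    except Exception:
        return False
```

Tool scenarios are scored differently (e.g. comparing the tools invoked against `tools_needed` and the step count against `expected_steps`), since they have no executable `test` field.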