# Stack 2.9 Evaluation Datasets

## Files

| File | Count | Description |
|------|-------|-------------|
| `humaneval_50.jsonl` | 50 | HumanEval subset with difficulty ratings |
| `mbpp_100.jsonl` | 100 | MBPP-style programming problems |
| `tool_scenarios_50.jsonl` | 50 | Multi-step tool-calling scenarios |

## Format

Each file is JSON Lines: one JSON object per line.

### HumanEval

```json
{"task_id": "humaneval_1", "difficulty": "medium", "prompt": "def solution(x):\n", "test": "assert solution(5) == 5"}
```

### MBPP

```json
{"task_id": "mbpp_1", "difficulty": "easy", "prompt": "def task(arr):\n", "test": "assert task([1,2,3]) == 6"}
```

### Tool Scenarios

```json
{"task_id": "tool_scenario_1", "difficulty": "hard", "prompt": "Task: Read file and count errors", "tools_needed": ["FileRead", "Grep"], "expected_steps": 3}
```

## Usage

```python
from evaluate_model import load_benchmark

benchmarks = load_benchmark("evaluation/humaneval_50.jsonl")
# Run evaluation...
```
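If you need to read the files without `evaluate_model`, a minimal sketch is below. It assumes only what this README documents: one JSON object per line, with the fields shown in the Format section. `read_jsonl` and `validate` are hypothetical helpers, not part of `evaluate_model`.

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one dict per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Required fields per task family, taken from the Format section above.
REQUIRED = {
    "humaneval": {"task_id", "difficulty", "prompt", "test"},
    "mbpp": {"task_id", "difficulty", "prompt", "test"},
    "tool_scenario": {"task_id", "difficulty", "prompt", "tools_needed", "expected_steps"},
}

def validate(record):
    """Check a record against the documented schema for its task family."""
    family = record["task_id"].rsplit("_", 1)[0]  # "humaneval_1" -> "humaneval"
    missing = REQUIRED[family] - record.keys()
    if missing:
        raise ValueError(f"{record['task_id']}: missing fields {sorted(missing)}")

if __name__ == "__main__":
    for rec in read_jsonl("evaluation/humaneval_50.jsonl"):
        validate(rec)
```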
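For the two code benchmarks, the `prompt` is a function stub and the `test` is an assertion against that function. One way to score a single record, sketched below under the assumption that the model's completion is appended directly to the stub; `model_complete` is a hypothetical stand-in for whatever model call your harness makes:

```python
def run_code_task(record, model_complete):
    """Return True if the model's completion passes the record's test.

    Sketch only: appends the completion to the prompt stub and runs the
    documented `test` assertion. `model_complete` is a hypothetical
    callable that maps a prompt string to a completion string.
    """
    program = record["prompt"] + model_complete(record["prompt"])
    namespace = {}
    try:
        exec(program, namespace)          # defines solution()/task()
        exec(record["test"], namespace)   # runs the assertion
        return True
    except Exception:
        return False
```

A pass rate over a file is then just the fraction of records for which `run_code_task` returns `True`.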
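Tool scenarios carry `tools_needed` and `expected_steps` instead of a `test`. A hedged grading sketch follows; it assumes the agent trace is available as a list of tool names in call order, and treats `expected_steps` as a step budget (an assumption; it may instead be an exact count). `run_agent` is a hypothetical stand-in for the actual agent harness.

```python
def score_tool_scenario(record, run_agent):
    """Score tool coverage and step budget for one scenario (sketch)."""
    trace = run_agent(record["prompt"])  # e.g. ["FileRead", "Grep", "FileRead"]
    used_all_tools = set(record["tools_needed"]) <= set(trace)
    within_budget = len(trace) <= record["expected_steps"]  # assumed: budget, not exact count
    return used_all_tools and within_budget
```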