# Stack 2.9 Evaluation Datasets

## Files

| File | Count | Description |
|------|-------|-------------|
| `humaneval_50.jsonl` | 50 | HumanEval subset with difficulty ratings |
| `mbpp_100.jsonl` | 100 | MBPP-style programming problems |
| `tool_scenarios_50.jsonl` | 50 | Multi-step tool calling scenarios |

## Format

### HumanEval
```json
{"task_id": "humaneval_1", "difficulty": "medium", "prompt": "def solution(x):\n", "test": "assert solution(5) == 5"}
```

### MBPP
```json
{"task_id": "mbpp_1", "difficulty": "easy", "prompt": "def task(arr):\n", "test": "assert task([1,2,3]) == 6"}
```

### Tool Scenarios
```json
{"task_id": "tool_scenario_1", "difficulty": "hard", "prompt": "Task: Read file and count errors", "tools_needed": ["FileRead", "Grep"], "expected_steps": 3}
```
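All three record types share `task_id`, `difficulty`, and `prompt`. A quick schema sanity check can be written against the examples above; the required-key sets are taken directly from those examples, but the checker itself (`check_record`) is an illustrative sketch, not part of this repo:

```python
import json

# Required keys per task family, derived from the Format examples above.
REQUIRED = {
    "humaneval": {"task_id", "difficulty", "prompt", "test"},
    "mbpp": {"task_id", "difficulty", "prompt", "test"},
    "tool_scenario": {"task_id", "difficulty", "prompt", "tools_needed", "expected_steps"},
}

def check_record(line: str) -> bool:
    """Return True if a JSONL line carries every key its task family requires."""
    rec = json.loads(line)
    # task_id looks like "<family>_<n>", e.g. "tool_scenario_1".
    family = rec["task_id"].rsplit("_", 1)[0]
    return REQUIRED[family] <= rec.keys()
```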

## Usage

```python
from evaluate_model import load_benchmark

benchmarks = load_benchmark("evaluation/humaneval_50.jsonl")
# Run evaluation...
```
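`load_benchmark` comes from this repo's `evaluate_model` module. If only the data files are at hand, the same JSONL can be read with the standard library alone; this is a minimal sketch, and `load_jsonl` is a hypothetical helper name, not a repo API:

```python
import json

def load_jsonl(path):
    """Read one JSON object per non-empty line and return them as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each returned dict then carries the fields shown in the Format section above.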