# LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

## Overview

The evaluation framework supports multiple benchmark datasets across different domains:

- **Math**: AIME24, AIME25 (evaluation scripts provided)
- **Coding**: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- **Multiple Choice**: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
- **Instruction Following**: IFEval, IFBench (refer to official evaluation toolkits)
- **General Helpfulness**: Arena-Hard (refer to official evaluation toolkit)

## Installation

Install required dependencies:

```bash
pip install transformers vllm torch tqdm pandas
```

## Directory Structure

```
evaluation/
β”œβ”€β”€ inference.py                    # Main inference script
β”œβ”€β”€ arguments.py                    # Command-line argument definitions
β”‚
β”œβ”€β”€ data/                          # Benchmark datasets and preprocessing
β”‚   β”œβ”€β”€ benchmark.py               # Dataset preprocessing functions
β”‚   β”œβ”€β”€ aime24/, aime25/           # AIME competition problems
β”‚   β”œβ”€β”€ gpqa/                      # GPQA dataset
β”‚   β”œβ”€β”€ livecodebench/             # LiveCodeBench v5 and v6
β”‚   β”œβ”€β”€ mmlu/, mmlu_pro/           # MMLU variants
β”‚   β”œβ”€β”€ arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
β”‚   β”œβ”€β”€ ifeval/, IFBench/          # Instruction following benchmarks
β”‚   └── mt_bench/                  # MT-Bench data
β”‚
β”œβ”€β”€ eval/                          # Evaluation scripts
β”‚   β”œβ”€β”€ get_scores_math.py         # Math benchmarks (AIME24, AIME25)
β”‚   β”œβ”€β”€ get_scores_mmlu_batch.py   # MMLU, MMLU-Pro evaluation
β”‚   β”œβ”€β”€ get_scores_gpqa.py         # GPQA evaluation
β”‚   β”œβ”€β”€ get_scores_code.py         # Code benchmarks (LiveCodeBench)
β”‚   └── tools/                     # Evaluation utilities
β”‚       β”œβ”€β”€ grader.py              # Math answer grading
β”‚       β”œβ”€β”€ code_verifier_utils.py # Code execution and verification
β”‚       └── latex2sympy/           # LaTeX to SymPy conversion
β”‚
β”œβ”€β”€ run.sh                         # Example single benchmark run
β”œβ”€β”€ run_local.sh                   # Local evaluation script
β”œβ”€β”€ run_all.sh                     # Run multiple benchmarks in parallel
└── README.md                      # This file
```

## Usage

### Quick Start

1. Edit `run.sh` to configure your model and data paths
2. Run the evaluation:

```bash
bash run.sh
```

### Advanced Usage

Run inference directly with custom parameters:

```bash
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048
```

We suggest following the configuration reported in the paper and running each benchmark with several different random seeds, then averaging the scores across runs.
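
As one way to do this, here is a minimal sketch that launches one run of `inference.py` per seed; the seed list, paths, and sampling settings below are placeholders, not the paper configuration.

```python
import subprocess

# Placeholder values; substitute your own paths and the paper's settings.
seeds = [42, 43, 44, 45]
for seed in seeds:
    subprocess.run(
        [
            "python", "inference.py",
            "--model-folder", "/path/to/models",
            "--model-name", "your-model",
            "--tokenizer-folder", "/path/to/tokenizers",
            "--tokenizer-name", "your-tokenizer",
            "--benchmark-folder", "/path/to/benchmarks",
            "--eval-dataset", "aime24",
            "--temperature", "0.6",
            "--topp", "0.95",
            "--seed", str(seed),
        ],
        check=True,
    )
```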

### Key Arguments

#### Model Configuration (Required)
- `--model-folder`: Directory containing model weights
- `--model-name`: Name of the model subdirectory
- `--tokenizer-folder`: Directory containing tokenizer files
- `--tokenizer-name`: Name of the tokenizer subdirectory

#### Dataset Selection (Required for evaluation)
- `--benchmark-folder`: Root directory containing all benchmark datasets
- `--eval-dataset`: Name of the evaluation dataset (see supported datasets above)

#### Inference Parameters (Optional)
- `--temperature`: Sampling temperature (default: 0 for greedy decoding)
- `--topp`: Top-p (nucleus) sampling threshold (default: 1.0)
- `--topk`: Top-k sampling cutoff, i.e. the number of highest-probability tokens considered (default: 1)
- `--max-output-len`: Maximum output length in tokens (default: 2048)
- `--batch-size`: Batch size for inference (default: 16)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 1)

#### Dataset Subsetting (Optional)
- `--start-idx`: Starting index for dataset subsetting (default: -1, disabled)
- `--end-idx`: Ending index for dataset subsetting (default: -1, disabled)

#### Other Options
- `--seed`: Random seed for reproducibility (default: 42)
- `--no-think`: Disable thinking mode (flag, thinking enabled by default)
- `--yarn-factor`: Scaling factor for YaRN RoPE extension (default: 1)
- `--device-id`: Comma-separated GPU device IDs (optional)
- `--model-output-path`: Path to the first-turn output (required only for `mtbench_secondturn`)

## Supported Datasets

- `aime24` / `aime25`: AIME competition problems
- `lcb5` / `lcb6`: LiveCodeBench (versions 5 and 6)
- `mmlu`: MMLU 5-shot evaluation
- `mmlu_pro`: MMLU Pro dataset
- `gpqa_diamond`: GPQA Diamond subset
- `ifeval`: IFEval instruction following
- `ifbench`: IFBench instruction following
- `arena_hard`: Arena-Hard v0.1

## Running Evaluation Scripts

After generating model outputs using `inference.py`, you can compute metrics using the evaluation scripts in the `eval/` directory.

We also provide our cached generation files in the corresponding model repository for reproducibility.

### Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

```bash
cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates AIME24 and AIME25 benchmarks
- Extracts answers from `\boxed{}` and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs
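
For reference, here is a minimal sketch of boxed-answer extraction, assuming only that the final answer appears inside a `\boxed{...}` span; the actual grading in `get_scores_math.py` and `tools/grader.py` handles additional formats and mathematical equivalence checking.

```python
# Minimal sketch (not the grader's actual logic): pull the content of the
# last \boxed{...} span from a model output, handling nested braces.
def extract_boxed_answer(text):
    start = text.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, answer = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(answer)
        answer.append(ch)
        i += 1
    return None  # unbalanced braces

print(extract_boxed_answer(r"So the answer is \boxed{\frac{1}{2}}."))  # \frac{1}{2}
```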

### Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and variants:

```bash
cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy
```

This script evaluates:
- **MMLU**: Standard MMLU with 4 choices (A-D)
- **MMLU-Pro**: Extended version with up to 16 choices (A-P)

Features:
- Supports boxed answer format (e.g., `\boxed{A}`)
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with `--verbose` flag
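
As an illustrative sketch (not the exact patterns used in `get_scores_mmlu_batch.py`), letter-choice extraction can look like the following; the letter range A-P mirrors the MMLU-Pro description above.

```python
import re

# Illustrative sketch: prefer a \boxed{X} answer, then "(X)", then a bare
# standalone letter. The real script handles more formats than this.
def extract_choice(text):
    for pattern in (r"\\boxed\{\s*([A-P])\s*\}", r"\(([A-P])\)", r"\b([A-P])\b"):
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None

print(extract_choice(r"Therefore the correct option is \boxed{C}."))  # C
```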

Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

```bash
cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates GPQA Diamond subset
- Extracts answers from boxed and text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation

### Code Generation (LiveCodeBench)

Evaluate code generation performance:

```bash
cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:
- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes pass rate (percentage of problems solved correctly)
- Reports finish rate (percentage of valid code generations)
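
As a rough sketch of how these two metrics relate, the snippet below computes them from per-problem records; the field names `passed` and `has_code` are assumptions for illustration, not the script's actual schema.

```python
# Illustrative only: "passed" / "has_code" are assumed field names,
# not the actual schema produced by get_scores_code.py.
def summarize(results):
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    pass_rate = sum(r.get("passed", False) for r in results) / total
    finish_rate = sum(r.get("has_code", False) for r in results) / total
    return pass_rate, finish_rate

pass_rate, finish_rate = summarize([
    {"passed": True, "has_code": True},
    {"passed": False, "has_code": True},
    {"passed": False, "has_code": False},
])
print(f"pass rate: {pass_rate:.1%}, finish rate: {finish_rate:.1%}")  # 33.3%, 66.7%
```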

**Note**: Code execution requires:
```bash
pip install numpy tqdm
```

### Other Benchmarks

For the following benchmarks, please refer to their official evaluation repositories due to licensing restrictions:

- **Arena-Hard**: Use the [official Arena-Hard evaluation toolkit](https://github.com/lmarena/arena-hard-auto)
- **IFEval**: Use the [official IFEval evaluation script](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
- **IFBench**: Use the [official IFBench evaluation toolkit](https://github.com/instruction-following/IFBench)

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

## Output Format

Results are saved as JSONL files in:
```
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
```

Each line contains:
- `task_id` or `question_id`: Unique identifier for the question
- `output`: Model's generated response
- `reason`: Whether reasoning was used (boolean)
- `reason_text`: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields
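
For example, a small sketch that loads one of these files (the path below is a placeholder; whether a record uses `task_id` or `question_id` depends on the dataset):

```python
import json

# Load a results file written by inference.py; each line is one JSON record.
path = "/path/to/model/outputs_vllm073/aime24.jsonl"  # placeholder path
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

for rec in records[:3]:
    qid = rec.get("task_id", rec.get("question_id"))
    print(qid, repr(rec["output"][:80]))
```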

## Adding New Datasets

To add a new dataset:

1. Add a preprocessing function in `data/benchmark.py`:
   ```python
   def preprocess_your_dataset(data_file):
       """Preprocess your dataset.
       
       Args:
           data_file: Path to dataset file
       
       Returns:
           tuple: (prompt_list, qid_list) or just prompt_list
       """
       # Example body (adjust to your data format): read a JSONL file where
       # each line has "question" and "id" fields, and build the two lists.
       import json
       prompt_list, qid_list = [], []
       with open(data_file, "r", encoding="utf-8") as f:
           for line in f:
               item = json.loads(line)
               prompt_list.append(item["question"])
               qid_list.append(item["id"])
       return prompt_list, qid_list
   ```

2. Add the dataset path argument in `arguments.py`:
   ```python
   group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
   ```

3. Add the dataset case in `inference.py` in the `get_prompt_list()` function:
   ```python
   elif args.eval_dataset == "your_dataset":
       from data.benchmark import preprocess_your_dataset
       input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
       prompt_list, qid_list = preprocess_your_dataset(input_datapath)
   ```

## Notes

- The framework uses vLLM for efficient inference with batching and tensor parallelism support
- Special handling is provided for models like DeepSeek-R1 that require eager mode
- Thinking mode (`<think>` tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths

## License

See the main repository LICENSE file for licensing information.