# LLM Evaluation Framework
This directory contains tools for evaluating large language models on various benchmarks.
## Overview
The evaluation framework supports multiple benchmark datasets across different domains:
- **Math**: AIME24, AIME25 (evaluation scripts provided)
- **Coding**: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- **Multiple Choice**: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
- **Instruction Following**: IFEval, IFBench (refer to official evaluation toolkits)
- **General Helpfulness**: Arena-Hard (refer to official evaluation toolkit)
## Installation
Install required dependencies:
```bash
pip install transformers vllm torch tqdm pandas
```
## Directory Structure
```
evaluation/
├── inference.py                     # Main inference script
├── arguments.py                     # Command-line argument definitions
│
├── data/                            # Benchmark datasets and preprocessing
│   ├── benchmark.py                 # Dataset preprocessing functions
│   ├── aime24/, aime25/             # AIME competition problems
│   ├── gpqa/                        # GPQA dataset
│   ├── livecodebench/               # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/             # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/            # Instruction following benchmarks
│   └── mt_bench/                    # MT-Bench data
│
├── eval/                            # Evaluation scripts
│   ├── get_scores_math.py           # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py     # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py           # GPQA evaluation
│   ├── get_scores_code.py           # Code benchmarks (LiveCodeBench)
│   └── tools/                       # Evaluation utilities
│       ├── grader.py                # Math answer grading
│       ├── code_verifier_utils.py   # Code execution and verification
│       └── latex2sympy/             # LaTeX to SymPy conversion
│
├── run.sh                           # Example single benchmark run
├── run_local.sh                     # Local evaluation script
├── run_all.sh                       # Run multiple benchmarks in parallel
└── README.md                        # This file
```
## Usage
### Quick Start
1. Edit `run.sh` to configure your model and data paths
2. Run the evaluation:
```bash
bash run.sh
```
### Advanced Usage
Run inference directly with custom parameters:
```bash
python inference.py \
--model-folder /path/to/models \
--model-name your-model \
--tokenizer-folder /path/to/tokenizers \
--tokenizer-name your-tokenizer \
--benchmark-folder /path/to/benchmarks \
--eval-dataset aime24 \
--temperature 0.6 \
--topp 0.95 \
--batch-size 2048
```
We suggest following the configuration reported in the paper and running each benchmark with k different random seeds (set via `--seed`); the evaluation scripts then report mean and standard deviation across runs.
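For example, a small driver script along these lines (the seed values and paths are purely illustrative) launches the same benchmark several times, changing only `--seed` between runs:
```python
import subprocess

# Illustrative only: sweep a few random seeds for one benchmark.
# Replace the paths and the seed list with your own configuration.
for seed in (42, 43, 44, 45):
    subprocess.run(
        [
            "python", "inference.py",
            "--model-folder", "/path/to/models", "--model-name", "your-model",
            "--tokenizer-folder", "/path/to/tokenizers", "--tokenizer-name", "your-tokenizer",
            "--benchmark-folder", "/path/to/benchmarks", "--eval-dataset", "aime24",
            "--temperature", "0.6", "--topp", "0.95",
            "--seed", str(seed),
        ],
        check=True,
    )
```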
### Key Arguments
#### Model Configuration (Required)
- `--model-folder`: Directory containing model weights
- `--model-name`: Name of the model subdirectory
- `--tokenizer-folder`: Directory containing tokenizer files
- `--tokenizer-name`: Name of the tokenizer subdirectory
#### Dataset Selection (Required for evaluation)
- `--benchmark-folder`: Root directory containing all benchmark datasets
- `--eval-dataset`: Name of the evaluation dataset (see supported datasets above)
#### Inference Parameters (Optional)
- `--temperature`: Sampling temperature (default: 0 for greedy decoding)
- `--topp`: Top-p (nucleus) sampling threshold (default: 1.0)
- `--topk`: Top-k sampling threshold (default: 1)
- `--max-output-len`: Maximum output length in tokens (default: 2048)
- `--batch-size`: Batch size for inference (default: 16)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 1)
#### Dataset Subsetting (Optional)
- `--start-idx`: Starting index for dataset subsetting (default: -1, disabled)
- `--end-idx`: Ending index for dataset subsetting (default: -1, disabled)
#### Other Options
- `--seed`: Random seed for reproducibility (default: 42)
- `--no-think`: Disable thinking mode (flag, thinking enabled by default)
- `--yarn-factor`: Scaling factor for YaRN RoPE extension (default: 1)
- `--device-id`: Comma-separated GPU device IDs (optional)
- `--model-output-path`: Path to the first-turn outputs (required only for `mtbench_secondturn`)
## Supported Datasets
- `aime24` / `aime25`: AIME competition problems
- `lcb5` / `lcb6`: LiveCodeBench (versions 5 and 6)
- `mmlu`: MMLU 5-shot evaluation
- `mmlu_pro`: MMLU Pro dataset
- `gpqa_diamond`: GPQA Diamond subset
- `ifeval`: IFEval instruction following
- `ifbench`: IFBench instruction following
- `arena_hard`: Arena-Hard v0.1
## Running Evaluation Scripts
After generating model outputs using `inference.py`, you can compute metrics using the evaluation scripts in the `eval/` directory.
We also provide our cached generation files in the corresponding model repository for reproducibility.
### Math Benchmarks (AIME24, AIME25)
Evaluate math problem-solving performance:
```bash
cd eval
python get_scores_math.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks
```
This script:
- Evaluates AIME24 and AIME25 benchmarks
- Extracts answers from `\boxed{}` and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs
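The actual grading lives in `eval/tools/grader.py` together with the bundled `latex2sympy` converter. As a rough sketch of the idea only (the helper names below are made up and the equivalence check is deliberately naive), answer extraction and per-run scoring look roughly like this:
```python
import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} answer in the output, if any."""
    # Illustrative only: the real extractor also handles nested braces and other formats.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def answers_equivalent(pred, gold):
    """Naive string comparison; the real grader checks mathematical equivalence
    (e.g. via SymPy after LaTeX conversion)."""
    normalize = lambda s: s.replace(" ", "").rstrip(".")
    return normalize(pred) == normalize(gold)

def score_run(records):
    """Accuracy for one run, given records like {'output': ..., 'answer': ...}."""
    correct = 0
    for item in records:
        pred = extract_boxed(item["output"])
        if pred is not None and answers_equivalent(pred, item["answer"]):
            correct += 1
    return correct / len(records)
```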
### Multiple Choice (MMLU, MMLU-Pro, GPQA)
Evaluate MMLU and variants:
```bash
cd eval
python get_scores_mmlu_batch.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks \
--verbose # Optional: print per-category accuracy
```
This script evaluates:
- **MMLU**: Standard MMLU with 4 choices (A-D)
- **MMLU-Pro**: Extended version with up to 16 choices (A-P)
Features:
- Supports boxed answer format (e.g., `\boxed{A}`)
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with `--verbose` flag
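As a rough illustration (not the script's actual logic; the patterns are simplified), letter extraction can be approximated with a few regular expressions:
```python
import re

# Simplified, hypothetical patterns; get_scores_mmlu_batch.py covers more formats.
CHOICE_PATTERNS = [
    r"\\boxed\{([A-P])\}",        # \boxed{B}
    r"answer is \(?([A-P])\)?",   # "the answer is (B)" / "answer is B"
    r"^\(?([A-P])[\)\.\s]",       # a bare leading "(B)" or "B." on its own line
]

def extract_choice(text, num_choices=4):
    """Return the predicted letter, restricted to the valid choice range."""
    valid = set("ABCDEFGHIJKLMNOP"[:num_choices])
    for pattern in CHOICE_PATTERNS:
        m = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        if m and m.group(1).upper() in valid:
            return m.group(1).upper()
    return None
```
For MMLU-Pro, `num_choices` would be raised to 16 to cover choices A through P.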
Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:
```bash
cd eval
python get_scores_gpqa.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks
```
This script:
- Evaluates GPQA Diamond subset
- Extracts answers from boxed and text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation
### Code Generation (LiveCodeBench)
Evaluate code generation performance:
```bash
cd eval
python get_scores_code.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks
```
This script:
- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes pass rate (percentage of problems solved correctly)
- Reports finish rate (percentage of valid code generations)
**Note**: Code execution requires:
```bash
pip install numpy tqdm
```
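Both reported numbers are simple ratios over the problem set. A minimal sketch, assuming a hypothetical per-problem result record produced after executing the generated code against its test cases (the field names are illustrative, not the script's actual schema):
```python
def summarize(results):
    """results: list of dicts like {"code_extracted": bool, "all_tests_passed": bool}."""
    n = len(results)
    finished = sum(r["code_extracted"] for r in results)   # valid code generations
    passed = sum(r["all_tests_passed"] for r in results)   # all test cases passed
    return {
        "finish_rate": finished / n,
        "pass_rate": passed / n,
    }
```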
### Other Benchmarks
For the following benchmarks, please refer to their official evaluation repositories due to licensing restrictions:
- **Arena-Hard**: Use the [official Arena-Hard evaluation toolkit](https://github.com/lmarena/arena-hard-auto)
- **IFEval**: Use the [official IFEval evaluation script](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
- **IFBench**: Use the [official IFBench evaluation toolkit](https://github.com/instruction-following/IFBench)
These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.
## Output Format
Results are saved as JSONL files in:
```
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
```
Each line contains:
- `task_id` or `question_id`: Unique identifier for the question
- `output`: Model's generated response
- `reason`: Whether reasoning was used (boolean)
- `reason_text`: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields
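For quick inspection, the files can be read with the standard `json` module; the path below is just an example that follows the template above:
```python
import json

# Example path; substitute your own model folder, sampling settings, and dataset.
path = "/path/to/models/your-model/outputs_vllm073_topp0.95_seed42/aime24.jsonl"

with open(path) as f:
    records = [json.loads(line) for line in f]

for rec in records[:3]:
    qid = rec.get("task_id", rec.get("question_id"))
    print(qid, rec.get("reason"), len(rec["output"]))
```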
## Adding New Datasets
To add a new dataset:
1. Add a preprocessing function in `data/benchmark.py`:
```python
def preprocess_your_dataset(data_file):
    """Preprocess your dataset.

    Args:
        data_file: Path to dataset file

    Returns:
        tuple: (prompt_list, qid_list) or just prompt_list
    """
    # Your preprocessing logic
    pass
```
2. Add the dataset path argument in `arguments.py`:
```python
group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
```
3. Add the dataset case in `inference.py` in the `get_prompt_list()` function:
```python
elif args.eval_dataset == "your_dataset":
    from data.benchmark import preprocess_your_dataset
    input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
    prompt_list, qid_list = preprocess_your_dataset(input_datapath)
```
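Putting the three steps together, a minimal preprocessing function for a hypothetical JSONL dataset (one JSON object per line; the `id` and `question` field names are illustrative) could look like this:
```python
import json

def preprocess_your_dataset(data_file):
    """Minimal example: read prompts and question IDs from a JSONL file."""
    prompt_list, qid_list = [], []
    with open(data_file) as f:
        for line in f:
            item = json.loads(line)
            prompt_list.append(item["question"])
            qid_list.append(item["id"])
    return prompt_list, qid_list
```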
## Notes
- The framework uses vLLM for efficient inference with batching and tensor parallelism support
- Special handling is provided for models like DeepSeek-R1 that require eager mode
- Thinking mode (`<think>` tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths
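If you post-process raw generations yourself, the thinking content can be separated from the final answer by splitting on the `<think>` tags; a minimal sketch, assuming the model closes the block with `</think>`:
```python
import re

def split_thinking(output):
    """Split a generation into (thinking, answer); thinking is None if absent."""
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if m:
        return m.group(1).strip(), output[m.end():].strip()
    return None, output.strip()
```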
## License
See the main repository LICENSE file for licensing information.