# LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

## Overview

The evaluation framework supports multiple benchmark datasets across different domains:

- **Math**: AIME24, AIME25 (evaluation scripts provided)
- **Coding**: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- **Multiple Choice**: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
- **Instruction Following**: IFEval, IFBench (refer to official evaluation toolkits)
- **General Helpfulness**: Arena-Hard (refer to official evaluation toolkit)

## Installation

Install the required dependencies:

```bash
pip install transformers vllm torch tqdm pandas
```

## Directory Structure

```
evaluation/
├── inference.py                   # Main inference script
├── arguments.py                   # Command-line argument definitions
│
├── data/                          # Benchmark datasets and preprocessing
│   ├── benchmark.py               # Dataset preprocessing functions
│   ├── aime24/, aime25/           # AIME competition problems
│   ├── gpqa/                      # GPQA dataset
│   ├── livecodebench/             # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/           # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/          # Instruction following benchmarks
│   └── mt_bench/                  # MT-Bench data
│
├── eval/                          # Evaluation scripts
│   ├── get_scores_math.py         # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py   # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py         # GPQA evaluation
│   ├── get_scores_code.py         # Code benchmarks (LiveCodeBench)
│   └── tools/                     # Evaluation utilities
│       ├── grader.py              # Math answer grading
│       ├── code_verifier_utils.py # Code execution and verification
│       └── latex2sympy/           # LaTeX to SymPy conversion
│
├── run.sh                         # Example single benchmark run
├── run_local.sh                   # Local evaluation script
├── run_all.sh                     # Run multiple benchmarks in parallel
└── README.md                      # This file
```

## Usage

### Quick Start

1. Edit `run.sh` to configure your model and data paths
2. Run the evaluation:

```bash
bash run.sh
```

### Advanced Usage

Run inference directly with custom parameters:

```bash
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048
```
We suggest following the configuration reported in the paper and running each benchmark with k different random seeds, then averaging the resulting scores.
### Key Arguments

#### Model Configuration (Required)

- `--model-folder`: Directory containing model weights
- `--model-name`: Name of the model subdirectory
- `--tokenizer-folder`: Directory containing tokenizer files
- `--tokenizer-name`: Name of the tokenizer subdirectory

#### Dataset Selection (Required for evaluation)

- `--benchmark-folder`: Root directory containing all benchmark datasets
- `--eval-dataset`: Name of the evaluation dataset (see supported datasets below)

#### Inference Parameters (Optional)

- `--temperature`: Sampling temperature (default: 0 for greedy decoding)
- `--topp`: Top-p (nucleus) sampling threshold (default: 1.0)
- `--topk`: Top-k sampling threshold (default: 1)
- `--max-output-len`: Maximum output length in tokens (default: 2048)
- `--batch-size`: Batch size for inference (default: 16)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 1)

#### Dataset Subsetting (Optional)

- `--start-idx`: Starting index for dataset subsetting (default: -1, disabled)
- `--end-idx`: Ending index for dataset subsetting (default: -1, disabled)
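For example, `--start-idx 0 --end-idx 100` restricts inference to a 100-example slice, which can be useful for quick smoke tests or for sharding a large benchmark across jobs (this assumes the usual convention that the end index is exclusive; check `arguments.py` for the exact semantics).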
#### Other Options

- `--seed`: Random seed for reproducibility (default: 42)
- `--no-think`: Disable thinking mode (flag; thinking is enabled by default)
- `--yarn-factor`: Scaling factor for YaRN RoPE extension (default: 1)
- `--device-id`: Comma-separated GPU device IDs (optional)
- `--model-output-path`: Path to the first-turn output (required only for `mtbench_secondturn`)
## Supported Datasets

- `aime24` / `aime25`: AIME competition problems
- `lcb5` / `lcb6`: LiveCodeBench (versions 5 and 6)
- `mmlu`: MMLU 5-shot evaluation
- `mmlu_pro`: MMLU Pro dataset
- `gpqa_diamond`: GPQA Diamond subset
- `ifeval`: IFEval instruction following
- `ifbench`: IFBench instruction following
- `arena_hard`: Arena-Hard v0.1
## Running Evaluation Scripts

After generating model outputs with `inference.py`, you can compute metrics using the evaluation scripts in the `eval/` directory.
We also provide our cached generation files in the corresponding model repositories for reproducibility.
### Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

```bash
cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:

- Evaluates AIME24 and AIME25 benchmarks
- Extracts answers from `\boxed{}` and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs
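As an illustration of the answer-extraction step, here is a minimal sketch of pulling the final `\boxed{}` answer out of a response. This is a simplification: the actual logic in `get_scores_math.py` and `eval/tools/grader.py` handles more formats and nested expressions.

```python
import re

def extract_boxed_answer(text):
    """Return the content of the last \\boxed{...} in a response.

    Simplified sketch: assumes no nested braces inside the box.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

print(extract_boxed_answer(r"... so the answer is \boxed{204}."))  # 204
```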
### Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and its variants:

```bash
cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy
```

This script evaluates:

- **MMLU**: Standard MMLU with 4 choices (A-D)
- **MMLU-Pro**: Extended version with up to 10 choices (A-J)
Features:

- Supports boxed answer format (e.g., `\boxed{A}`)
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with the `--verbose` flag
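For intuition, letter-choice extraction might look like the sketch below. This is illustrative only; the patterns and their precedence in `get_scores_mmlu_batch.py` are more extensive.

```python
import re

def extract_choice(text, num_choices=10):
    """Pull a letter choice out of a model response.

    Simplified sketch; tries a few common answer formats in order.
    """
    letters = "ABCDEFGHIJKLMNOP"[:num_choices]
    patterns = [
        rf"\\boxed\{{([{letters}])\}}",        # \boxed{A}
        rf"\(([{letters}])\)",                 # (A)
        rf"answer is:?\s*\(?([{letters}])\b",  # The answer is A
    ]
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return m.group(1)
    return None

print(extract_choice(r"The answer is \boxed{C}."))  # C
```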
Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

```bash
cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:

- Evaluates the GPQA Diamond subset
- Extracts answers from boxed and text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation
### Code Generation (LiveCodeBench)

Evaluate code generation performance:

```bash
cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:

- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes the pass rate (percentage of problems solved correctly)
- Reports the finish rate (percentage of valid code generations)

**Note**: Code execution requires:

```bash
pip install numpy tqdm
```
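Conceptually, checking one stdin/stdout-style test case looks like the sketch below. This is a simplification; `eval/tools/code_verifier_utils.py` handles the actual execution details such as call-based tests and per-test timeouts.

```python
import subprocess
import sys

def passes_test(code, stdin_input, expected_output, timeout=10):
    """Run a generated solution against one I/O test case.

    Simplified sketch: executes the code in a fresh interpreter and
    compares stripped stdout with the expected output.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()
```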
### Other Benchmarks

For the following benchmarks, please refer to their official evaluation repositories due to licensing restrictions:

- **Arena-Hard**: Use the [official Arena-Hard evaluation toolkit](https://github.com/lmarena/arena-hard-auto)
- **IFEval**: Use the [official IFEval evaluation script](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
- **IFBench**: Use the [official IFBench evaluation toolkit](https://github.com/instruction-following/IFBench)

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.
## Output Format

Results are saved as JSONL files in:

```
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
```

Each line contains:

- `task_id` or `question_id`: Unique identifier for the question
- `output`: Model's generated response
- `reason`: Whether reasoning was used (boolean)
- `reason_text`: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields
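For example, a line for a math benchmark might look like this (illustrative values only):

```
{"question_id": 0, "output": "The answer is \\boxed{204}.", "reason": true, "reason_text": "Let me work through the problem step by step..."}
```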
## Adding New Datasets

To add a new dataset:

1. Add a preprocessing function in `data/benchmark.py`:
```python
import json

def preprocess_your_dataset(data_file):
    """Preprocess your dataset.

    Args:
        data_file: Path to dataset file

    Returns:
        tuple: (prompt_list, qid_list) or just prompt_list
    """
    # Example logic for a JSONL file; the field names "question"
    # and "id" are placeholders -- adapt them to your dataset.
    prompt_list, qid_list = [], []
    with open(data_file) as f:
        for line in f:
            example = json.loads(line)
            prompt_list.append(example["question"])
            qid_list.append(example["id"])
    return prompt_list, qid_list
```
2. Add the dataset path argument in `arguments.py`:

```python
group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
```

3. Add the dataset case in `inference.py` in the `get_prompt_list()` function:

```python
elif args.eval_dataset == "your_dataset":
    from data.benchmark import preprocess_your_dataset
    input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
    prompt_list, qid_list = preprocess_your_dataset(input_datapath)
```
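You can then run the new dataset by passing `--eval-dataset your_dataset` to `inference.py`.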
## Notes

- The framework uses vLLM for efficient inference with batching and tensor-parallelism support
- Special handling is provided for models like DeepSeek-R1 that require eager mode
- Thinking mode (`<think>` tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths
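For reference, separating thinking content from the final answer can be sketched as follows. This assumes the reasoning is wrapped in a single `<think>...</think>` block at the start of the response; the actual parsing in `inference.py` may differ.

```python
def split_thinking(output):
    """Split a response into (reason_text, answer).

    Sketch only: assumes at most one <think>...</think> block,
    as produced by thinking-mode models.
    """
    if "</think>" in output:
        reason_text, answer = output.split("</think>", 1)
        return reason_text.replace("<think>", "").strip(), answer.strip()
    return None, output.strip()

print(split_thinking("<think>Compute 2+2.</think>The answer is 4."))
# ('Compute 2+2.', 'The answer is 4.')
```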
## License

See the main repository LICENSE file for licensing information.