# LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

## Overview

The evaluation framework supports multiple benchmark datasets across different domains:

- **Math**: AIME24, AIME25 (evaluation scripts provided)
- **Coding**: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- **Multiple Choice**: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
- **Instruction Following**: IFEval, IFBench (refer to official evaluation toolkits)
- **General Helpfulness**: Arena-Hard (refer to official evaluation toolkit)

## Installation

Install the required dependencies:

```bash
pip install transformers vllm torch tqdm pandas
```

## Directory Structure

```
evaluation/
├── inference.py                 # Main inference script
├── arguments.py                 # Command-line argument definitions
│
├── data/                        # Benchmark datasets and preprocessing
│   ├── benchmark.py             # Dataset preprocessing functions
│   ├── aime24/, aime25/         # AIME competition problems
│   ├── gpqa/                    # GPQA dataset
│   ├── livecodebench/           # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/         # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/        # Instruction following benchmarks
│   └── mt_bench/                # MT-Bench data
│
├── eval/                        # Evaluation scripts
│   ├── get_scores_math.py       # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py       # GPQA evaluation
│   ├── get_scores_code.py       # Code benchmarks (LiveCodeBench)
│   └── tools/                   # Evaluation utilities
│       ├── grader.py            # Math answer grading
│       ├── code_verifier_utils.py  # Code execution and verification
│       └── latex2sympy/         # LaTeX to SymPy conversion
│
├── run.sh                       # Example single benchmark run
├── run_local.sh                 # Local evaluation script
├── run_all.sh                   # Run multiple benchmarks in parallel
└── README.md                    # This file
```

## Usage

### Quick Start

1. Edit `run.sh` to configure your model and data paths
2. Run the evaluation:

   ```bash
   bash run.sh
   ```

### Advanced Usage

Run inference directly with custom parameters:

```bash
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048
```

We suggest following the paper config and running benchmarks with k different random seeds.
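One way to run such a seed sweep is with a small driver script. The sketch below is a hypothetical helper (not shipped in this repo) that re-invokes `inference.py`, with the flags documented above, once per seed; all paths, names, and the seed list are placeholders to adapt to your setup:

```python
# sweep_seeds.py -- hypothetical helper, not part of this repo.
# Re-invokes inference.py once per random seed; all paths are placeholders.
import subprocess

SEEDS = [42, 43, 44, 45]  # choose k seeds to match the paper config

for seed in SEEDS:
    subprocess.run(
        [
            "python", "inference.py",
            "--model-folder", "/path/to/models",
            "--model-name", "your-model",
            "--tokenizer-folder", "/path/to/tokenizers",
            "--tokenizer-name", "your-tokenizer",
            "--benchmark-folder", "/path/to/benchmarks",
            "--eval-dataset", "aime24",
            "--temperature", "0.6",
            "--topp", "0.95",
            "--seed", str(seed),
        ],
        check=True,  # abort the sweep if any run fails
    )
```

Because the seed can appear in the output directory name (see Output Format below), the runs do not overwrite each other, and scripts such as `get_scores_math.py` can then report mean accuracy and standard deviation across them.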
### Key Arguments

#### Model Configuration (Required)

- `--model-folder`: Directory containing model weights
- `--model-name`: Name of the model subdirectory
- `--tokenizer-folder`: Directory containing tokenizer files
- `--tokenizer-name`: Name of the tokenizer subdirectory

#### Dataset Selection (Required for evaluation)

- `--benchmark-folder`: Root directory containing all benchmark datasets
- `--eval-dataset`: Name of the evaluation dataset (see supported datasets below)

#### Inference Parameters (Optional)

- `--temperature`: Sampling temperature (default: 0 for greedy decoding)
- `--topp`: Top-p (nucleus) sampling threshold (default: 1.0)
- `--topk`: Top-k sampling threshold (default: 1)
- `--max-output-len`: Maximum output length in tokens (default: 2048)
- `--batch-size`: Batch size for inference (default: 16)
- `--tensor-parallel-size`: Number of GPUs for tensor parallelism (default: 1)

#### Dataset Subsetting (Optional)

- `--start-idx`: Starting index for dataset subsetting (default: -1, disabled)
- `--end-idx`: Ending index for dataset subsetting (default: -1, disabled)

#### Other Options

- `--seed`: Random seed for reproducibility (default: 42)
- `--no-think`: Disable thinking mode (flag; thinking is enabled by default)
- `--yarn-factor`: Scaling factor for YaRN RoPE extension (default: 1)
- `--device-id`: Comma-separated GPU device IDs (optional)
- `--model-output-path`: Path to the first-turn output (required only for `mtbench_secondturn`)

## Supported Datasets

- `aime24` / `aime25`: AIME competition problems
- `lcb5` / `lcb6`: LiveCodeBench (versions 5 and 6)
- `mmlu`: MMLU 5-shot evaluation
- `mmlu_pro`: MMLU Pro dataset
- `gpqa_diamond`: GPQA Diamond subset
- `ifeval`: IFEval instruction following
- `ifbench`: IFBench instruction following
- `arena_hard`: Arena-Hard v0.1

## Running Evaluation Scripts

After generating model outputs with `inference.py`, you can compute metrics using the evaluation scripts in the `eval/` directory. For reproducibility, we also attach our cached generation files in the corresponding model repo.

### Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

```bash
cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:

- Evaluates the AIME24 and AIME25 benchmarks
- Extracts answers from `\boxed{}` and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs

### Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and its variants:

```bash
cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy
```

This script evaluates:

- **MMLU**: Standard MMLU with 4 choices (A-D)
- **MMLU-Pro**: Extended version with up to 16 choices (A-P)

Features:

- Supports the boxed answer format (e.g., `\boxed{A}`)
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with the `--verbose` flag
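As a concrete illustration of this extraction logic, here is a minimal sketch. It is not the actual implementation in `get_scores_mmlu_batch.py`, which handles many more formats; the function name and the two regexes below are illustrative assumptions only:

```python
# Illustrative sketch only -- the real extraction logic lives in
# eval/get_scores_mmlu_batch.py and covers many more answer formats.
import re

def extract_choice(output: str, num_choices: int = 4) -> str | None:
    """Pull a letter choice (A-D for MMLU, up to A-P for MMLU-Pro) from a response."""
    letters = "ABCDEFGHIJKLMNOP"[:num_choices]
    # Prefer an explicit \boxed{A}-style answer; take the last one if several appear.
    boxed = re.findall(r"\\boxed\{\s*([" + letters + r"])\s*\}", output)
    if boxed:
        return boxed[-1]
    # Fall back to a parenthesized letter such as "(B)".
    paren = re.findall(r"\(([" + letters + r"])\)", output)
    return paren[-1] if paren else None

assert extract_choice(r"Thus the answer is \boxed{C}.") == "C"
assert extract_choice("The correct option is (B).", num_choices=16) == "B"
```

Taking the last match rather than the first guards against option letters that merely appear earlier in the model's reasoning.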
Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

```bash
cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:

- Evaluates the GPQA Diamond subset
- Extracts answers from boxed and plain-text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation

### Code Generation (LiveCodeBench)

Evaluate code generation performance:

```bash
cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks
```

This script:

- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes the pass rate (percentage of problems solved correctly)
- Reports the finish rate (percentage of valid code generations)

**Note**: Code execution requires:

```bash
pip install numpy tqdm
```

### Other Benchmarks

For the following benchmarks, please refer to their official evaluation repositories due to licensing restrictions:

- **Arena-Hard**: Use the [official Arena-Hard evaluation toolkit](https://github.com/lmarena/arena-hard-auto)
- **IFEval**: Use the [official IFEval evaluation script](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
- **IFBench**: Use the [official IFBench evaluation toolkit](https://github.com/instruction-following/IFBench)

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

## Output Format

Results are saved as JSONL files in:

```
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
```

Each line contains:

- `task_id` or `question_id`: Unique identifier for the question
- `output`: Model's generated response
- `reason`: Whether reasoning was used (boolean)
- `reason_text`: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields

## Adding New Datasets

To add a new dataset:

1. Add a preprocessing function in `data/benchmark.py`:

   ```python
   def preprocess_your_dataset(data_file):
       """Preprocess your dataset.

       Args:
           data_file: Path to the dataset file

       Returns:
           tuple: (prompt_list, qid_list), or just prompt_list
       """
       # Your preprocessing logic
       pass
   ```

2. Add the dataset path argument in `arguments.py`:

   ```python
   group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
   ```

3. Add a case for the dataset in the `get_prompt_list()` function in `inference.py`:

   ```python
   elif args.eval_dataset == "your_dataset":
       from data.benchmark import preprocess_your_dataset
       input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
       prompt_list, qid_list = preprocess_your_dataset(input_datapath)
   ```

## Notes

- The framework uses vLLM for efficient inference, with batching and tensor-parallelism support
- Special handling is provided for models such as DeepSeek-R1 that require eager mode
- Thinking mode (`<think>` tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths

## License

See the main repository LICENSE file for licensing information.