
LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

Overview

The evaluation framework supports multiple benchmark datasets across different domains:

  • Math: AIME24, AIME25 (evaluation scripts provided)
  • Coding: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
  • Multiple Choice: MMLU, MMLU Pro, GPQA (evaluation scripts provided)
  • Instruction Following: IFEval, IFBench (refer to official evaluation toolkits)
  • General Helpfulness: Arena-Hard (refer to official evaluation toolkit)

Installation

Install required dependencies:

pip install transformers vllm torch tqdm pandas

Directory Structure

evaluation/
├── inference.py                    # Main inference script
├── arguments.py                    # Command-line argument definitions
│
├── data/                           # Benchmark datasets and preprocessing
│   ├── benchmark.py                # Dataset preprocessing functions
│   ├── aime24/, aime25/            # AIME competition problems
│   ├── gpqa/                       # GPQA dataset
│   ├── livecodebench/              # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/            # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/           # Instruction following benchmarks
│   └── mt_bench/                   # MT-Bench data
│
├── eval/                           # Evaluation scripts
│   ├── get_scores_math.py          # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py    # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py          # GPQA evaluation
│   ├── get_scores_code.py          # Code benchmarks (LiveCodeBench)
│   └── tools/                      # Evaluation utilities
│       ├── grader.py               # Math answer grading
│       ├── code_verifier_utils.py  # Code execution and verification
│       └── latex2sympy/            # LaTeX to SymPy conversion
│
├── run.sh                          # Example single benchmark run
├── run_local.sh                    # Local evaluation script
├── run_all.sh                      # Run multiple benchmarks in parallel
└── README.md                       # This file

Usage

Quick Start

  1. Edit run.sh to configure your model and data paths (a sketch of typical contents follows this list)
  2. Run the evaluation:
bash run.sh
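
The exact contents of run.sh depend on your setup; the sketch below shows the kind of values it typically sets (variable names are illustrative, and only the inference.py flags come from this README):

#!/bin/bash
# Illustrative run.sh sketch -- variable names are placeholders
MODEL_FOLDER=/path/to/models
MODEL_NAME=your-model
TOKENIZER_FOLDER=/path/to/tokenizers
TOKENIZER_NAME=your-tokenizer
BENCHMARK_FOLDER=/path/to/benchmarks

python inference.py \
    --model-folder "$MODEL_FOLDER" \
    --model-name "$MODEL_NAME" \
    --tokenizer-folder "$TOKENIZER_FOLDER" \
    --tokenizer-name "$TOKENIZER_NAME" \
    --benchmark-folder "$BENCHMARK_FOLDER" \
    --eval-dataset aime24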

Advanced Usage

Run inference directly with custom parameters:

python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048

We suggest following the sampling configuration reported in the paper and running each benchmark with k different random seeds, then averaging the resulting scores.
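
For example, a simple loop over four seeds could look like this (paths are placeholders):

for seed in 1 2 3 4; do
    python inference.py \
        --model-folder /path/to/models \
        --model-name your-model \
        --tokenizer-folder /path/to/tokenizers \
        --tokenizer-name your-tokenizer \
        --benchmark-folder /path/to/benchmarks \
        --eval-dataset aime24 \
        --temperature 0.6 \
        --topp 0.95 \
        --seed "$seed"
done

With --seed set, each run's outputs should land in a separate seed-specific directory (see Output Format below), so the evaluation scripts can report mean accuracy and standard deviation across runs.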

Key Arguments

Model Configuration (Required)

  • --model-folder: Directory containing model weights
  • --model-name: Name of the model subdirectory
  • --tokenizer-folder: Directory containing tokenizer files
  • --tokenizer-name: Name of the tokenizer subdirectory

Dataset Selection (Required for evaluation)

  • --benchmark-folder: Root directory containing all benchmark datasets
  • --eval-dataset: Name of the evaluation dataset (see supported datasets above)

Inference Parameters (Optional)

  • --temperature: Sampling temperature (default: 0 for greedy decoding)
  • --topp: Top-p (nucleus) sampling threshold (default: 1.0)
  • --topk: Top-k sampling threshold (default: 1)
  • --max-output-len: Maximum output length in tokens (default: 2048)
  • --batch-size: Batch size for inference (default: 16)
  • --tensor-parallel-size: Number of GPUs for tensor parallelism (default: 1)

Dataset Subsetting (Optional)

  • --start-idx: Starting index for dataset subsetting (default: -1, disabled)
  • --end-idx: Ending index for dataset subsetting (default: -1, disabled)
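
These flags can be used to split a long benchmark across several processes; for example, a single 100-problem slice of LiveCodeBench v5 could be generated on its own (indices are illustrative, and the required model/tokenizer flags from Advanced Usage are omitted for brevity):

python inference.py \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset lcb5 \
    --start-idx 0 \
    --end-idx 100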

Other Options

  • --seed: Random seed for reproducibility (default: 42)
  • --no-think: Disable thinking mode (flag, thinking enabled by default)
  • --yarn-factor: Scaling factor for YaRN RoPE extension (default: 1)
  • --device-id: Comma-separated GPU device IDs (optional)
  • --model-output-path: Path to the first-turn output file (required only for mtbench_secondturn)
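
As an illustration of combining these options, a run pinned to two specific GPUs with thinking disabled might look as follows (GPU ids are illustrative; the required model/tokenizer/benchmark flags from Advanced Usage are again omitted):

python inference.py \
    --eval-dataset mmlu \
    --tensor-parallel-size 2 \
    --device-id 0,1 \
    --no-think \
    --seed 42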

Supported Datasets

  • aime24 / aime25: AIME competition problems
  • lcb5 / lcb6: LiveCodeBench (versions 5 and 6)
  • mmlu: MMLU 5-shot evaluation
  • mmlu_pro: MMLU Pro dataset
  • gpqa_diamond: GPQA Diamond subset
  • ifeval: IFEval instruction following
  • ifbench: IFBench instruction following
  • arena_hard: Arena-Hard v0.1

Running Evaluation Scripts

After generating model outputs using inference.py, you can compute metrics using the evaluation scripts in the eval/ directory.

We also provide our cached generation files in the corresponding model repository for reproducibility.

Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates AIME24 and AIME25 benchmarks
  • Extracts answers from \boxed{} and other formats
  • Computes accuracy with mathematical equivalence checking
  • Reports mean accuracy and standard deviation across multiple runs

Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and variants:

cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy

This script evaluates:

  • MMLU: Standard MMLU with 4 choices (A-D)
  • MMLU-Pro: Extended version with up to 16 choices (A-P)

Features:

  • Supports boxed answer format (e.g., \boxed{A})
  • Extracts letter choices from various formats (parentheses, text, etc.)
  • Handles batch-split output files automatically
  • Computes accuracy across all MMLU variants
  • Optional per-category breakdown with --verbose flag

Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates GPQA Diamond subset
  • Extracts answers from boxed and text formats
  • Uses mathematical equivalence checking for complex answers
  • Reports accuracy with standard deviation

Code Generation (LiveCodeBench)

Evaluate code generation performance:

cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates LiveCodeBench v5 and v6
  • Executes generated code against test cases
  • Computes pass rate (percentage of problems solved correctly)
  • Reports finish rate (percentage of valid code generations)

Note: Code execution requires:

pip install numpy tqdm

Other Benchmarks

For IFEval, IFBench, and Arena-Hard (see the Overview above), please refer to their official evaluation repositories due to licensing restrictions.

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

Output Format

Results are saved as JSONL files in:

{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
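
For instance, an aime24 run with top-p 0.95 and seed 42 would typically end up in a path along these lines (the bracketed suffix is optional and depends on the sampling settings used):

/path/to/models/your-model/outputs_vllm073_topp0.95_seed42/aime24.jsonl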

Each line contains:

  • task_id or question_id: Unique identifier for the question
  • output: Model's generated response
  • reason: Whether reasoning was used (boolean)
  • reason_text: The reasoning/thinking content (if applicable)
  • Additional dataset-specific fields
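
As an illustration, a single line for a math benchmark might look roughly like this (all values here are invented; exact fields vary by dataset):

{"question_id": "aime24_5", "output": "... so the final answer is \\boxed{113}.", "reason": true, "reason_text": "First, restate the problem constraints ..."}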

Adding New Datasets

To add a new dataset:

  1. Add a preprocessing function in data/benchmark.py:

    def preprocess_your_dataset(data_file):
        """Preprocess your dataset.

        Args:
            data_file: Path to dataset file

        Returns:
            tuple: (prompt_list, qid_list) or just prompt_list
        """
        # Illustrative preprocessing only -- assumes one JSON object per line
        # with "question" and "id" fields; adapt to your data format.
        import json

        prompt_list, qid_list = [], []
        with open(data_file, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                prompt_list.append(item["question"])
                qid_list.append(item["id"])
        return prompt_list, qid_list
    
  2. Add the dataset path argument in arguments.py:

    group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
    
  3. Add the dataset case in inference.py in the get_prompt_list() function:

    elif args.eval_dataset == "your_dataset":
        from data.benchmark import preprocess_your_dataset
        input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
        prompt_list, qid_list = preprocess_your_dataset(input_datapath)
    
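After these three changes, the new dataset can be selected like any built-in one (the required model/tokenizer flags from Advanced Usage are omitted here for brevity):

python inference.py \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset your_dataset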

Notes

  • The framework uses vLLM for efficient inference with batching and tensor parallelism support
  • Special handling is provided for models like DeepSeek-R1 that require eager mode
  • Thinking mode (<think> tags) is supported for models trained with reasoning capabilities
  • YaRN RoPE scaling is supported for extended context lengths

License

See the main repository LICENSE file for licensing information.