
LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

Overview

The evaluation framework supports multiple benchmark datasets across different domains:

  • Math: AIME24, AIME25 (evaluation scripts provided)
  • Coding: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
  • Multiple Choice: MMLU, MMLU Pro, GPQA (evaluation scripts provided)
  • Instruction Following: IFEval, IFBench (refer to official evaluation toolkits)
  • General Helpfulness: Arena-Hard (refer to official evaluation toolkit)

Installation

Install required dependencies:

pip install transformers vllm torch tqdm pandas

Directory Structure

evaluation/
├── inference.py                    # Main inference script
├── arguments.py                    # Command-line argument definitions
│
├── data/                           # Benchmark datasets and preprocessing
│   ├── benchmark.py                # Dataset preprocessing functions
│   ├── aime24/, aime25/            # AIME competition problems
│   ├── gpqa/                       # GPQA dataset
│   ├── livecodebench/              # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/            # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/           # Instruction following benchmarks
│   └── mt_bench/                   # MT-Bench data
│
├── eval/                           # Evaluation scripts
│   ├── get_scores_math.py          # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py    # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py          # GPQA evaluation
│   ├── get_scores_code.py          # Code benchmarks (LiveCodeBench)
│   └── tools/                      # Evaluation utilities
│       ├── grader.py               # Math answer grading
│       ├── code_verifier_utils.py  # Code execution and verification
│       └── latex2sympy/            # LaTeX to SymPy conversion
│
├── run.sh                          # Example single benchmark run
├── run_local.sh                    # Local evaluation script
├── run_all.sh                      # Run multiple benchmarks in parallel
└── README.md                       # This file

Usage

Quick Start

  1. Edit run.sh to configure your model and data paths (a sketch of typical contents follows this list)
  2. Run the evaluation:
bash run.sh
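
The exact contents of run.sh depend on your setup; the sketch below shows the kind of values it typically sets (variable names are illustrative, and only the inference.py flags come from this README):

#!/bin/bash
# Illustrative run.sh sketch -- variable names are placeholders
MODEL_FOLDER=/path/to/models
MODEL_NAME=your-model
TOKENIZER_FOLDER=/path/to/tokenizers
TOKENIZER_NAME=your-tokenizer
BENCHMARK_FOLDER=/path/to/benchmarks

python inference.py \
    --model-folder "$MODEL_FOLDER" \
    --model-name "$MODEL_NAME" \
    --tokenizer-folder "$TOKENIZER_FOLDER" \
    --tokenizer-name "$TOKENIZER_NAME" \
    --benchmark-folder "$BENCHMARK_FOLDER" \
    --eval-dataset aime24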

Advanced Usage

Run inference directly with custom parameters:

python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048

We suggest following the sampling configuration reported in the paper and running each benchmark with k different random seeds, then averaging the resulting scores.
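
For example, a simple loop over four seeds could look like this (paths are placeholders):

for seed in 1 2 3 4; do
    python inference.py \
        --model-folder /path/to/models \
        --model-name your-model \
        --tokenizer-folder /path/to/tokenizers \
        --tokenizer-name your-tokenizer \
        --benchmark-folder /path/to/benchmarks \
        --eval-dataset aime24 \
        --temperature 0.6 \
        --topp 0.95 \
        --seed "$seed"
done

With --seed set, each run's outputs should land in a separate seed-specific directory (see Output Format below), so the evaluation scripts can report mean accuracy and standard deviation across runs.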

Key Arguments

Model Configuration (Required)

  • --model-folder: Directory containing model weights
  • --model-name: Name of the model subdirectory
  • --tokenizer-folder: Directory containing tokenizer files
  • --tokenizer-name: Name of the tokenizer subdirectory

Dataset Selection (Required for evaluation)

  • --benchmark-folder: Root directory containing all benchmark datasets
  • --eval-dataset: Name of the evaluation dataset (see supported datasets above)

Inference Parameters (Optional)

  • --temperature: Sampling temperature (default: 0 for greedy decoding)
  • --topp: Top-p (nucleus) sampling threshold (default: 1.0)
  • --topk: Top-k sampling threshold (default: 1)
  • --max-output-len: Maximum output length in tokens (default: 2048)
  • --batch-size: Batch size for inference (default: 16)
  • --tensor-parallel-size: Number of GPUs for tensor parallelism (default: 1)

Dataset Subsetting (Optional)

  • --start-idx: Starting index for dataset subsetting (default: -1, disabled)
  • --end-idx: Ending index for dataset subsetting (default: -1, disabled)
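
These flags can be used to split a long benchmark across several processes; for example, a single 100-problem slice of LiveCodeBench v5 could be generated on its own (indices are illustrative, and the required model/tokenizer flags from Advanced Usage are omitted for brevity):

python inference.py \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset lcb5 \
    --start-idx 0 \
    --end-idx 100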

Other Options

  • --seed: Random seed for reproducibility (default: 42)
  • --no-think: Disable thinking mode (flag, thinking enabled by default)
  • --yarn-factor: Scaling factor for YaRN RoPE extension (default: 1)
  • --device-id: Comma-separated GPU device IDs (optional)
  • --model-output-path: Path to the first-turn output file (required only for mtbench_secondturn)
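
As an illustration of combining these options, a run pinned to two specific GPUs with thinking disabled might look as follows (GPU ids are illustrative; the required model/tokenizer/benchmark flags from Advanced Usage are again omitted):

python inference.py \
    --eval-dataset mmlu \
    --tensor-parallel-size 2 \
    --device-id 0,1 \
    --no-think \
    --seed 42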

Supported Datasets

  • aime24 / aime25: AIME competition problems
  • lcb5 / lcb6: LiveCodeBench (versions 5 and 6)
  • mmlu: MMLU 5-shot evaluation
  • mmlu_pro: MMLU Pro dataset
  • gpqa_diamond: GPQA Diamond subset
  • ifeval: IFEval instruction following
  • ifbench: IFBench instruction following
  • arena_hard: Arena-Hard v0.1

Running Evaluation Scripts

After generating model outputs using inference.py, you can compute metrics using the evaluation scripts in the eval/ directory.

We also provide our cached generation files in the corresponding model repository for reproducibility.

Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates AIME24 and AIME25 benchmarks
  • Extracts answers from \boxed{} and other formats
  • Computes accuracy with mathematical equivalence checking
  • Reports mean accuracy and standard deviation across multiple runs

Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and variants:

cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy

This script evaluates:

  • MMLU: Standard MMLU with 4 choices (A-D)
  • MMLU-Pro: Extended version with up to 16 choices (A-P)

Features:

  • Supports boxed answer format (e.g., \boxed{A})
  • Extracts letter choices from various formats (parentheses, text, etc.)
  • Handles batch-split output files automatically
  • Computes accuracy across all MMLU variants
  • Optional per-category breakdown with --verbose flag

Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates GPQA Diamond subset
  • Extracts answers from boxed and text formats
  • Uses mathematical equivalence checking for complex answers
  • Reports accuracy with standard deviation

Code Generation (LiveCodeBench)

Evaluate code generation performance:

cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates LiveCodeBench v5 and v6
  • Executes generated code against test cases
  • Computes pass rate (percentage of problems solved correctly)
  • Reports finish rate (percentage of valid code generations)

Note: Code execution requires:

pip install numpy tqdm

Other Benchmarks

For IFEval, IFBench, and Arena-Hard (see the Overview above), please refer to their official evaluation repositories due to licensing restrictions.

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

Output Format

Results are saved as JSONL files in:

{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
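
For instance, an aime24 run with top-p 0.95 and seed 42 would typically end up in a path along these lines (the bracketed suffix is optional and depends on the sampling settings used):

/path/to/models/your-model/outputs_vllm073_topp0.95_seed42/aime24.jsonl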

Each line contains:

  • task_id or question_id: Unique identifier for the question
  • output: Model's generated response
  • reason: Whether reasoning was used (boolean)
  • reason_text: The reasoning/thinking content (if applicable)
  • Additional dataset-specific fields
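
As an illustration, a single line for a math benchmark might look roughly like this (all values here are invented; exact fields vary by dataset):

{"question_id": "aime24_5", "output": "... so the final answer is \\boxed{113}.", "reason": true, "reason_text": "First, restate the problem constraints ..."}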

Adding New Datasets

To add a new dataset:

  1. Add a preprocessing function in data/benchmark.py:

    def preprocess_your_dataset(data_file):
        """Preprocess your dataset.

        Args:
            data_file: Path to dataset file

        Returns:
            tuple: (prompt_list, qid_list) or just prompt_list
        """
        # Illustrative preprocessing only -- assumes one JSON object per line
        # with "question" and "id" fields; adapt to your data format.
        import json

        prompt_list, qid_list = [], []
        with open(data_file, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                prompt_list.append(item["question"])
                qid_list.append(item["id"])
        return prompt_list, qid_list
    
  2. Add the dataset path argument in arguments.py:

    group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
    
  3. Add the dataset case in inference.py in the get_prompt_list() function:

    elif args.eval_dataset == "your_dataset":
        from data.benchmark import preprocess_your_dataset
        input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
        prompt_list, qid_list = preprocess_your_dataset(input_datapath)
    
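After these three changes, the new dataset can be selected like any built-in one (the required model/tokenizer flags from Advanced Usage are omitted here for brevity):

python inference.py \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset your_dataset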

Notes

  • The framework uses vLLM for efficient inference with batching and tensor parallelism support
  • Special handling is provided for models like DeepSeek-R1 that require eager mode
  • Thinking mode (<think> tags) is supported for models trained with reasoning capabilities
  • YaRN RoPE scaling is supported for extended context lengths

License

See the main repository LICENSE file for licensing information.