# AIME

### Paper
The American Invitational Mathematics Examination (AIME) is a selective and prestigious 15-question, 3-hour exam given to high school students who qualify based on their AMC 10 or AMC 12 scores. All answers are integers between 0 and 999 inclusive, and questions increase in difficulty as the exam progresses.

This dataset evaluates mathematical problem-solving on competition-level problems.

Homepage: https://huggingface.co/datasets/simplescaling/aime_nofigures
### Dataset

This implementation includes two variants:

- `aime_nofigures`: AIME problems without figures/diagrams
- `aime_figures`: AIME problems with figures/diagrams

The dataset uses problems from past AIME competitions, formatted for language model evaluation.
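
For reference, the underlying data can be inspected directly with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split name is an assumption, so check the dataset card if it differs:

```python
# Minimal sketch: load and inspect the AIME data with Hugging Face `datasets`.
from datasets import load_dataset

# The "train" split is an assumption based on typical dataset layouts.
ds = load_dataset("simplescaling/aime_nofigures", split="train")
print(len(ds))   # number of problems
print(ds[0])     # one record; field names depend on the dataset schema
```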
### Groups and Tasks

#### Groups

- `math_word_problems`
#### Tasks

- `aime_nofigures`: AIME problems without figures
- `aime_figures`: AIME problems with figures
- `aime24_nofigures`: AIME 2024 problems without figures
- `aime24_figures`: AIME 2024 problems with figures
- `aime25_nofigures`: AIME 2025 problems without figures
- Aggregated variants (`agg8`, `agg64`) that sample multiple completions per problem for the coverage and majority-voting metrics described under Evaluation
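
These tasks can be run through the harness's Python entry point. A hedged sketch using `lm_eval.simple_evaluate` (the model below is only a placeholder; any Hugging Face causal LM works the same way):

```python
# Sketch: evaluate one AIME task via lm-evaluation-harness's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    tasks=["aime24_nofigures"],
)
print(results["results"]["aime24_nofigures"])  # per-task metrics
```

Equivalently, the `lm_eval` CLI accepts the same task names via `--tasks`.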
### Evaluation

The evaluation checks whether the model's output matches the correct integer answer (0-999). The implementation includes:

- Answer extraction from model outputs
- Support for boxed answers (e.g., `\boxed{123}`)
- Optional GPT-4o-mini-based answer extraction for complex formats
- Coverage and majority-voting metrics for the aggregated tasks (see the sketch below)
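
The core of this logic can be approximated as follows. This is a simplified sketch, not the task's exact implementation: the regex handles only plain `\boxed{...}` answers, while the real extraction also covers unboxed formats and the optional GPT-4o-mini fallback:

```python
# Simplified sketch of boxed-answer extraction, majority voting, and coverage.
import re
from collections import Counter

def extract_answer(output: str) -> int | None:
    """Pull the last \\boxed{...} integer (0-999) from a model output."""
    matches = re.findall(r"\\boxed\{(\d{1,3})\}", output)
    return int(matches[-1]) if matches else None

def majority_vote(outputs: list[str]) -> int | None:
    """Most common extracted answer across samples (agg8/agg64 tasks)."""
    answers = [a for a in map(extract_answer, outputs) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

def coverage(outputs: list[str], gold: int) -> bool:
    """Coverage: does any sampled output yield the gold answer?"""
    return any(extract_answer(o) == gold for o in outputs)
```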
### Environment Variables

- `PROCESSOR=gpt-4o-mini`: use GPT-4o-mini for answer extraction
- `PROMPTSTEP`: add a thinking-steps prompt
- `PROMPTTOKEN`: add a thinking-tokens prompt
- `PROMPTLONG`: add a long-thinking prompt
- `PROMPTSHORT`: add a short-thinking prompt
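
These variables are read from the process environment, so they must be set before the harness runs. A minimal sketch enabling the GPT-4o-mini processor (an OpenAI API key in the environment is an assumed prerequisite):

```python
# Sketch: enable GPT-4o-mini answer extraction for a run.
# Env vars must be set before the task code reads them; an OpenAI API key
# is assumed to be configured in the environment.
import os

os.environ["PROCESSOR"] = "gpt-4o-mini"
```

From a shell, the equivalent is prefixing the command, e.g. `PROCESSOR=gpt-4o-mini lm_eval --tasks aime24_nofigures ...`.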
### Checklist

- [ ] Is in Eval-harness v1.0?
- [ ] Has been checked for regression from v1.0?
- [ ] Has been checked for equivalence with the original paper's methodology?
- [ ] Is the "Main" variant of this task clearly denoted?