	# ARC

	### Paper

	Title: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

	Abstract: https://arxiv.org/abs/1803.05457

	The ARC dataset consists of 7,787 science exam questions drawn from a variety
	of sources, including science questions provided under license by a research
	partner affiliated with AI2. These are text-only, English-language exam questions
	that span several grade levels, as indicated in the files. Each question has a
	multiple-choice structure (typically 4 answer options). The questions are split
	into a Challenge Set of 2,590 “hard” questions (those that both a retrieval method
	and a co-occurrence method fail to answer correctly) and an Easy Set of 5,197 questions.
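
To make the multiple-choice structure concrete, here is a minimal Python sketch of one question record, a prompt formatter, and an accuracy computation. The field names and the prompt layout are assumptions chosen for illustration, and the sample question is made up rather than taken from the dataset; this is not the harness's actual implementation.

```python
from dataclasses import dataclass
from typing import List

# Split sizes as stated in the description above.
CHALLENGE_SIZE = 2590
EASY_SIZE = 5197
TOTAL_SIZE = CHALLENGE_SIZE + EASY_SIZE  # 7,787 questions overall


@dataclass
class ArcQuestion:
    """One multiple-choice question (field names are illustrative assumptions)."""
    question: str
    choice_labels: List[str]  # e.g. ["A", "B", "C", "D"]
    choice_texts: List[str]   # answer options, aligned with choice_labels
    answer_key: str           # label of the correct option


def format_prompt(q: ArcQuestion) -> str:
    """Render a question as a plain-text prompt (hypothetical layout)."""
    lines = [f"Question: {q.question}"]
    for label, text in zip(q.choice_labels, q.choice_texts):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(questions: List[ArcQuestion], predictions: List[str]) -> float:
    """Fraction of questions whose predicted label equals the answer key."""
    correct = sum(pred == q.answer_key for q, pred in zip(questions, predictions))
    return correct / len(questions)


# Made-up example question, not drawn from ARC.
example = ArcQuestion(
    question="Which of these is a renewable source of energy?",
    choice_labels=["A", "B", "C", "D"],
    choice_texts=["coal", "sunlight", "natural gas", "petroleum"],
    answer_key="B",
)

if __name__ == "__main__":
    print(format_prompt(example))
    print("accuracy:", accuracy([example], ["B"]))
```

Because questions typically, but not always, have 4 options, the sketch iterates over whatever labels a record carries instead of hard-coding four choices.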

	Homepage: https://allenai.org/data/arc


	### Citation

	```
	@article{Clark2018ThinkYH,
	title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
	author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
	journal={ArXiv},
	year={2018},
	volume={abs/1803.05457}
	}
	```

	### Groups, Tags, and Tasks

	#### Groups

	None.

	#### Tags

	* `ai2_arc`: Evaluates `arc_easy` and `arc_challenge`

	#### Tasks

	* `arc_easy`
	* `arc_challenge`
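
The relationship between the tag and the tasks listed above can be sketched in a few lines of Python. The `TAGS` mapping and both helper functions are hypothetical names used only for illustration; they mirror what the tag is documented to do, not how the evaluation library resolves tags internally. The comma-separated task list is likewise an assumption about how task names are passed on the command line.

```python
from typing import Dict, List

# Hypothetical mapping that mirrors the tag described above.
TAGS: Dict[str, List[str]] = {
    "ai2_arc": ["arc_easy", "arc_challenge"],
}


def expand_tasks(names: List[str]) -> List[str]:
    """Replace each tag with its tasks, keeping order and dropping duplicates."""
    expanded: List[str] = []
    for name in names:
        # A name that is not a known tag is treated as a task name.
        for task in TAGS.get(name, [name]):
            if task not in expanded:
                expanded.append(task)
    return expanded


def tasks_argument(names: List[str]) -> str:
    """Join the expanded task names into one comma-separated string."""
    return ",".join(expand_tasks(names))


if __name__ == "__main__":
    print(expand_tasks(["ai2_arc"]))
    print(tasks_argument(["ai2_arc", "arc_easy"]))
```

In this sketch, requesting `ai2_arc` is equivalent to requesting `arc_easy` and `arc_challenge` individually.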

	### Checklist

	For adding novel benchmarks/datasets to the library:
	* [ ] Is the task an existing benchmark in the literature?
	* [ ] Have you referenced the original paper that introduced the task?
	* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


	If other tasks on this dataset are already supported:
	* [ ] Is the "Main" variant of this task clearly denoted?
	* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
	* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?