koichi12 committed on
Commit 181156e · verified · 1 Parent(s): f69a342

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/README.md +50 -0
  2. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_aclue.yaml +26 -0
  3. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_default_template_yaml +18 -0
  4. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_generate_configs.py +82 -0
  5. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_chinese_culture.yaml +4 -0
  6. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_literature.yaml +4 -0
  7. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_medical.yaml +4 -0
  8. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_phonetics.yaml +4 -0
  9. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_basic_ancient_chinese.yaml +4 -0
  10. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_couplet_prediction.yaml +4 -0
  11. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_homographic_character_resolution.yaml +4 -0
  12. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_named_entity_recognition.yaml +4 -0
  13. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_appreciate.yaml +4 -0
  14. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_context_prediction.yaml +4 -0
  15. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_quality_assessment.yaml +4 -0
  16. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_sentiment_analysis.yaml +4 -0
  17. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_polysemy_resolution.yaml +4 -0
  18. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_reading_comprehension.yaml +4 -0
  19. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_sentence_segmentation.yaml +4 -0
  20. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/README.md +60 -0
  21. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_1dc.yaml +18 -0
  22. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2da.yaml +5 -0
  23. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2dm.yaml +5 -0
  24. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2ds.yaml +5 -0
  25. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3da.yaml +5 -0
  26. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3ds.yaml +5 -0
  27. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4da.yaml +5 -0
  28. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4ds.yaml +5 -0
  29. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5da.yaml +5 -0
  30. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5ds.yaml +5 -0
  31. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_default_ceval_yaml +18 -0
  32. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_generate_configs.py +142 -0
  33. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_basic_medicine.yaml +4 -0
  34. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_college_physics.yaml +4 -0
  35. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_fire_engineer.yaml +4 -0
  36. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chemistry.yaml +4 -0
  37. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chinese.yaml +4 -0
  38. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_mathematics.yaml +4 -0
  39. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_politics.yaml +4 -0
  40. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_biology.yaml +4 -0
  41. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_chemistry.yaml +4 -0
  42. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_physician.yaml +4 -0
  43. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_professional_tour_guide.yaml +4 -0
  44. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_sports_science.yaml +4 -0
  45. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_teacher_qualification.yaml +4 -0
  46. scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/README.md +60 -0
  47. scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/default.yaml +12 -0
  48. scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/README.md +47 -0
  49. scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/colloquial.yaml +4 -0
  50. scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/standard.yaml +14 -0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/README.md ADDED
@@ -0,0 +1,50 @@
+ # ACLUE
+
+ ### Paper
+
+ Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE
+ https://arxiv.org/abs/2310.09550
+
+ The Ancient Chinese Language Understanding Evaluation (ACLUE) is an evaluation benchmark focused on ancient Chinese language comprehension. It aims to assess the performance of large-scale language models on understanding ancient Chinese. The benchmark comprises 15 tasks spanning various domains, including lexical, syntactic, semantic, inference, and knowledge. ACLUE's tasks are derived from a combination of manually curated questions from publicly available resources, and automatically
+ generated questions from classical Chinese language corpora. The questions span from the Xia dynasty (2070 BCE) to the Ming dynasty (1368 CE). ACLUE adopts a multiple-choice question format for all tasks.
+
+ Homepage: https://github.com/isen-zhang/ACLUE
+
+ ### Citation
+
+ ```bibtex
+ @inproceedings{zhang-li-2023-large,
+     title = "Can Large Language Model Comprehend {A}ncient {C}hinese? A Preliminary Test on {ACLUE}",
+     author = "Zhang, Yixuan and Li, Haonan",
+     booktitle = "Proceedings of the Ancient Language Processing Workshop",
+     month = sep,
+     year = "2023",
+     address = "Varna, Bulgaria",
+     publisher = "INCOMA Ltd., Shoumen, Bulgaria",
+     url = "https://aclanthology.org/2023.alp-1.9",
+     pages = "80--87"
+ }
+ ```
+
+ ### Groups, Tags, and Tasks
+
+ #### Groups
+
+ - `aclue`: All 15 subjects of the ACLUE dataset, evaluated following the methodology in CMMLU's original implementation.
+
+ #### Tasks
+
+ The following tasks evaluate subjects in the ACLUE dataset using loglikelihood-based multiple-choice scoring:
+ - `aclue_{subject_english}`
+
+ ### Checklist
+
+ * [x] Is the task an existing benchmark in the literature?
+   * [x] Have you referenced the original paper that introduced the task?
+   * [x] If yes, does the original paper provide a reference implementation?
+     * [x] Yes, original implementation contributed by author of the benchmark
+
+ If other tasks on this dataset are already supported:
+ * [x] Is the "Main" variant of this task clearly denoted?
+ * [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [x] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_aclue.yaml ADDED
@@ -0,0 +1,26 @@
+ group: aclue
+ task:
+   - aclue_ancient_chinese_culture
+   - aclue_ancient_literature
+   - aclue_ancient_medical
+   - aclue_ancient_phonetics
+   - aclue_basic_ancient_chinese
+   - aclue_couplet_prediction
+   - aclue_homographic_character_resolution
+   - aclue_named_entity_recognition
+   - aclue_poetry_appreciate
+   - aclue_poetry_context_prediction
+   - aclue_poetry_quality_assessment
+   - aclue_poetry_sentiment_analysis
+   - aclue_polysemy_resolution
+   - aclue_reading_comprehension
+   - aclue_sentence_segmentation
+ aggregate_metric_list:
+   - metric: acc
+     aggregation: mean
+     weight_by_size: true
+   - metric: acc_norm
+     aggregation: mean
+     weight_by_size: true
+ metadata:
+   version: 1.0
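
The `weight_by_size: true` entries in `_aclue.yaml` make the group score a size-weighted mean of the per-subtask scores. A minimal sketch of that aggregation, assuming those semantics (the task names, accuracies, and sizes below are made up):

```python
# Size-weighted mean, as suggested by "aggregation: mean" together with
# "weight_by_size: true" in _aclue.yaml. All numbers are hypothetical.
task_results = [
    {"task": "aclue_couplet_prediction", "acc": 0.50, "size": 100},
    {"task": "aclue_ancient_literature", "acc": 0.80, "size": 300},
]

total_size = sum(r["size"] for r in task_results)
group_acc = sum(r["acc"] * r["size"] for r in task_results) / total_size
print(group_acc)  # 0.725
```

With `weight_by_size: false` the group score would instead be the unweighted mean (0.65 here), so large subtasks would no longer dominate the aggregate.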
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_default_template_yaml ADDED
@@ -0,0 +1,18 @@
+ dataset_path: tyouisen/aclue
+ test_split: test
+ fewshot_split: dev
+ fewshot_config:
+   sampler: first_n
+ output_type: multiple_choice
+ doc_to_text: "{{Question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n答案:"
+ doc_to_choice: ["A", "B", "C", "D"]
+ doc_to_target: "{{['A', 'B', 'C', 'D'].index(Answer)}}"
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+   - metric: acc_norm
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 1.0
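
The `doc_to_text`/`doc_to_target` templates above turn a dataset row into a prompt and a gold choice index. A pure-Python sketch of what they produce (this is not the harness's Jinja renderer; the example row is hypothetical, using the `Question`/`A`-`D`/`Answer` fields the template assumes):

```python
# Hypothetical ACLUE row with the fields referenced by the template.
doc = {
    "Question": " 下列选项中属于通假字的是? ",
    "A": "甲", "B": "乙", "C": "丙", "D": "丁",
    "Answer": "B",
}

# doc_to_text: "{{Question.strip()}}\nA. {{A}}\n...\n答案:"
prompt = (
    f"{doc['Question'].strip()}\n"
    f"A. {doc['A']}\nB. {doc['B']}\nC. {doc['C']}\nD. {doc['D']}\n答案:"
)

# doc_to_target: "{{['A', 'B', 'C', 'D'].index(Answer)}}"
target = ["A", "B", "C", "D"].index(doc["Answer"])

print(prompt)
print(target)  # 1
```

The harness then scores the loglikelihood of each entry in `doc_to_choice` after the prompt and marks the example correct when the `target` index wins.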
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_generate_configs.py ADDED
@@ -0,0 +1,82 @@
+ """
+ Take in a YAML, and output all other splits with this YAML
+ """
+
+ import argparse
+ import os
+
+ import yaml
+ from tqdm import tqdm
+
+ from lm_eval.utils import eval_logger
+
+
+ SUBJECTS = {
+     "古文单字多义": "polysemy_resolution",
+     "诗词情感分类": "poetry_sentiment_analysis",
+     "古汉语命名体识别": "named_entity_recognition",
+     "古汉语知识": "basic_ancient_chinese",
+     "古诗词上下句预测": "poetry_context_prediction",
+     "古文断句": "sentence_segmentation",
+     "对联": "couplet_prediction",
+     "古诗词曲鉴赏": "poetry_appreciate",
+     "国学常识": "ancient_chinese_culture",
+     "古音学": "ancient_phonetics",
+     "通假字": "homographic_character_resolution",
+     "古代文学知识": "ancient_literature",
+     "医古文": "ancient_medical",
+     "古诗词质量评估": "poetry_quality_assessment",
+     "古文阅读理解": "reading_comprehension",
+ }
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--base_yaml_path", required=True)
+     parser.add_argument("--save_prefix_path", default="aclue")
+     parser.add_argument("--cot_prompt_path", default=None)
+     parser.add_argument("--task_prefix", default="")
+     return parser.parse_args()
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+
+     # get filename of base_yaml so we can `"include": ` it in our other YAMLs.
+     base_yaml_name = os.path.split(args.base_yaml_path)[-1]
+     with open(args.base_yaml_path, encoding="utf-8") as f:
+         base_yaml = yaml.full_load(f)
+
+     if args.cot_prompt_path is not None:
+         import json
+
+         with open(args.cot_prompt_path, encoding="utf-8") as f:
+             cot_file = json.load(f)
+
+     for subject_zh, subject_eng in tqdm(SUBJECTS.items()):
+         if args.cot_prompt_path is not None:
+             description = cot_file[subject_eng]
+         else:
+             description = (
+                 f"以下是关于{subject_zh}的单项选择题,请直接给出正确答案的选项。\n\n"
+             )
+
+         yaml_dict = {
+             "include": base_yaml_name,
+             "task": f"aclue_{args.task_prefix}_{subject_eng}"
+             if args.task_prefix != ""
+             else f"aclue_{subject_eng}",
+             "dataset_name": subject_eng,
+             "description": description,
+         }
+
+         file_save_path = args.save_prefix_path + f"_{subject_eng}.yaml"
+         eval_logger.info(f"Saving yaml for subset {subject_eng} to {file_save_path}")
+         with open(file_save_path, "w", encoding="utf-8") as yaml_file:
+             yaml.dump(
+                 yaml_dict,
+                 yaml_file,
+                 width=float("inf"),
+                 allow_unicode=True,
+                 default_style='"',
+             )
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_chinese_culture.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_chinese_culture"
+ "description": "以下是关于国学常识的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_chinese_culture"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_literature.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_literature"
+ "description": "以下是关于古代文学知识的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_literature"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_medical.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_medical"
+ "description": "以下是关于医古文的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_medical"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_phonetics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_phonetics"
+ "description": "以下是关于古音学的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_phonetics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_basic_ancient_chinese.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "basic_ancient_chinese"
+ "description": "以下是关于古汉语知识的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_basic_ancient_chinese"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_couplet_prediction.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "couplet_prediction"
+ "description": "以下是关于对联的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_couplet_prediction"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_homographic_character_resolution.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "homographic_character_resolution"
+ "description": "以下是关于通假字的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_homographic_character_resolution"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_named_entity_recognition.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "named_entity_recognition"
+ "description": "以下是关于古汉语命名体识别的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_named_entity_recognition"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_appreciate.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_appreciate"
+ "description": "以下是关于古诗词曲鉴赏的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_appreciate"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_context_prediction.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_context_prediction"
+ "description": "以下是关于古诗词上下句预测的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_context_prediction"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_quality_assessment.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_quality_assessment"
+ "description": "以下是关于古诗词质量评估的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_quality_assessment"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_sentiment_analysis.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_sentiment_analysis"
+ "description": "以下是关于诗词情感分类的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_sentiment_analysis"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_polysemy_resolution.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "polysemy_resolution"
+ "description": "以下是关于古文单字多义的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_polysemy_resolution"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_reading_comprehension.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "reading_comprehension"
+ "description": "以下是关于古文阅读理解的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_reading_comprehension"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_sentence_segmentation.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "sentence_segmentation"
+ "description": "以下是关于古文断句的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_sentence_segmentation"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/README.md ADDED
@@ -0,0 +1,60 @@
+ # Arithmetic
+
+ ### Paper
+
+ Title: `Language Models are Few-Shot Learners`
+ Abstract: https://arxiv.org/abs/2005.14165
+
+ A small battery of 10 tests that involve asking language models a simple arithmetic
+ problem in natural language.
+
+ Homepage: https://github.com/openai/gpt-3/tree/master/data
+
+
+ ### Citation
+
+ ```
+ @inproceedings{NEURIPS2020_1457c0d6,
+     author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
+     booktitle = {Advances in Neural Information Processing Systems},
+     editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
+     pages = {1877--1901},
+     publisher = {Curran Associates, Inc.},
+     title = {Language Models are Few-Shot Learners},
+     url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
+     volume = {33},
+     year = {2020}
+ }
+ ```
+
+ ### Groups, Tags, and Tasks
+
+ #### Tags
+
+ * `arithmetic`: Evaluates `1dc` to `5ds`
+
+ #### Tasks
+
+ * `arithmetic_1dc`
+ * `arithmetic_2da`
+ * `arithmetic_2dm`
+ * `arithmetic_2ds`
+ * `arithmetic_3da`
+ * `arithmetic_3ds`
+ * `arithmetic_4da`
+ * `arithmetic_4ds`
+ * `arithmetic_5da`
+ * `arithmetic_5ds`
+
+ ### Checklist
+
+ For adding novel benchmarks/datasets to the library:
+ * [ ] Is the task an existing benchmark in the literature?
+   * [ ] Have you referenced the original paper that introduced the task?
+   * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+ If other tasks on this dataset are already supported:
+ * [ ] Is the "Main" variant of this task clearly denoted?
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_1dc.yaml ADDED
@@ -0,0 +1,18 @@
+ tag:
+   - arithmetic
+ task: arithmetic_1dc
+ dataset_path: EleutherAI/arithmetic
+ dataset_name: arithmetic_1dc
+ output_type: loglikelihood
+ validation_split: validation
+ test_split: null
+ doc_to_text: "{{context}}"
+ doc_to_target: "{{completion}}"
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 1.0
+ dataset_kwargs:
+   trust_remote_code: true
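
With `output_type: loglikelihood`, the config above maps each document to a (context, continuation) pair rather than a set of choices. A sketch of that mapping (the `context`/`completion` field names come from the `doc_to_text`/`doc_to_target` templates; the example values are invented, not taken from the dataset):

```python
# Hypothetical document shaped like the rows the templates reference.
doc = {"context": "Question: What is 3 times 14? Answer:", "completion": " 42"}

context = doc["context"]          # doc_to_text: "{{context}}"
continuation = doc["completion"]  # doc_to_target: "{{completion}}"

# The harness queries log P(continuation | context); for "acc" it checks
# whether the gold continuation is also the model's greedy continuation.
print(repr(context), repr(continuation))
```

The `_2da` through `_5ds` configs below reuse this file via `include` and only swap the task and dataset subset names.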
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_2da
+ dataset_name: arithmetic_2da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2dm.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_2dm
+ dataset_name: arithmetic_2dm
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_2ds
+ dataset_name: arithmetic_2ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_3da
+ dataset_name: arithmetic_3da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_3ds
+ dataset_name: arithmetic_3ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_4da
+ dataset_name: arithmetic_4da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_4ds
+ dataset_name: arithmetic_4ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_5da
+ dataset_name: arithmetic_5da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_5ds
+ dataset_name: arithmetic_5ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_default_ceval_yaml ADDED
@@ -0,0 +1,18 @@
+ dataset_path: ceval/ceval-exam
+ validation_split: val
+ fewshot_split: dev
+ fewshot_config:
+   sampler: first_n
+ output_type: multiple_choice
+ doc_to_text: "{{question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n答案:"
+ doc_to_choice: ["A", "B", "C", "D"]
+ doc_to_target: "{{['A', 'B', 'C', 'D'].index(answer)}}"
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+   - metric: acc_norm
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 2.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_generate_configs.py ADDED
@@ -0,0 +1,142 @@
+ """
+ Take in a YAML, and output all other splits with this YAML
+ """
+
+ import argparse
+ import os
+
+ import yaml
+ from tqdm import tqdm
+
+ from lm_eval.utils import eval_logger
+
+
+ SUBJECTS = {
+     "computer_network": "计算机网络",
+     "operating_system": "操作系统",
+     "computer_architecture": "计算机组成",
+     "college_programming": "大学编程",
+     "college_physics": "大学物理",
+     "college_chemistry": "大学化学",
+     "advanced_mathematics": "高等数学",
+     "probability_and_statistics": "概率统计",
+     "discrete_mathematics": "离散数学",
+     "electrical_engineer": "注册电气工程师",
+     "metrology_engineer": "注册计量师",
+     "high_school_mathematics": "高中数学",
+     "high_school_physics": "高中物理",
+     "high_school_chemistry": "高中化学",
+     "high_school_biology": "高中生物",
+     "middle_school_mathematics": "初中数学",
+     "middle_school_biology": "初中生物",
+     "middle_school_physics": "初中物理",
+     "middle_school_chemistry": "初中化学",
+     "veterinary_medicine": "兽医学",
+     "college_economics": "大学经济学",
+     "business_administration": "工商管理",
+     "marxism": "马克思主义基本原理",
+     "mao_zedong_thought": "毛泽东思想和中国特色社会主义理论体系概论",
+     "education_science": "教育学",
+     "teacher_qualification": "教师资格",
+     "high_school_politics": "高中政治",
+     "high_school_geography": "高中地理",
+     "middle_school_politics": "初中政治",
+     "middle_school_geography": "初中地理",
+     "modern_chinese_history": "近代史纲要",
+     "ideological_and_moral_cultivation": "思想道德修养与法律基础",
+     "logic": "逻辑学",
+     "law": "法学",
+     "chinese_language_and_literature": "中国语言文学",
+     "art_studies": "艺术学",
+     "professional_tour_guide": "导游资格",
+     "legal_professional": "法律职业资格",
+     "high_school_chinese": "高中语文",
+     "high_school_history": "高中历史",
+     "middle_school_history": "初中历史",
+     "civil_servant": "公务员",
+     "sports_science": "体育学",
+     "plant_protection": "植物保护",
+     "basic_medicine": "基础医学",
+     "clinical_medicine": "临床医学",
+     "urban_and_rural_planner": "注册城乡规划师",
+     "accountant": "注册会计师",
+     "fire_engineer": "注册消防工程师",
+     "environmental_impact_assessment_engineer": "环境影响评价工程师",
+     "tax_accountant": "税务师",
+     "physician": "医师资格",
+ }
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--base_yaml_path", required=True)
+     parser.add_argument("--save_prefix_path", default="ceval-valid")
+     parser.add_argument("--cot_prompt_path", default=None)
+     parser.add_argument("--task_prefix", default="")
+     return parser.parse_args()
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+
+     # get filename of base_yaml so we can `"include": ` it in our other YAMLs.
+     base_yaml_name = os.path.split(args.base_yaml_path)[-1]
+     with open(args.base_yaml_path, encoding="utf-8") as f:
+         base_yaml = yaml.full_load(f)
+
+     if args.cot_prompt_path is not None:
+         import json
+
+         with open(args.cot_prompt_path, encoding="utf-8") as f:
+             cot_file = json.load(f)
+
+     for subject_eng, subject_zh in tqdm(SUBJECTS.items()):
+         if args.cot_prompt_path is not None:
+             description = cot_file[subject_eng]
+         else:
+             description = (
+                 f"以下是中国关于{subject_zh}的单项选择题,请选出其中的正确答案。\n\n"
+             )
+
+         yaml_dict = {
+             "include": base_yaml_name,
+             "task": f"ceval-valid_{args.task_prefix}_{subject_eng}"
+             if args.task_prefix != ""
+             else f"ceval-valid_{subject_eng}",
+             "dataset_name": subject_eng,
+             "description": description,
+         }
+
+         file_save_path = args.save_prefix_path + f"_{subject_eng}.yaml"
+         eval_logger.info(f"Saving yaml for subset {subject_eng} to {file_save_path}")
+         with open(file_save_path, "w", encoding="utf-8") as yaml_file:
+             yaml.dump(
+                 yaml_dict,
+                 yaml_file,
+                 width=float("inf"),
+                 allow_unicode=True,
+                 default_style='"',
+             )
+
+     # write group config out
+
+     group_yaml_dict = {
+         "group": "ceval-valid",
+         "task": [f"ceval-valid_{task_name}" for task_name in SUBJECTS.keys()],
+         "aggregate_metric_list": [
+             {"metric": "acc", "aggregation": "mean", "weight_by_size": True},
+             {"metric": "acc_norm", "aggregation": "mean", "weight_by_size": True},
+         ],
+         "metadata": {"version": 1.0},
+     }
+
+     file_save_path = "_" + args.save_prefix_path + ".yaml"
+
+     with open(file_save_path, "w", encoding="utf-8") as group_yaml_file:
+         yaml.dump(
+             group_yaml_dict,
+             group_yaml_file,
+             width=float("inf"),
+             allow_unicode=True,
+             default_style='"',
+         )
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_basic_medicine.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "basic_medicine"
+ "description": "以下是中国关于基础医学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_basic_medicine"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_college_physics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "college_physics"
+ "description": "以下是中国关于大学物理的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_college_physics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_fire_engineer.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "fire_engineer"
+ "description": "以下是中国关于注册消防工程师的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_fire_engineer"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chemistry.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_chemistry"
+ "description": "以下是中国关于高中化学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_chemistry"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chinese.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_chinese"
+ "description": "以下是中国关于高中语文的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_chinese"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_mathematics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_mathematics"
+ "description": "以下是中国关于高中数学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_mathematics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_politics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_politics"
+ "description": "以下是中国关于高中政治的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_politics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_biology.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "middle_school_biology"
+ "description": "以下是中国关于初中生物的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_middle_school_biology"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_chemistry.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "middle_school_chemistry"
+ "description": "以下是中国关于初中化学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_middle_school_chemistry"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_physician.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "physician"
+ "description": "以下是中国关于医师资格的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_physician"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_professional_tour_guide.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "professional_tour_guide"
+ "description": "以下是中国关于导游资格的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_professional_tour_guide"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_sports_science.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "sports_science"
+ "description": "以下是中国关于体育学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_sports_science"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_teacher_qualification.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "teacher_qualification"
+ "description": "以下是中国关于教师资格的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_teacher_qualification"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/README.md ADDED
@@ -0,0 +1,60 @@
+ # CommonsenseQA
+
+ ### Paper
+
+ Title: `COMMONSENSEQA: A Question Answering Challenge Targeting
+ Commonsense Knowledge`
+
+ Abstract: https://arxiv.org/pdf/1811.00937.pdf
+
+ CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
+ It contains 12,102 questions with one correct answer and four distractor answers.
+
+ Homepage: https://www.tau-nlp.org/commonsenseqa
+
+
+ ### Citation
+
+ ```
+ @inproceedings{talmor-etal-2019-commonsenseqa,
+     title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
+     author = "Talmor, Alon and
+       Herzig, Jonathan and
+       Lourie, Nicholas and
+       Berant, Jonathan",
+     booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
+     month = jun,
+     year = "2019",
+     address = "Minneapolis, Minnesota",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/N19-1421",
+     doi = "10.18653/v1/N19-1421",
+     pages = "4149--4158",
+     archivePrefix = "arXiv",
+     eprint = "1811.00937",
+     primaryClass = "cs",
+ }
+ ```
+
+ ### Groups and Tasks
+
+ #### Groups
+
+ * Not part of a group yet.
+
+ #### Tasks
+
+ * `commonsense_qa`: Represents the "random" split from the paper. Uses an MMLU-style prompt, as (presumably) used by Llama evaluations.
+
+ ### Checklist
+
+ For adding novel benchmarks/datasets to the library:
+ * [x] Is the task an existing benchmark in the literature?
+ * [x] Have you referenced the original paper that introduced the task?
+ * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+ If other tasks on this dataset are already supported:
+ * [ ] Is the "Main" variant of this task clearly denoted?
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/default.yaml ADDED
@@ -0,0 +1,12 @@
+ task: commonsense_qa
+ dataset_path: tau/commonsense_qa
+ training_split: train
+ validation_split: validation
+ output_type: multiple_choice
+ doc_to_text: "Question: {{ question.strip() }}\nA. {{choices['text'][0]}}\nB. {{choices['text'][1]}}\nC. {{choices['text'][2]}}\nD. {{choices['text'][3]}}\nE. {{choices['text'][4]}}\nAnswer:"
+ doc_to_target: answerKey
+ doc_to_choice: ['A', 'B', 'C', 'D', 'E']
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
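The `doc_to_text` key above renders each record into an MMLU-style prompt. A plain-Python sketch of that rendering (the harness actually evaluates it as a Jinja template; the sample record below is invented for illustration, not drawn from the dataset):

```python
def doc_to_text(doc):
    # Mirrors the doc_to_text template in default.yaml: the stripped question,
    # five lettered options from choices['text'], then an "Answer:" cue.
    c = doc["choices"]["text"]
    return (
        f"Question: {doc['question'].strip()}\n"
        f"A. {c[0]}\nB. {c[1]}\nC. {c[2]}\nD. {c[3]}\nE. {c[4]}\nAnswer:"
    )


# Illustrative record following the tau/commonsense_qa schema (invented example).
doc = {
    "question": "Where would you put a clean plate? ",
    "choices": {"text": ["cupboard", "oven", "floor", "sink", "garden"]},
    "answerKey": "A",
}
prompt = doc_to_text(doc)
```

Scoring then compares the likelihood of each letter in `doc_to_choice` after this prompt, with `answerKey` as the target.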
scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/README.md ADDED
@@ -0,0 +1,47 @@
+ # COPAL
+
+ ### Paper
+
+ Title: `COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances`
+
+ Abstract: `https://arxiv.org/abs/2311.01012`
+
+ `COPAL-ID is an Indonesian causal commonsense reasoning dataset that captures local nuances. It provides a more natural portrayal of day-to-day causal reasoning within the Indonesian (especially Jakartan) cultural sphere. Professionally written and validated from scratch by natives, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID.`
+
+ Homepage: `https://github.com/haryoa/copal-id`
+
+
+ ### Citation
+
+ ```
+ @article{wibowo2023copal,
+   title={COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances},
+   author={Wibowo, Haryo Akbarianto and Fuadi, Erland Hilman and Nityasya, Made Nindyatama and Prasojo, Radityo Eko and Aji, Alham Fikri},
+   journal={arXiv preprint arXiv:2311.01012},
+   year={2023}
+ }
+ ```
+
+ ### Groups and Tasks
+
+ #### Groups
+
+ * `copal_id`
+
+ #### Tasks
+
+ * `copal_id_standard`: `Standard version of the COPAL dataset; uses formal language and fewer local nuances`
+ * `copal_id_colloquial`: `Colloquial version of the COPAL dataset; uses informal language and more local nuances`
+
+ ### Checklist
+
+ For adding novel benchmarks/datasets to the library:
+ * [x] Is the task an existing benchmark in the literature?
+ * [x] Have you referenced the original paper that introduced the task?
+ * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+ If other tasks on this dataset are already supported:
+ * [ ] Is the "Main" variant of this task clearly denoted?
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/colloquial.yaml ADDED
@@ -0,0 +1,4 @@
+ include: standard.yaml
+ task: copal_id_colloquial
+ task_alias: colloquial
+ test_split: test_colloquial
scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/standard.yaml ADDED
@@ -0,0 +1,14 @@
+ tag: copal_id
+ task: copal_id_standard
+ task_alias: standard
+ dataset_path: haryoaw/COPAL
+ dataset_name: id
+ output_type: multiple_choice
+ test_split: test
+ doc_to_text: !function utils.doc_to_text_id
+ doc_to_target: label
+ doc_to_choice: !function utils.doc_to_choice
+ metric_list:
+   - metric: acc
+ metadata:
+   version: 1.0
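`standard.yaml` delegates prompt and choice construction to `utils.doc_to_text_id` and `utils.doc_to_choice` via the `!function` tag, but the referenced `utils.py` is not part of this diff. A hypothetical sketch of what such helpers typically look like for COPA-style records (fields `premise`, `choice1`, `choice2`, and a `question` field that is either "cause" or "effect"); the Indonesian connectives below are assumptions, not copied from the actual file:

```python
# Hypothetical reconstruction of the helpers referenced in standard.yaml.
# Field names follow the XCOPA/COPA convention; the connectives are assumed.
CONNECTOR = {"cause": "karena", "effect": "maka"}  # "because" / "therefore"


def doc_to_text_id(doc):
    # Premise with its trailing period dropped, joined to the cause/effect
    # connective so each choice can complete the sentence.
    return doc["premise"].strip().rstrip(".") + " " + CONNECTOR[doc["question"]]


def doc_to_choice(doc):
    # Both candidate continuations, lowercased at the first character so they
    # read as mid-sentence completions of the prompt.
    return [c[0].lower() + c[1:] for c in (doc["choice1"], doc["choice2"])]
```

Under this scheme the `label` field (0 or 1) in `doc_to_target` indexes into the two-element choice list.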