koichi12 committed on
Commit 181156e · verified · 1 Parent(s): f69a342

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/README.md +50 -0
  2. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_aclue.yaml +26 -0
  3. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_default_template_yaml +18 -0
  4. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_generate_configs.py +82 -0
  5. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_chinese_culture.yaml +4 -0
  6. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_literature.yaml +4 -0
  7. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_medical.yaml +4 -0
  8. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_phonetics.yaml +4 -0
  9. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_basic_ancient_chinese.yaml +4 -0
  10. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_couplet_prediction.yaml +4 -0
  11. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_homographic_character_resolution.yaml +4 -0
  12. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_named_entity_recognition.yaml +4 -0
  13. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_appreciate.yaml +4 -0
  14. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_context_prediction.yaml +4 -0
  15. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_quality_assessment.yaml +4 -0
  16. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_sentiment_analysis.yaml +4 -0
  17. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_polysemy_resolution.yaml +4 -0
  18. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_reading_comprehension.yaml +4 -0
  19. scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_sentence_segmentation.yaml +4 -0
  20. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/README.md +60 -0
  21. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_1dc.yaml +18 -0
  22. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2da.yaml +5 -0
  23. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2dm.yaml +5 -0
  24. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2ds.yaml +5 -0
  25. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3da.yaml +5 -0
  26. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3ds.yaml +5 -0
  27. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4da.yaml +5 -0
  28. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4ds.yaml +5 -0
  29. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5da.yaml +5 -0
  30. scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5ds.yaml +5 -0
  31. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_default_ceval_yaml +18 -0
  32. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_generate_configs.py +142 -0
  33. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_basic_medicine.yaml +4 -0
  34. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_college_physics.yaml +4 -0
  35. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_fire_engineer.yaml +4 -0
  36. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chemistry.yaml +4 -0
  37. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chinese.yaml +4 -0
  38. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_mathematics.yaml +4 -0
  39. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_politics.yaml +4 -0
  40. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_biology.yaml +4 -0
  41. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_chemistry.yaml +4 -0
  42. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_physician.yaml +4 -0
  43. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_professional_tour_guide.yaml +4 -0
  44. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_sports_science.yaml +4 -0
  45. scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_teacher_qualification.yaml +4 -0
  46. scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/README.md +60 -0
  47. scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/default.yaml +12 -0
  48. scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/README.md +47 -0
  49. scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/colloquial.yaml +4 -0
  50. scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/standard.yaml +14 -0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/README.md ADDED
@@ -0,0 +1,50 @@
+ # ACLUE
+
+ ### Paper
+
+ Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE
+ https://arxiv.org/abs/2310.09550
+
+ The Ancient Chinese Language Understanding Evaluation (ACLUE) is an evaluation benchmark focused on ancient Chinese language comprehension. It aims to assess the performance of large-scale language models on understanding ancient Chinese. The benchmark comprises 15 tasks spanning various domains, including lexical, syntactic, semantic, inference, and knowledge. ACLUE's tasks are derived from a combination of manually curated questions from publicly available resources, and automatically
+ generated questions from classical Chinese language corpora. The questions span from the Xia dynasty (2070 BCE) to the Ming dynasty (1368 CE). ACLUE adopts a multiple-choice question format for all tasks.
+
+ Homepage: https://github.com/isen-zhang/ACLUE
+
+ ### Citation
+
+ ```bibtex
+ @inproceedings{zhang-li-2023-large,
+     title = "Can Large Language Model Comprehend {A}ncient {C}hinese? A Preliminary Test on {ACLUE}",
+     author = "Zhang, Yixuan and Li, Haonan",
+     booktitle = "Proceedings of the Ancient Language Processing Workshop",
+     month = sep,
+     year = "2023",
+     address = "Varna, Bulgaria",
+     publisher = "INCOMA Ltd., Shoumen, Bulgaria",
+     url = "https://aclanthology.org/2023.alp-1.9",
+     pages = "80--87"
+ }
+ ```
+
+ ### Groups, Tags, and Tasks
+
+ #### Groups
+
+ - `aclue`: All 15 subjects of the ACLUE dataset, evaluated following the methodology in CMMLU's original implementation.
+
+ #### Tasks
+
+ The following tasks evaluate subjects in the ACLUE dataset using loglikelihood-based multiple-choice scoring:
+ - `aclue_{subject_english}`
+
+ ### Checklist
+
+ * [x] Is the task an existing benchmark in the literature?
+   * [x] Have you referenced the original paper that introduced the task?
+   * [x] If yes, does the original paper provide a reference implementation?
+     * [x] Yes, original implementation contributed by author of the benchmark
+
+ If other tasks on this dataset are already supported:
+ * [x] Is the "Main" variant of this task clearly denoted?
+ * [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [x] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_aclue.yaml ADDED
@@ -0,0 +1,26 @@
+ group: aclue
+ task:
+   - aclue_ancient_chinese_culture
+   - aclue_ancient_literature
+   - aclue_ancient_medical
+   - aclue_ancient_phonetics
+   - aclue_basic_ancient_chinese
+   - aclue_couplet_prediction
+   - aclue_homographic_character_resolution
+   - aclue_named_entity_recognition
+   - aclue_poetry_appreciate
+   - aclue_poetry_context_prediction
+   - aclue_poetry_quality_assessment
+   - aclue_poetry_sentiment_analysis
+   - aclue_polysemy_resolution
+   - aclue_reading_comprehension
+   - aclue_sentence_segmentation
+ aggregate_metric_list:
+   - metric: acc
+     aggregation: mean
+     weight_by_size: true
+   - metric: acc_norm
+     aggregation: mean
+     weight_by_size: true
+ metadata:
+   version: 1.0
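
The `weight_by_size: true` entries in `_aclue.yaml` make the group score a size-weighted mean of the per-subtask scores. A minimal sketch of that aggregation, assuming those semantics (the task names, accuracies, and sizes below are made up):

```python
# Size-weighted mean, as suggested by "aggregation: mean" together with
# "weight_by_size: true" in _aclue.yaml. All numbers are hypothetical.
task_results = [
    {"task": "aclue_couplet_prediction", "acc": 0.50, "size": 100},
    {"task": "aclue_ancient_literature", "acc": 0.80, "size": 300},
]

total_size = sum(r["size"] for r in task_results)
group_acc = sum(r["acc"] * r["size"] for r in task_results) / total_size
print(group_acc)  # 0.725
```

With `weight_by_size: false` the group score would instead be the unweighted mean (0.65 here), so large subtasks would no longer dominate the aggregate.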
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_default_template_yaml ADDED
@@ -0,0 +1,18 @@
+ dataset_path: tyouisen/aclue
+ test_split: test
+ fewshot_split: dev
+ fewshot_config:
+   sampler: first_n
+ output_type: multiple_choice
+ doc_to_text: "{{Question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n答案:"
+ doc_to_choice: ["A", "B", "C", "D"]
+ doc_to_target: "{{['A', 'B', 'C', 'D'].index(Answer)}}"
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+   - metric: acc_norm
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 1.0
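
The `doc_to_text`/`doc_to_target` templates above turn a dataset row into a prompt and a gold choice index. A pure-Python sketch of what they produce (this is not the harness's Jinja renderer; the example row is hypothetical, using the `Question`/`A`-`D`/`Answer` fields the template assumes):

```python
# Hypothetical ACLUE row with the fields referenced by the template.
doc = {
    "Question": " 下列选项中属于通假字的是? ",
    "A": "甲", "B": "乙", "C": "丙", "D": "丁",
    "Answer": "B",
}

# doc_to_text: "{{Question.strip()}}\nA. {{A}}\n...\n答案:"
prompt = (
    f"{doc['Question'].strip()}\n"
    f"A. {doc['A']}\nB. {doc['B']}\nC. {doc['C']}\nD. {doc['D']}\n答案:"
)

# doc_to_target: "{{['A', 'B', 'C', 'D'].index(Answer)}}"
target = ["A", "B", "C", "D"].index(doc["Answer"])

print(prompt)
print(target)  # 1
```

The harness then scores the loglikelihood of each entry in `doc_to_choice` after the prompt and marks the example correct when the `target` index wins.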
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/_generate_configs.py ADDED
@@ -0,0 +1,82 @@
+ """
+ Take in a YAML, and output all other splits with this YAML
+ """
+
+ import argparse
+ import os
+
+ import yaml
+ from tqdm import tqdm
+
+ from lm_eval.utils import eval_logger
+
+
+ SUBJECTS = {
+     "古文单字多义": "polysemy_resolution",
+     "诗词情感分类": "poetry_sentiment_analysis",
+     "古汉语命名体识别": "named_entity_recognition",
+     "古汉语知识": "basic_ancient_chinese",
+     "古诗词上下句预测": "poetry_context_prediction",
+     "古文断句": "sentence_segmentation",
+     "对联": "couplet_prediction",
+     "古诗词曲鉴赏": "poetry_appreciate",
+     "国学常识": "ancient_chinese_culture",
+     "古音学": "ancient_phonetics",
+     "通假字": "homographic_character_resolution",
+     "古代文学知识": "ancient_literature",
+     "医古文": "ancient_medical",
+     "古诗词质量评估": "poetry_quality_assessment",
+     "古文阅读理解": "reading_comprehension",
+ }
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--base_yaml_path", required=True)
+     parser.add_argument("--save_prefix_path", default="aclue")
+     parser.add_argument("--cot_prompt_path", default=None)
+     parser.add_argument("--task_prefix", default="")
+     return parser.parse_args()
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+
+     # get filename of base_yaml so we can `"include": ` it in our other YAMLs.
+     base_yaml_name = os.path.split(args.base_yaml_path)[-1]
+     with open(args.base_yaml_path, encoding="utf-8") as f:
+         base_yaml = yaml.full_load(f)
+
+     if args.cot_prompt_path is not None:
+         import json
+
+         with open(args.cot_prompt_path, encoding="utf-8") as f:
+             cot_file = json.load(f)
+
+     for subject_zh, subject_eng in tqdm(SUBJECTS.items()):
+         if args.cot_prompt_path is not None:
+             description = cot_file[subject_eng]
+         else:
+             description = (
+                 f"以下是关于{subject_zh}的单项选择题,请直接给出正确答案的选项。\n\n"
+             )
+
+         yaml_dict = {
+             "include": base_yaml_name,
+             "task": f"aclue_{args.task_prefix}_{subject_eng}"
+             if args.task_prefix != ""
+             else f"aclue_{subject_eng}",
+             "dataset_name": subject_eng,
+             "description": description,
+         }
+
+         file_save_path = args.save_prefix_path + f"_{subject_eng}.yaml"
+         eval_logger.info(f"Saving yaml for subset {subject_eng} to {file_save_path}")
+         with open(file_save_path, "w", encoding="utf-8") as yaml_file:
+             yaml.dump(
+                 yaml_dict,
+                 yaml_file,
+                 width=float("inf"),
+                 allow_unicode=True,
+                 default_style='"',
+             )
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_chinese_culture.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_chinese_culture"
+ "description": "以下是关于国学常识的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_chinese_culture"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_literature.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_literature"
+ "description": "以下是关于古代文学知识的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_literature"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_medical.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_medical"
+ "description": "以下是关于医古文的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_medical"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_ancient_phonetics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "ancient_phonetics"
+ "description": "以下是关于古音学的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_ancient_phonetics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_basic_ancient_chinese.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "basic_ancient_chinese"
+ "description": "以下是关于古汉语知识的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_basic_ancient_chinese"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_couplet_prediction.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "couplet_prediction"
+ "description": "以下是关于对联的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_couplet_prediction"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_homographic_character_resolution.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "homographic_character_resolution"
+ "description": "以下是关于通假字的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_homographic_character_resolution"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_named_entity_recognition.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "named_entity_recognition"
+ "description": "以下是关于古汉语命名体识别的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_named_entity_recognition"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_appreciate.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_appreciate"
+ "description": "以下是关于古诗词曲鉴赏的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_appreciate"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_context_prediction.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_context_prediction"
+ "description": "以下是关于古诗词上下句预测的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_context_prediction"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_quality_assessment.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_quality_assessment"
+ "description": "以下是关于古诗词质量评估的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_quality_assessment"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_poetry_sentiment_analysis.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "poetry_sentiment_analysis"
+ "description": "以下是关于诗词情感分类的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_poetry_sentiment_analysis"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_polysemy_resolution.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "polysemy_resolution"
+ "description": "以下是关于古文单字多义的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_polysemy_resolution"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_reading_comprehension.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "reading_comprehension"
+ "description": "以下是关于古文阅读理解的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_reading_comprehension"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/aclue/aclue_sentence_segmentation.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "sentence_segmentation"
+ "description": "以下是关于古文断句的单项选择题,请直接给出正确答案的选项。\n\n"
+ "include": "_default_template_yaml"
+ "task": "aclue_sentence_segmentation"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/README.md ADDED
@@ -0,0 +1,60 @@
+ # Arithmetic
+
+ ### Paper
+
+ Title: `Language Models are Few-Shot Learners`
+ Abstract: https://arxiv.org/abs/2005.14165
+
+ A small battery of 10 tests that involve asking language models a simple arithmetic
+ problem in natural language.
+
+ Homepage: https://github.com/openai/gpt-3/tree/master/data
+
+
+ ### Citation
+
+ ```
+ @inproceedings{NEURIPS2020_1457c0d6,
+     author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winter, Clemens and Hesse, Chris and Chen, Mark and Sigler, Eric and Litwin, Mateusz and Gray, Scott and Chess, Benjamin and Clark, Jack and Berner, Christopher and McCandlish, Sam and Radford, Alec and Sutskever, Ilya and Amodei, Dario},
+     booktitle = {Advances in Neural Information Processing Systems},
+     editor = {H. Larochelle and M. Ranzato and R. Hadsell and M. F. Balcan and H. Lin},
+     pages = {1877--1901},
+     publisher = {Curran Associates, Inc.},
+     title = {Language Models are Few-Shot Learners},
+     url = {https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf},
+     volume = {33},
+     year = {2020}
+ }
+ ```
+
+ ### Groups, Tags, and Tasks
+
+ #### Tags
+
+ * `arithmetic`: Evaluates `1dc` to `5ds`
+
+ #### Tasks
+
+ * `arithmetic_1dc`
+ * `arithmetic_2da`
+ * `arithmetic_2dm`
+ * `arithmetic_2ds`
+ * `arithmetic_3da`
+ * `arithmetic_3ds`
+ * `arithmetic_4da`
+ * `arithmetic_4ds`
+ * `arithmetic_5da`
+ * `arithmetic_5ds`
+
+ ### Checklist
+
+ For adding novel benchmarks/datasets to the library:
+ * [ ] Is the task an existing benchmark in the literature?
+   * [ ] Have you referenced the original paper that introduced the task?
+   * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+ If other tasks on this dataset are already supported:
+ * [ ] Is the "Main" variant of this task clearly denoted?
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_1dc.yaml ADDED
@@ -0,0 +1,18 @@
+ tag:
+   - arithmetic
+ task: arithmetic_1dc
+ dataset_path: EleutherAI/arithmetic
+ dataset_name: arithmetic_1dc
+ output_type: loglikelihood
+ validation_split: validation
+ test_split: null
+ doc_to_text: "{{context}}"
+ doc_to_target: "{{completion}}"
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 1.0
+ dataset_kwargs:
+   trust_remote_code: true
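
With `output_type: loglikelihood`, the config above maps each document to a (context, continuation) pair rather than a set of choices. A sketch of that mapping (the `context`/`completion` field names come from the `doc_to_text`/`doc_to_target` templates; the example values are invented, not taken from the dataset):

```python
# Hypothetical document shaped like the rows the templates reference.
doc = {"context": "Question: What is 3 times 14? Answer:", "completion": " 42"}

context = doc["context"]          # doc_to_text: "{{context}}"
continuation = doc["completion"]  # doc_to_target: "{{completion}}"

# The harness queries log P(continuation | context); for "acc" it checks
# whether the gold continuation is also the model's greedy continuation.
print(repr(context), repr(continuation))
```

The `_2da` through `_5ds` configs below reuse this file via `include` and only swap the task and dataset subset names.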
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_2da
+ dataset_name: arithmetic_2da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2dm.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_2dm
+ dataset_name: arithmetic_2dm
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_2ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_2ds
+ dataset_name: arithmetic_2ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_3da
+ dataset_name: arithmetic_3da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_3ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_3ds
+ dataset_name: arithmetic_3ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_4da
+ dataset_name: arithmetic_4da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_4ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_4ds
+ dataset_name: arithmetic_4ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5da.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_5da
+ dataset_name: arithmetic_5da
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/arithmetic/arithmetic_5ds.yaml ADDED
@@ -0,0 +1,5 @@
+ include: arithmetic_1dc.yaml
+ task: arithmetic_5ds
+ dataset_name: arithmetic_5ds
+ dataset_kwargs:
+   trust_remote_code: true
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_default_ceval_yaml ADDED
@@ -0,0 +1,18 @@
+ dataset_path: ceval/ceval-exam
+ validation_split: val
+ fewshot_split: dev
+ fewshot_config:
+   sampler: first_n
+ output_type: multiple_choice
+ doc_to_text: "{{question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\n答案:"
+ doc_to_choice: ["A", "B", "C", "D"]
+ doc_to_target: "{{['A', 'B', 'C', 'D'].index(answer)}}"
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
+   - metric: acc_norm
+     aggregation: mean
+     higher_is_better: true
+ metadata:
+   version: 2.0
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/_generate_configs.py ADDED
@@ -0,0 +1,142 @@
+ """
+ Take in a YAML, and output all other splits with this YAML
+ """
+
+ import argparse
+ import os
+
+ import yaml
+ from tqdm import tqdm
+
+ from lm_eval.utils import eval_logger
+
+
+ SUBJECTS = {
+     "computer_network": "计算机网络",
+     "operating_system": "操作系统",
+     "computer_architecture": "计算机组成",
+     "college_programming": "大学编程",
+     "college_physics": "大学物理",
+     "college_chemistry": "大学化学",
+     "advanced_mathematics": "高等数学",
+     "probability_and_statistics": "概率统计",
+     "discrete_mathematics": "离散数学",
+     "electrical_engineer": "注册电气工程师",
+     "metrology_engineer": "注册计量师",
+     "high_school_mathematics": "高中数学",
+     "high_school_physics": "高中物理",
+     "high_school_chemistry": "高中化学",
+     "high_school_biology": "高中生物",
+     "middle_school_mathematics": "初中数学",
+     "middle_school_biology": "初中生物",
+     "middle_school_physics": "初中物理",
+     "middle_school_chemistry": "初中化学",
+     "veterinary_medicine": "兽医学",
+     "college_economics": "大学经济学",
+     "business_administration": "工商管理",
+     "marxism": "马克思主义基本原理",
+     "mao_zedong_thought": "毛泽东思想和中国特色社会主义理论体系概论",
+     "education_science": "教育学",
+     "teacher_qualification": "教师资格",
+     "high_school_politics": "高中政治",
+     "high_school_geography": "高中地理",
+     "middle_school_politics": "初中政治",
+     "middle_school_geography": "初中地理",
+     "modern_chinese_history": "近代史纲要",
+     "ideological_and_moral_cultivation": "思想道德修养与法律基础",
+     "logic": "逻辑学",
+     "law": "法学",
+     "chinese_language_and_literature": "中国语言文学",
+     "art_studies": "艺术学",
+     "professional_tour_guide": "导游资格",
+     "legal_professional": "法律职业资格",
+     "high_school_chinese": "高中语文",
+     "high_school_history": "高中历史",
+     "middle_school_history": "初中历史",
+     "civil_servant": "公务员",
+     "sports_science": "体育学",
+     "plant_protection": "植物保护",
+     "basic_medicine": "基础医学",
+     "clinical_medicine": "临床医学",
+     "urban_and_rural_planner": "注册城乡规划师",
+     "accountant": "注册会计师",
+     "fire_engineer": "注册消防工程师",
+     "environmental_impact_assessment_engineer": "环境影响评价工程师",
+     "tax_accountant": "税务师",
+     "physician": "医师资格",
+ }
+
+
+ def parse_args():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--base_yaml_path", required=True)
+     parser.add_argument("--save_prefix_path", default="ceval-valid")
+     parser.add_argument("--cot_prompt_path", default=None)
+     parser.add_argument("--task_prefix", default="")
+     return parser.parse_args()
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+
+     # get filename of base_yaml so we can `"include": ` it in our other YAMLs.
+     base_yaml_name = os.path.split(args.base_yaml_path)[-1]
+     with open(args.base_yaml_path, encoding="utf-8") as f:
+         base_yaml = yaml.full_load(f)
+
+     if args.cot_prompt_path is not None:
+         import json
+
+         with open(args.cot_prompt_path, encoding="utf-8") as f:
+             cot_file = json.load(f)
+
+     for subject_eng, subject_zh in tqdm(SUBJECTS.items()):
+         if args.cot_prompt_path is not None:
+             description = cot_file[subject_eng]
+         else:
+             description = (
+                 f"以下是中国关于{subject_zh}的单项选择题,请选出其中的正确答案。\n\n"
+             )
+
+         yaml_dict = {
+             "include": base_yaml_name,
+             "task": f"ceval-valid_{args.task_prefix}_{subject_eng}"
+             if args.task_prefix != ""
+             else f"ceval-valid_{subject_eng}",
+             "dataset_name": subject_eng,
+             "description": description,
+         }
+
+         file_save_path = args.save_prefix_path + f"_{subject_eng}.yaml"
+         eval_logger.info(f"Saving yaml for subset {subject_eng} to {file_save_path}")
+         with open(file_save_path, "w", encoding="utf-8") as yaml_file:
+             yaml.dump(
+                 yaml_dict,
+                 yaml_file,
+                 width=float("inf"),
+                 allow_unicode=True,
+                 default_style='"',
+             )
+
+     # write group config out
+
+     group_yaml_dict = {
+         "group": "ceval-valid",
+         "task": [f"ceval-valid_{task_name}" for task_name in SUBJECTS.keys()],
+         "aggregate_metric_list": [
+             {"metric": "acc", "aggregation": "mean", "weight_by_size": True},
+             {"metric": "acc_norm", "aggregation": "mean", "weight_by_size": True},
+         ],
+         "metadata": {"version": 1.0},
+     }
+
+     file_save_path = "_" + args.save_prefix_path + ".yaml"
+
+     with open(file_save_path, "w", encoding="utf-8") as group_yaml_file:
+         yaml.dump(
+             group_yaml_dict,
+             group_yaml_file,
+             width=float("inf"),
+             allow_unicode=True,
+             default_style='"',
+         )
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_basic_medicine.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "basic_medicine"
+ "description": "以下是中国关于基础医学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_basic_medicine"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_college_physics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "college_physics"
+ "description": "以下是中国关于大学物理的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_college_physics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_fire_engineer.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "fire_engineer"
+ "description": "以下是中国关于注册消防工程师的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_fire_engineer"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chemistry.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_chemistry"
+ "description": "以下是中国关于高中化学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_chemistry"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_chinese.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_chinese"
+ "description": "以下是中国关于高中语文的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_chinese"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_mathematics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_mathematics"
+ "description": "以下是中国关于高中数学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_mathematics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_high_school_politics.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "high_school_politics"
+ "description": "以下是中国关于高中政治的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_high_school_politics"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_biology.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "middle_school_biology"
+ "description": "以下是中国关于初中生物的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_middle_school_biology"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_middle_school_chemistry.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "middle_school_chemistry"
+ "description": "以下是中国关于初中化学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_middle_school_chemistry"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_physician.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "physician"
+ "description": "以下是中国关于医师资格的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_physician"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_professional_tour_guide.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "professional_tour_guide"
+ "description": "以下是中国关于导游资格的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_professional_tour_guide"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_sports_science.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "sports_science"
+ "description": "以下是中国关于体育学的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_sports_science"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/ceval/ceval-valid_teacher_qualification.yaml ADDED
@@ -0,0 +1,4 @@
+ "dataset_name": "teacher_qualification"
+ "description": "以下是中国关于教师资格的单项选择题,请选出其中的正确答案。\n\n"
+ "include": "_default_ceval_yaml"
+ "task": "ceval-valid_teacher_qualification"
scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/README.md ADDED
@@ -0,0 +1,60 @@
+ # CommonsenseQA
+
+ ### Paper
+
+ Title: `COMMONSENSEQA: A Question Answering Challenge Targeting
+ Commonsense Knowledge`
+
+ Abstract: https://arxiv.org/pdf/1811.00937.pdf
+
+ CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers.
+ It contains 12,102 questions with one correct answer and four distractor answers.
+
+ Homepage: https://www.tau-nlp.org/commonsenseqa
+
+
+ ### Citation
+
+ ```
+ @inproceedings{talmor-etal-2019-commonsenseqa,
+     title = "{C}ommonsense{QA}: A Question Answering Challenge Targeting Commonsense Knowledge",
+     author = "Talmor, Alon and
+       Herzig, Jonathan and
+       Lourie, Nicholas and
+       Berant, Jonathan",
+     booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
+     month = jun,
+     year = "2019",
+     address = "Minneapolis, Minnesota",
+     publisher = "Association for Computational Linguistics",
+     url = "https://aclanthology.org/N19-1421",
+     doi = "10.18653/v1/N19-1421",
+     pages = "4149--4158",
+     archivePrefix = "arXiv",
+     eprint = "1811.00937",
+     primaryClass = "cs",
+ }
+ ```
+
+ ### Groups and Tasks
+
+ #### Groups
+
+ * Not part of a group yet.
+
+ #### Tasks
+
+ * `commonsense_qa`: Represents the "random" split from the paper. Uses an MMLU-style prompt, as (presumably) used by Llama evaluations.
+
+ ### Checklist
+
+ For adding novel benchmarks/datasets to the library:
+ * [x] Is the task an existing benchmark in the literature?
+ * [x] Have you referenced the original paper that introduced the task?
+ * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+ If other tasks on this dataset are already supported:
+ * [ ] Is the "Main" variant of this task clearly denoted?
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/commonsense_qa/default.yaml ADDED
@@ -0,0 +1,12 @@
+ task: commonsense_qa
+ dataset_path: tau/commonsense_qa
+ training_split: train
+ validation_split: validation
+ output_type: multiple_choice
+ doc_to_text: "Question: {{ question.strip() }}\nA. {{choices['text'][0]}}\nB. {{choices['text'][1]}}\nC. {{choices['text'][2]}}\nD. {{choices['text'][3]}}\nE. {{choices['text'][4]}}\nAnswer:"
+ doc_to_target: answerKey
+ doc_to_choice: ['A', 'B', 'C', 'D', 'E']
+ metric_list:
+   - metric: acc
+     aggregation: mean
+     higher_is_better: true
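The `doc_to_text` key above renders each record into an MMLU-style prompt. A plain-Python sketch of that rendering (the harness actually evaluates it as a Jinja template; the sample record below is invented for illustration, not drawn from the dataset):

```python
def doc_to_text(doc):
    # Mirrors the doc_to_text template in default.yaml: the stripped question,
    # five lettered options from choices['text'], then an "Answer:" cue.
    c = doc["choices"]["text"]
    return (
        f"Question: {doc['question'].strip()}\n"
        f"A. {c[0]}\nB. {c[1]}\nC. {c[2]}\nD. {c[3]}\nE. {c[4]}\nAnswer:"
    )


# Illustrative record following the tau/commonsense_qa schema (invented example).
doc = {
    "question": "Where would you put a clean plate? ",
    "choices": {"text": ["cupboard", "oven", "floor", "sink", "garden"]},
    "answerKey": "A",
}
prompt = doc_to_text(doc)
```

Scoring then compares the likelihood of each letter in `doc_to_choice` after this prompt, with `answerKey` as the target.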
scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/README.md ADDED
@@ -0,0 +1,47 @@
+ # COPAL
+
+ ### Paper
+
+ Title: `COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances`
+
+ Abstract: `https://arxiv.org/abs/2311.01012`
+
+ `COPAL-ID is an Indonesian causal commonsense reasoning dataset that captures local nuances. It provides a more natural portrayal of day-to-day causal reasoning within the Indonesian (especially Jakartan) cultural sphere. Professionally written and validated from scratch by natives, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID.`
+
+ Homepage: `https://github.com/haryoa/copal-id`
+
+
+ ### Citation
+
+ ```
+ @article{wibowo2023copal,
+   title={COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances},
+   author={Wibowo, Haryo Akbarianto and Fuadi, Erland Hilman and Nityasya, Made Nindyatama and Prasojo, Radityo Eko and Aji, Alham Fikri},
+   journal={arXiv preprint arXiv:2311.01012},
+   year={2023}
+ }
+ ```
+
+ ### Groups and Tasks
+
+ #### Groups
+
+ * `copal_id`
+
+ #### Tasks
+
+ * `copal_id_standard`: `Standard version of the COPAL dataset; uses formal language and fewer local nuances`
+ * `copal_id_colloquial`: `Colloquial version of the COPAL dataset; uses informal language and more local nuances`
+
+ ### Checklist
+
+ For adding novel benchmarks/datasets to the library:
+ * [x] Is the task an existing benchmark in the literature?
+ * [x] Have you referenced the original paper that introduced the task?
+ * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+ If other tasks on this dataset are already supported:
+ * [ ] Is the "Main" variant of this task clearly denoted?
+ * [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+ * [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/colloquial.yaml ADDED
@@ -0,0 +1,4 @@
+ include: standard.yaml
+ task: copal_id_colloquial
+ task_alias: colloquial
+ test_split: test_colloquial
scripts/yans/lm-evaluation-harness/lm_eval/tasks/copal_id/standard.yaml ADDED
@@ -0,0 +1,14 @@
+ tag: copal_id
+ task: copal_id_standard
+ task_alias: standard
+ dataset_path: haryoaw/COPAL
+ dataset_name: id
+ output_type: multiple_choice
+ test_split: test
+ doc_to_text: !function utils.doc_to_text_id
+ doc_to_target: label
+ doc_to_choice: !function utils.doc_to_choice
+ metric_list:
+   - metric: acc
+ metadata:
+   version: 1.0
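`standard.yaml` delegates prompt and choice construction to `utils.doc_to_text_id` and `utils.doc_to_choice` via the `!function` tag, but the referenced `utils.py` is not part of this diff. A hypothetical sketch of what such helpers typically look like for COPA-style records (fields `premise`, `choice1`, `choice2`, and a `question` field that is either "cause" or "effect"); the Indonesian connectives below are assumptions, not copied from the actual file:

```python
# Hypothetical reconstruction of the helpers referenced in standard.yaml.
# Field names follow the XCOPA/COPA convention; the connectives are assumed.
CONNECTOR = {"cause": "karena", "effect": "maka"}  # "because" / "therefore"


def doc_to_text_id(doc):
    # Premise with its trailing period dropped, joined to the cause/effect
    # connective so each choice can complete the sentence.
    return doc["premise"].strip().rstrip(".") + " " + CONNECTOR[doc["question"]]


def doc_to_choice(doc):
    # Both candidate continuations, lowercased at the first character so they
    # read as mid-sentence completions of the prompt.
    return [c[0].lower() + c[1:] for c in (doc["choice1"], doc["choice2"])]
```

Under this scheme the `label` field (0 or 1) in `doc_to_target` indexes into the two-element choice list.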