diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/README.md b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/README.md new file mode 100644 index 0000000000000000000000000000000000000000..90f8e44bb05394cb95c121946febbaaad6c48d27 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/README.md @@ -0,0 +1,94 @@ +# MGSM + +### Paper + +Title: `Language Models are Multilingual Chain-of-Thought Reasoners` + +Abstract: https://arxiv.org/abs/2210.03057 + +Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems, proposed in the paper [Language models are multilingual chain-of-thought reasoners](http://arxiv.org/abs/2210.03057). + +The same 250 problems from [GSM8K](https://arxiv.org/abs/2110.14168) are each translated via human annotators in 10 languages. The 10 languages are: +- Spanish +- French +- German +- Russian +- Chinese +- Japanese +- Thai +- Swahili +- Bengali +- Telugu + +GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning. + +You can find the input and targets for each of the ten languages (and English) as `.tsv` files. +We also include few-shot exemplars that are also manually translated from each language in `exemplars.py`. 
+ +Homepage: https://github.com/google-research/url-nlp/tree/main/mgsm + + +### Citation + +``` +@misc{cobbe2021training, + title={Training Verifiers to Solve Math Word Problems}, + author={Karl Cobbe and Vineet Kosaraju and Mohammad Bavarian and Jacob Hilton and Reiichiro Nakano and Christopher Hesse and John Schulman}, + year={2021}, + eprint={2110.14168}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} +@misc{shi2022language, + title={Language Models are Multilingual Chain-of-Thought Reasoners}, + author={Freda Shi and Mirac Suzgun and Markus Freitag and Xuezhi Wang and Suraj Srivats and Soroush Vosoughi and Hyung Won Chung and Yi Tay and Sebastian Ruder and Denny Zhou and Dipanjan Das and Jason Wei}, + year={2022}, + eprint={2210.03057}, + archivePrefix={arXiv}, + primaryClass={cs.CL} +} +``` + +### Groups and Tasks + +#### Groups + +* `mgsm_direct`: Direct question + * `mgsm_direct_bn`: Bengali + * `mgsm_direct_de`: German + * `mgsm_direct_en`: English + * `mgsm_direct_es`: Spanish + * `mgsm_direct_fr`: French + * `mgsm_direct_ja`: Japanese + * `mgsm_direct_ru`: Russian + * `mgsm_direct_sw`: Swahili + * `mgsm_direct_te`: Telugu + * `mgsm_direct_th`: Thai + * `mgsm_direct_zh`: Chinese +* `mgsm_cot_native`: Question with Answer followed by CoT prompt in the same language as the dataset. + * `mgsm_cot_native_bn`: Bengali + * `mgsm_cot_native_de`: German + * `mgsm_cot_native_en`: English + * `mgsm_cot_native_es`: Spanish + * `mgsm_cot_native_fr`: French + * `mgsm_cot_native_ja`: Japanese + * `mgsm_cot_native_ru`: Russian + * `mgsm_cot_native_sw`: Swahili + * `mgsm_cot_native_te`: Telugu + * `mgsm_cot_native_th`: Thai + * `mgsm_cot_native_zh`: Chinese + +Exemplar Samples: https://github.com/google-research/url-nlp/blob/main/mgsm/exemplars.py + +### Checklist + +For adding novel benchmarks/datasets to the library: +* [ ] Is the task an existing benchmark in the literature? + * [ ] Have you referenced the original paper that introduced the task? 
+ * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test? + + +If other tasks on this dataset are already supported: +* [ ] Is the "Main" variant of this task clearly denoted? +* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? +* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/direct_yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/direct_yaml new file mode 100644 index 0000000000000000000000000000000000000000..3a265cb025916a00807fefd7c3f39466a4ce80ae --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/direct_yaml @@ -0,0 +1,35 @@ +# This file will be included in the generated language-specific task configs. +# It doesn't have a yaml file extension as it is not meant to be imported directly +# by the harness. +group: mgsm_direct +dataset_path: juletxara/mgsm +dataset_name: null # Overridden by language-specific config. 
+output_type: generate_until +training_split: train +test_split: test +target_delimiter: "" +generation_kwargs: + until: + - "\n\n" + - "\n" + do_sample: false + temperature: 0.0 +filter_list: + - name: remove_whitespace + filter: + - function: remove_whitespace + - function: take_first + - filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true +metadata: + version: 2.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_bn.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_bn.yaml new file mode 100644 index 0000000000000000000000000000000000000000..08e7125127eabeda6fdc08a6a3edd83c84ea277e --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_bn.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: bn +doc_to_target: '{% if answer is not none %}{{answer[17:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"প্রশ্ন: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'প্রশ্ন:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_bn diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_de.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_de.yaml new file mode 100644 index 0000000000000000000000000000000000000000..24bc43eda3eaa1815919c9abc7d05697f53be309 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_de.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: de +doc_to_target: '{% if answer is not none %}{{answer[29:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none 
%}{{question+"\nAntwort:"}}{% else %}{{"Frage: "+question+"\nAntwort:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Frage:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_de diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_en.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_en.yaml new file mode 100644 index 0000000000000000000000000000000000000000..f7ef407d39f7addb0688366cfd98005ee7a8da6b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_en.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: en +doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Question: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Question:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_en diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_es.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_es.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a6c3c1fd7ed85050098cb4db48db2bdbb86c7db6 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_es.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: es +doc_to_target: '{% if answer is not none %}{{answer[23:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nRespuesta:"}}{% else %}{{"Pregunta: "+question+"\nRespuesta:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Pregunta:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_es diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_fr.yaml 
b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_fr.yaml new file mode 100644 index 0000000000000000000000000000000000000000..993c181a97d59c71ee50b67d641995296d373e58 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_fr.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: fr +doc_to_target: '{% if answer is not none %}{{answer[26:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nRéponse :"}}{% else %}{{"Question : "+question+"\nRéponse :"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Question :' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_fr diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_ja.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_ja.yaml new file mode 100644 index 0000000000000000000000000000000000000000..7de11a486d4c5eaf7a2675fec8c9812f7beae0c0 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_ja.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: ja +doc_to_target: '{% if answer is not none %}{{answer[11:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"問題: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - '問題:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_ja diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_ru.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_ru.yaml new file mode 100644 index 0000000000000000000000000000000000000000..30d1618faacf5712154132b200b333e519426b95 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_ru.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: ru +doc_to_target: '{% if answer is not 
none %}{{answer[18:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Задача: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Задача:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_ru diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_sw.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_sw.yaml new file mode 100644 index 0000000000000000000000000000000000000000..0357902d4eea32b0f4619e32f6806599caac4ae5 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_sw.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: sw +doc_to_target: '{% if answer is not none %}{{answer[25:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"Swali: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Swali:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_sw diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_te.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_te.yaml new file mode 100644 index 0000000000000000000000000000000000000000..4abdc7e78ec0ddd597d1ff2210a3474ad397a30a --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_te.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: te +doc_to_target: '{% if answer is not none %}{{answer[19:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"ప్రశ్న: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'ప్రశ్న:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_te diff --git 
a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_th.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_th.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fcf35a6721ab7faa221e023483c7630040b0e72f --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_th.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: th +doc_to_target: '{% if answer is not none %}{{answer[18:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"โจทย์: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'โจทย์:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_th diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_zh.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_zh.yaml new file mode 100644 index 0000000000000000000000000000000000000000..283e63f8bcd9f910ea9aa7560ed1c68819c0351a --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/direct/mgsm_direct_zh.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: zh +doc_to_target: '{% if answer is not none %}{{answer[6:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nAnswer:"}}{% else %}{{"问题: "+question+"\nAnswer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - '问题:' + - + - <|im_end|> +include: direct_yaml +task: mgsm_direct_zh diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/cot_yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/cot_yaml new file mode 100644 index 0000000000000000000000000000000000000000..f4d502ee52f4389d4331be7dcde287d1c47c3f59 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/cot_yaml @@ -0,0 +1,36 @@ +# This file will be included in the 
generated language-specific task configs. +# It doesn't have a yaml file extension as it is not meant to be imported directly +# by the harness. +group: mgsm_en_cot +dataset_path: juletxara/mgsm +dataset_name: null # Overridden by language-specific config. +output_type: generate_until +training_split: train +test_split: test +generation_kwargs: + until: + - "\n\n" + - "\n" + do_sample: false + temperature: 0.0 +target_delimiter: " " +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true +filter_list: + - name: "strict-match" + filter: + - function: "regex" + regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)" + - function: "take_first" + - filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +metadata: + version: 2.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_bn.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_bn.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b1c3c2fcd75827bf0c574090bb2adbc3890bdaf4 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_bn.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: bn +doc_to_target: '{% if answer is not none %}{{answer[17:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"প্রশ্ন: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'প্রশ্ন:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_bn diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_de.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_de.yaml new file mode 100644 index 
0000000000000000000000000000000000000000..c2362fb7ac0944da0eae570963603275d459a254 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_de.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: de +doc_to_target: '{% if answer is not none %}{{answer[29:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Frage: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Frage:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_de diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_en.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_en.yaml new file mode 100644 index 0000000000000000000000000000000000000000..f27a616487aadcda9ac0f6f4e549d9bcd8e26dc1 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_en.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: en +doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Question:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_en diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_es.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_es.yaml new file mode 100644 index 0000000000000000000000000000000000000000..cc748306a473dd11beace7d35ac7453f187c7abb --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_es.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: es +doc_to_target: '{% if answer is not none %}{{answer[23:]}}{% else %}{{answer_number|string}}{% endif 
%}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Pregunta: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Pregunta:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_es diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_fr.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_fr.yaml new file mode 100644 index 0000000000000000000000000000000000000000..d36dd813a3b86b6300620ec5c74ad0154017edf9 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_fr.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: fr +doc_to_target: '{% if answer is not none %}{{answer[26:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question : "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Question :' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_fr diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_ja.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_ja.yaml new file mode 100644 index 0000000000000000000000000000000000000000..c98060357ebd1ed60b61555c954a035b9e0080f6 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_ja.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: ja +doc_to_target: '{% if answer is not none %}{{answer[11:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"問題: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - '問題:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_ja diff --git 
a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_ru.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_ru.yaml new file mode 100644 index 0000000000000000000000000000000000000000..2bfeb1dafe3cbd989ba3999394b1ea9a294504f5 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_ru.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: ru +doc_to_target: '{% if answer is not none %}{{answer[18:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Задача: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Задача:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_ru diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_sw.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_sw.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6f37cd3b87eb3660a701eec29ca1d51cc3c630e4 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_sw.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: sw +doc_to_target: '{% if answer is not none %}{{answer[25:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Swali: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'Swali:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_sw diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_te.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_te.yaml new file mode 100644 index 0000000000000000000000000000000000000000..75da745da1b6c27350be39d9e7c535c1d3c93168 --- /dev/null +++ 
b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_te.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: te +doc_to_target: '{% if answer is not none %}{{answer[19:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"ప్రశ్న: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'ప్రశ్న:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_te diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_th.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_th.yaml new file mode 100644 index 0000000000000000000000000000000000000000..0ff2177b782ef3c939dd649c484a9b5a83501333 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_th.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: th +doc_to_target: '{% if answer is not none %}{{answer[18:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"โจทย์: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - 'โจทย์:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_th diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_zh.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_zh.yaml new file mode 100644 index 0000000000000000000000000000000000000000..f45004aacfd93bc4786b9ebd42cc6283d9a31785 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/en_cot/mgsm_en_cot_zh.yaml @@ -0,0 +1,12 @@ +# Generated by utils.py +dataset_name: zh +doc_to_target: '{% if answer is not none %}{{answer[6:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"问题: 
"+question+"\nStep-by-Step Answer:"}}{% endif %}' +generation_kwargs: + do_sample: false + until: + - '问题:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_en_cot_zh diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/gen_yaml.sh b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/gen_yaml.sh new file mode 100644 index 0000000000000000000000000000000000000000..27cbbcfdc7ae6bddb463de0c7ceb8ec467ec9c3b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/gen_yaml.sh @@ -0,0 +1,5 @@ +#!/bin/bash + +python utils.py --overwrite --output-dir direct --mode direct +python utils.py --overwrite --output-dir en_cot --mode en-cot +python utils.py --overwrite --output-dir native_cot --mode native-cot diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/cot_yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/cot_yaml new file mode 100644 index 0000000000000000000000000000000000000000..dbba882225b1d7c9fbe10352c64a381c97a547c7 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/cot_yaml @@ -0,0 +1,31 @@ +# This file will be included in the generated language-specific task configs. +# It doesn't have a yaml file extension as it is not meant to be imported directly +# by the harness. +group: mgsm_cot_native +dataset_path: juletxara/mgsm +dataset_name: null # Overridden by language-specific config. 
+output_type: generate_until +training_split: train +test_split: test +# target_delimiter: "" +generation_kwargs: + until: + - "\n\n" + - "\n" + do_sample: false + temperature: 0.0 +target_delimiter: " " +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true +filter_list: + - name: "get-answer" + filter: + - function: "regex" + regex_pattern: "The answer is (\\-?[0-9\\.\\,]+)" + - function: "take_first" +metadata: + version: 3.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_bn.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_bn.yaml new file mode 100644 index 0000000000000000000000000000000000000000..eb58c8753784c250ce24860fd21211b62ef0cc31 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_bn.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: bn +doc_to_target: '{% if answer is not none %}{{answer[17:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nধাপে ধাপে উত্তর:"}}{% else %}{{"প্রশ্ন: "+question+"\nধাপে ধাপে উত্তর:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: The answer is (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'প্রশ্ন:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_bn diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_de.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_de.yaml new file mode 100644 index 0000000000000000000000000000000000000000..4f4701796945b74fe884a73d931debdf2c7b5ce9 --- /dev/null +++ 
b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_de.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: de +doc_to_target: '{% if answer is not none %}{{answer[29:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nSchritt-für-Schritt-Antwort:"}}{% else %}{{"Frage: "+question+"\nSchritt-für-Schritt-Antwort:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: Die Antwort lautet (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'Frage:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_de diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_en.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_en.yaml new file mode 100644 index 0000000000000000000000000000000000000000..c2033b335fb51ec1310f98b4e905f18231c1b68a --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_en.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: en +doc_to_target: '{% if answer is not none %}{{answer[21:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nStep-by-Step Answer:"}}{% else %}{{"Question: "+question+"\nStep-by-Step Answer:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: The answer is (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'Question:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_en diff --git 
a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_es.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_es.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6c39fb9c4740ac571db8165a80fdd7efa108f56b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_es.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: es +doc_to_target: '{% if answer is not none %}{{answer[23:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nRespuesta paso a paso:"}}{% else %}{{"Pregunta: "+question+"\nRespuesta paso a paso:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: La respuesta es (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'Pregunta:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_es diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_fr.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_fr.yaml new file mode 100644 index 0000000000000000000000000000000000000000..b52b881f7a3f8b30d64ce8eb8ee6b308673626c2 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_fr.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: fr +doc_to_target: '{% if answer is not none %}{{answer[26:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nRéponse étape par étape :"}}{% else %}{{"Question : "+question+"\nRéponse étape par étape :"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: La réponse est (\-?[0-9\.\,]+) + - function: take_first + 
name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'Question :' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_fr diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_ja.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_ja.yaml new file mode 100644 index 0000000000000000000000000000000000000000..8e56bd0b15150e1e435b4d304255c0a751246e86 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_ja.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: ja +doc_to_target: '{% if answer is not none %}{{answer[11:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nステップごとの答え:"}}{% else %}{{"問題: "+question+"\nステップごとの答え:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: 答えは(\-?[0-9\.\,]+)です。 + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - '問題:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_ja diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_ru.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_ru.yaml new file mode 100644 index 0000000000000000000000000000000000000000..3cff6267a067da1e9d10cfa66aaad7c06618f7ad --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_ru.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: ru +doc_to_target: '{% if answer is not none %}{{answer[18:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer 
is not none %}{{question+"\nПошаговоерешение:"}}{% else %}{{"Задача: "+question+"\nПошаговоерешение:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: Ответ — (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'Задача:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_ru diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_sw.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_sw.yaml new file mode 100644 index 0000000000000000000000000000000000000000..4da793dbc78485cb8167a6fc069b87f7590c960f --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_sw.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: sw +doc_to_target: '{% if answer is not none %}{{answer[25:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nJibu la Hatua kwa Hatua:"}}{% else %}{{"Swali: "+question+"\nJibu la Hatua kwa Hatua:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: Jibu ni (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'Swali:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_sw diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_te.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_te.yaml new file mode 100644 index 0000000000000000000000000000000000000000..1cdbaca8893b6ee626084135c7a64ccd02737b81 --- /dev/null +++ 
b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_te.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: te +doc_to_target: '{% if answer is not none %}{{answer[19:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nదశలవారీగా సమాధానం:"}}{% else %}{{"ప్రశ్న: "+question+"\nదశలవారీగా సమాధానం:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: సమాధానం (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'ప్రశ్న:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_te diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_th.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_th.yaml new file mode 100644 index 0000000000000000000000000000000000000000..6931d3a2ff44ab0de25a31a7624f2cd104c655c2 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_th.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: th +doc_to_target: '{% if answer is not none %}{{answer[18:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\nคำตอบทีละขั้นตอน:"}}{% else %}{{"โจทย์: "+question+"\nคำตอบทีละขั้นตอน:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: คำตอบคือ (\-?[0-9\.\,]+) + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - 'โจทย์:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_th diff --git 
a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_zh.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_zh.yaml new file mode 100644 index 0000000000000000000000000000000000000000..3f0d7e2dcecaecee05671a636b0a3e27eeeee95e --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/native_cot/mgsm_native_cot_zh.yaml @@ -0,0 +1,24 @@ +# Generated by utils.py +dataset_name: zh +doc_to_target: '{% if answer is not none %}{{answer[6:]}}{% else %}{{answer_number|string}}{% endif %}' +doc_to_text: '{% if answer is not none %}{{question+"\n逐步解答:"}}{% else %}{{"问题: "+question+"\n逐步解答:"}}{% endif %}' +filter_list: +- filter: + - function: regex + regex_pattern: 答案是 (\-?[0-9\.\,]+)。 + - function: take_first + name: strict-match +- filter: + - function: regex + group_select: -1 + regex_pattern: (-?[$0-9.,]{2,})|(-?[0-9]+) + - function: take_first + name: flexible-extract +generation_kwargs: + do_sample: false + until: + - '问题:' + - + - <|im_end|> +include: cot_yaml +task: mgsm_native_cot_zh diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/utils.py b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..116214f9f4c45ffb9a04757ca41c58114180b259 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mgsm/utils.py @@ -0,0 +1,228 @@ +import argparse + +import yaml + + +LANGUAGES = { + "bn": { # Bengali + # "QUESTION": "প্রশ্ন:", + "QUESTION": "\u09aa\u09cd\u09b0\u09b6\u09cd\u09a8:", + # "ANSWER": "ধাপে ধাপে উত্তর:", + "ANSWER": "\u09a7\u09be\u09aa\u09c7 \u09a7\u09be\u09aa\u09c7 \u0989\u09a4\u09cd\u09a4\u09b0:", + "DIRECT": "Answer:", + "REGEX": "The answer is (\\-?[0-9\\.\\,]+)", + }, + "de": { # German + "QUESTION": "Frage:", + # "ANSWER": "Schritt-für-Schritt-Antwort:", + "ANSWER": "Schritt-f\u00fcr-Schritt-Antwort:", + "DIRECT": "Antwort:", + "REGEX": "Die Antwort lautet 
(\\-?[0-9\\.\\,]+)", + }, + "en": { # English + "QUESTION": "Question:", + "ANSWER": "Step-by-Step Answer:", + "DIRECT": "Answer:", + "REGEX": "The answer is (\\-?[0-9\\.\\,]+)", + }, + "es": { # Spanish + "QUESTION": "Pregunta:", + "ANSWER": "Respuesta paso a paso:", + "DIRECT": "Respuesta:", + "REGEX": "La respuesta es (\\-?[0-9\\.\\,]+)", + }, + "fr": { # French + "QUESTION": "Question :", + # "ANSWER": "Réponse étape par étape :" + "ANSWER": "R\u00e9ponse \u00e9tape par \u00e9tape :", + # "DIRECT": "Réponse :", + "DIRECT": "R\u00e9ponse :", + # "REGEX": "La réponse est (\\-?[0-9\\.\\,]+)", + "REGEX": "La r\u00e9ponse est (\\-?[0-9\\.\\,]+)", + }, + "ru": { # Russian + # "QUESTION": "Задача:", + "QUESTION": "\u0417\u0430\u0434\u0430\u0447\u0430:", + # "ANSWER": "Пошаговоерешение:", + "ANSWER": "\u041f\u043e\u0448\u0430\u0433\u043e\u0432\u043e\u0435\u0440\u0435\u0448\u0435\u043d\u0438\u0435:", + "DIRECT": "Answer:", + # "REGEX": "Ответ — (\\-?[0-9\\.\\,]+)", + "REGEX": "\u041e\u0442\u0432\u0435\u0442 \u2014 (\\-?[0-9\\.\\,]+)", + }, + "sw": { # Swahili + "QUESTION": "Swali:", + "ANSWER": "Jibu la Hatua kwa Hatua:", + "DIRECT": "Answer:", + "REGEX": "Jibu ni (\\-?[0-9\\.\\,]+)", + }, + "te": { # Telugu + # "QUESTION": "ప్రశ్న:", + "QUESTION": "\u0c2a\u0c4d\u0c30\u0c36\u0c4d\u0c28:", + # "ANSWER": "దశలవారీగా సమాధానం:", + "ANSWER": "\u0c26\u0c36\u0c32\u0c35\u0c3e\u0c30\u0c40\u0c17\u0c3e \u0c38\u0c2e\u0c3e\u0c27\u0c3e\u0c28\u0c02:", + "DIRECT": "Answer:", + # "REGEX": "సమాధానం (\\-?[0-9\\.\\,]+)", + "REGEX": "\u0c38\u0c2e\u0c3e\u0c27\u0c3e\u0c28\u0c02 (\\-?[0-9\\.\\,]+)", + }, + "th": { # Thai + # "QUESTION": "โจทย์:", + "QUESTION": "\u0e42\u0e08\u0e17\u0e22\u0e4c:", + # "ANSWER": "คำตอบทีละขั้นตอน:", + "ANSWER": "\u0e04\u0e33\u0e15\u0e2d\u0e1a\u0e17\u0e35\u0e25\u0e30\u0e02\u0e31\u0e49\u0e19\u0e15\u0e2d\u0e19:", + "DIRECT": "Answer:", + # "REGEX": "คำตอบคือ (\\-?[0-9\\.\\,]+)", + "REGEX": "\u0e04\u0e33\u0e15\u0e2d\u0e1a\u0e04\u0e37\u0e2d (\\-?[0-9\\.\\,]+)", + }, + 
"ja": { # Japanese + # "QUESTION": "問題:", + "QUESTION": "\u554f\u984c:", + # "ANSWER": "ステップごとの答え:", + "ANSWER": "\u30b9\u30c6\u30c3\u30d7\u3054\u3068\u306e\u7b54\u3048:", + "DIRECT": "Answer:", + # "REGEX": "答えは(\\-?[0-9\\.\\,]+)です。", + "REGEX": "\u7b54\u3048\u306f(\\-?[0-9\\.\\,]+)\u3067\u3059\u3002", + }, + "zh": { # Chinese + # "QUESTION": "问题:", + "QUESTION": "\u95ee\u9898:", + # "ANSWER": "逐步解答:", + "ANSWER": "\u9010\u6b65\u89e3\u7b54:", + "DIRECT": "Answer:", + # "REGEX": "答案是 (\\-?[0-9\\.\\,]+)。", + "REGEX": "\u7b54\u6848\u662f (\\-?[0-9\\.\\,]+)\u3002", + }, +} + + +def add_regex_pattern(regex_pattern): + if regex_pattern is None: + return {} + return { + "filter_list": [ + { + "name": "strict-match", + "filter": [ + { + "function": "regex", + "regex_pattern": f"""{regex_pattern}""", + }, + { + "function": "take_first", + }, + ], + }, + { + "name": "flexible-extract", + "filter": [ + { + "function": "regex", + "regex_pattern": """(-?[$0-9.,]{2,})|(-?[0-9]+)""", + "group_select": -1, + }, + { + "function": "take_first", + }, + ], + }, + ], + } + + +def gen_lang_yamls(output_dir: str, overwrite: bool, mode: str) -> None: + """ + Generate a yaml file for each language. + + :param output_dir: The directory to output the files to. + :param overwrite: Whether to overwrite files if they already exist. 
+ """ + err = [] + for lang in LANGUAGES.keys(): + try: + QUESTION = LANGUAGES[lang]["QUESTION"] + + yaml_template = "cot_yaml" + filter_list = {} + DELIMITER = None + if mode == "direct": + ANSWER = LANGUAGES[lang]["DIRECT"] + REGEX = None + task_name = f"mgsm_direct_{lang}" + yaml_template = "direct_yaml" + elif mode == "native-cot": + ANSWER = LANGUAGES[lang]["ANSWER"] + REGEX = LANGUAGES[lang]["REGEX"] + task_name = f"mgsm_native_cot_{lang}" + filter_list = add_regex_pattern(REGEX) + DELIMITER = "" if lang in ["zh", "ja"] else None + elif mode == "en-cot": + ANSWER = LANGUAGES["en"]["ANSWER"] + REGEX = LANGUAGES["en"]["REGEX"] + task_name = f"mgsm_en_cot_{lang}" + + file_name = f"{task_name}.yaml" + ANSWER_TO_SKIP = len(LANGUAGES[lang]["ANSWER"]) + 1 + with open( + f"{output_dir}/{file_name}", "w" if overwrite else "x", encoding="utf8" + ) as f: + f.write("# Generated by utils.py\n") + yaml.dump( + { + "include": yaml_template, + "dataset_name": lang, + "task": f"{task_name}", + "doc_to_text": f"""{{% if answer is not none %}}""" + f"""{{{{question+"\\n{ANSWER}"}}}}""" + f"""{{% else %}}""" + f"""{{{{"{QUESTION} "+question+"\\n{ANSWER}"}}}}""" + f"""{{% endif %}}""", + "doc_to_target": f"""{{% if answer is not none %}}""" + f"""{{{{answer[{ANSWER_TO_SKIP}:]}}}}""" + f"""{{% else %}}""" + f"""{{{{answer_number|string}}}}""" + f"""{{% endif %}}""", + **filter_list, + "generation_kwargs": { + "until": [QUESTION, "", "<|im_end|>"], + "do_sample": False, + }, + **({"target_delimiter": DELIMITER} if DELIMITER else {}), + }, + f, + allow_unicode=True, + width=float("inf"), + ) + except FileExistsError: + err.append(file_name) + + if len(err) > 0: + raise FileExistsError( + "Files were not created because they already exist (use --overwrite flag):" + f" {', '.join(err)}" + ) + + +def main() -> None: + """Parse CLI args and generate language-specific yaml files.""" + parser = argparse.ArgumentParser() + parser.add_argument( + "--overwrite", + default=False, + 
action="store_true", + help="Overwrite files if they already exist", + ) + parser.add_argument( + "--output-dir", default=".", help="Directory to write yaml files to" + ) + parser.add_argument( + "--mode", + default="native-cot", + choices=["direct", "native-cot", "en-cot"], + help="Mode of chain-of-thought", + ) + args = parser.parse_args() + + gen_lang_yamls(output_dir=args.output_dir, overwrite=args.overwrite, mode=args.mode) + + +if __name__ == "__main__": + main() diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/README.md b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/README.md new file mode 100644 index 0000000000000000000000000000000000000000..53694cdc8ab2ea6578e4f7403d23cfaf9486ee9b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/README.md @@ -0,0 +1,59 @@ +# mmlu_pro + +### Paper + +Title: `MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark` + +Abstract: `In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. 
With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.` + +Homepage: https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro + +### Citation + +```bibtex +@misc{wang2024mmlupro, + title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark}, + author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen}, + year={2024}, + eprint={2406.01574}, + archivePrefix={arXiv}, + primaryClass={cs.CL}
+} +``` + +### Groups and Tasks + +#### Groups + +* `mmlu_pro`: 'All 14 subjects of the mmlu_pro dataset, evaluated following the methodology in mmlu's original implementation' + +#### Tasks + +The following tasks evaluate subjects in the mmlu_pro dataset +- `mmlu_pro_biology` +- `mmlu_pro_business` +- `mmlu_pro_chemistry` +- `mmlu_pro_computer_science` +- `mmlu_pro_economics` +- `mmlu_pro_engineering` +- `mmlu_pro_health` +- `mmlu_pro_history` +- `mmlu_pro_law` +- `mmlu_pro_math` +- `mmlu_pro_other` +- `mmlu_pro_philosophy` +- `mmlu_pro_physics` +- `mmlu_pro_psychology` + +### Checklist + +For adding novel benchmarks/datasets to the library: +* [x] Is the task an existing benchmark in the literature? + * [x] Have you referenced the original paper that introduced the task? + * [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test? + + +If other tasks on this dataset are already supported: +* [ ] Is the "Main" variant of this task clearly denoted? +* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates? +* [ ] Have you noted which, if any, published evaluation setups are matched by this variant? 
diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/_default_template_yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/_default_template_yaml new file mode 100644 index 0000000000000000000000000000000000000000..c96aa0c1b4ec37456fca77db0661cdd2497bfd24 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/_default_template_yaml @@ -0,0 +1,33 @@ +dataset_path: TIGER-Lab/MMLU-Pro +test_split: test +fewshot_split: validation +fewshot_config: + sampler: first_n + doc_to_text: !function utils.fewshot_to_text + doc_to_target: "" +output_type: generate_until +doc_to_text: !function utils.doc_to_text +doc_to_target: answer +filter_list: + - name: "custom-extract" + filter: + - function: "regex" + regex_pattern: 'answer is \(?([ABCDEFGHIJ])\)?' + # regex_pattern: r".*[aA]nswer:\s*([A-J])", + - function: "take_first" +generation_kwargs: + until: + - "</s>" + - "Q:" + - "<|im_end|>" + do_sample: false + temperature: 0.0 +num_fewshot: 5 +metric_list: + - metric: exact_match + aggregation: mean + higher_is_better: true + ignore_case: true + ignore_punctuation: true +metadata: + version: 0.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_biology.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_biology.yaml new file mode 100644 index 0000000000000000000000000000000000000000..825f2ad2d94094f59bce5d5ce11da07515d4f026 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_biology.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about biology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_biology" +task_alias: "biology" +process_docs: !function utils.process_biology diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_business.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_business.yaml new file mode 100644 index 0000000000000000000000000000000000000000..f0e5f86a0b23fbf506417d3105fbfe3b121c4656 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_business.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about business. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_business" +task_alias: "business" +process_docs: !function utils.process_business diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_chemistry.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_chemistry.yaml new file mode 100644 index 0000000000000000000000000000000000000000..84510942944f76235673ae1d08cf3cc9e0eac10f --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_chemistry.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about chemistry. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_chemistry" +task_alias: "chemistry" +process_docs: !function utils.process_chemistry diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_computer_science.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_computer_science.yaml new file mode 100644 index 0000000000000000000000000000000000000000..51ca4c29dc853a6b4a94d87c170fdb5eb0a871a6 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_computer_science.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about computer science. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_computer_science" +task_alias: "computer_science" +process_docs: !function utils.process_computer_science diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_economics.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_economics.yaml new file mode 100644 index 0000000000000000000000000000000000000000..9b058739f7a947f3a5724fb85e15db8253ddcd93 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_economics.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about economics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_economics" +task_alias: "economics" +process_docs: !function utils.process_economics diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_engineering.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_engineering.yaml new file mode 100644 index 0000000000000000000000000000000000000000..dbb265bd96314a956e83356fe303a93298c804e1 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_engineering.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about engineering. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_engineering" +task_alias: "engineering" +process_docs: !function utils.process_engineering diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_health.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_health.yaml new file mode 100644 index 0000000000000000000000000000000000000000..ed1a3a538b741e852fbae43dcfda579b343adc50 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_health.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about health. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_health" +task_alias: "health" +process_docs: !function utils.process_health diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_history.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_history.yaml new file mode 100644 index 0000000000000000000000000000000000000000..5ae6fb9c2748ec1c6932e352a5c7eb77b2f6e89c --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_history.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about history. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_history" +task_alias: "history" +process_docs: !function utils.process_history diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_law.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_law.yaml new file mode 100644 index 0000000000000000000000000000000000000000..1197dff38d344c2785447762ce789dcf74c6727b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_law.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about law. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_law" +task_alias: "law" +process_docs: !function utils.process_law diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_math.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_math.yaml new file mode 100644 index 0000000000000000000000000000000000000000..67b3b46dfa49dad7e520c3251457a3c141d0fe4b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_math.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about math. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_math" +task_alias: "math" +process_docs: !function utils.process_math diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_other.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_other.yaml new file mode 100644 index 0000000000000000000000000000000000000000..918608b936e639d08cfd653b173a8ced158d61fa --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_other.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about other. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_other" +task_alias: "other" +process_docs: !function utils.process_other diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_philosophy.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_philosophy.yaml new file mode 100644 index 0000000000000000000000000000000000000000..9eae2b39b7ffdf718bb07e6e84b587f701688a2c --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_philosophy.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about philosophy. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_philosophy" +task_alias: "philosophy" +process_docs: !function utils.process_philosophy diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_physics.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_physics.yaml new file mode 100644 index 0000000000000000000000000000000000000000..00c39623d86b8c84c85e42dab4a05f5a98ab02ef --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_physics.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." 
+include: "_default_template_yaml" +task: "mmlu_pro_physics" +task_alias: "physics" +process_docs: !function utils.process_physics diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_psychology.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_psychology.yaml new file mode 100644 index 0000000000000000000000000000000000000000..5258bced6510479e4e59b071b3f2d59c775363e6 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/mmlu_pro_psychology.yaml @@ -0,0 +1,5 @@ +description: "The following are multiple choice questions (with answers) about psychology. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice." +include: "_default_template_yaml" +task: "mmlu_pro_psychology" +task_alias: "psychology" +process_docs: !function utils.process_psychology diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/utils.py b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..15c8b39bbb7a5da0b62f5583765bfdc2fc1442e6 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/mmlu_pro/utils.py @@ -0,0 +1,63 @@ +from functools import partial + + +choices = [ + "A", + "B", + "C", + "D", + "E", + "F", + "G", + "H", + "I", + "J", + "K", + "L", + "M", + "N", + "O", + "P", +] + + +def format_cot_example(example, including_answer=True): + prompt = "Question:\n" + question = example["question"] + options = example["options"] + prompt += question + "\n" + prompt += "Options:\n" + for i, opt in enumerate(options): + prompt += "{}. {}\n".format(choices[i], opt) + if including_answer: + cot_content = example["cot_content"].replace( + "A: Let's think step by step.", "Answer: Let's think step by step." + ) + prompt += cot_content + "\n\n" + else: + prompt += "Answer: Let's think step by step." 
+ return prompt + + +doc_to_text = partial(format_cot_example, including_answer=False) +fewshot_to_text = partial(format_cot_example, including_answer=True) + + +def process_docs(dataset, subject): + return dataset.filter(lambda x: x["category"] == subject) + + +process_biology = partial(process_docs, subject="biology") +process_business = partial(process_docs, subject="business") +process_chemistry = partial(process_docs, subject="chemistry") +process_computer_science = partial(process_docs, subject="computer_science") +process_economics = partial(process_docs, subject="economics") +process_engineering = partial(process_docs, subject="engineering") +process_health = partial(process_docs, subject="health") +process_history = partial(process_docs, subject="history") +process_law = partial(process_docs, subject="law") +process_math = partial(process_docs, subject="math") +process_other = partial(process_docs, subject="other") +process_philosophy = partial(process_docs, subject="philosophy") +process_physics = partial(process_docs, subject="physics") +process_psychology = partial(process_docs, subject="psychology") diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/_generate_configs.py b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/_generate_configs.py new file mode 100644 index 0000000000000000000000000000000000000000..04d38ed5c74a7428baac602e3a9f1e512c55f92e --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/_generate_configs.py @@ -0,0 +1,27 @@ +import datasets +import yaml +from tqdm import tqdm + + +def main() -> None: + dataset_path = "alexandrainst/m_mmlu" + + for task in tqdm(datasets.get_dataset_infos(dataset_path).keys()): + file_name = f"m_mmlu_{task}.yaml" + try: + with open(f"{file_name}", "w") as f: + f.write("# Generated by _generate_configs.py\n") + yaml.dump( + { + "include": "_default_yaml", + "task": f"{dataset_path.split('/')[-1]}_{task}", + 
"dataset_name": task, + }, + f, + ) + except FileExistsError: + pass + + +if __name__ == "__main__": + main() diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_de.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_de.yaml new file mode 100644 index 0000000000000000000000000000000000000000..83aaba9ede84d81c61aa839b59720996a403b4d0 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_de.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: de +include: _default_yaml +task: m_mmlu_de diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_fr.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_fr.yaml new file mode 100644 index 0000000000000000000000000000000000000000..eb8cce6ff8c81edd3177a63a36545b706e0d7997 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_fr.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: fr +include: _default_yaml +task: m_mmlu_fr diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_hu.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_hu.yaml new file mode 100644 index 0000000000000000000000000000000000000000..d824cb768a006f614fa31ff911c9dfffb01bee75 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_hu.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: hu +include: _default_yaml +task: m_mmlu_hu diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_hy.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_hy.yaml new file mode 100644 index 0000000000000000000000000000000000000000..09d2b96d6487c072e71ac66397d670ac9fd1e0b7 --- /dev/null +++ 
b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_hy.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: hy +include: _default_yaml +task: m_mmlu_hy diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_kn.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_kn.yaml new file mode 100644 index 0000000000000000000000000000000000000000..82d026c7e4cdc9a58c0df8a360b86d45267ed00b --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_kn.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: kn +include: _default_yaml +task: m_mmlu_kn diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_nl.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_nl.yaml new file mode 100644 index 0000000000000000000000000000000000000000..df115a68d025e6b2ec05c193ba03d8743c0d9629 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_nl.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: nl +include: _default_yaml +task: m_mmlu_nl diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_pt.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_pt.yaml new file mode 100644 index 0000000000000000000000000000000000000000..de4bb65953e675b4609d6e70ad97c421aeacbd8f --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_pt.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: pt +include: _default_yaml +task: m_mmlu_pt diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_sk.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_sk.yaml new file mode 100644 index 
0000000000000000000000000000000000000000..61589f04760a34b8fb2aa9405bb6dd1121c1448f --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_sk.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: sk +include: _default_yaml +task: m_mmlu_sk diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_ta.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_ta.yaml new file mode 100644 index 0000000000000000000000000000000000000000..2314894c2ba7d851b36be44d2ad99c895712626e --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/okapi/mmlu_multilingual/m_mmlu_ta.yaml @@ -0,0 +1,4 @@ +# Generated by _generate_configs.py +dataset_name: ta +include: _default_yaml +task: m_mmlu_ta diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/README.md b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/README.md new file mode 100644 index 0000000000000000000000000000000000000000..5e65550045ed2e64b9f15302c7883085d8b582a7 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/README.md @@ -0,0 +1,130 @@ +# tinyBenchmarks + +### Paper + +Title: `tinyBenchmarks: evaluating LLMs with fewer examples` + +Abstract: https://arxiv.org/abs/2402.14992 + +The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. For example, we show that to accurately estimate the performance of an LLM on MMLU, a popular multiple-choice QA benchmark consisting of 14K examples, it is sufficient to evaluate this LLM on 100 curated examples. 
We release evaluation tools and tiny versions of popular benchmarks: Open LLM Leaderboard, MMLU, HELM, and AlpacaEval 2.0. Our empirical analysis demonstrates that these tools and tiny benchmarks are sufficient to reliably and efficiently reproduce the original evaluation results. + +Homepage: - + +All configs and utils mirror the ones from their original dataset! + +### Groups and Tasks + +#### Groups + +* `tinyBenchmarks` + +#### Tasks + +* `tinyArc`, `tinyGSM8k`, `tinyHellaswag`, `tinyMMLU`, `tinyTruthfulQA`, `tinyWinogrande` + +### Usage + +*tinyBenchmarks* can evaluate different benchmarks with a fraction of their examples. +To obtain accurate results, this task applies post-processing using the *tinyBenchmarks*-package. +You can install the package by running the following commands on the terminal (for more information see [here](https://github.com/felipemaiapolo/tinyBenchmarks/blob/main/README.md?plain=1)): + +``` :sh +pip install git+https://github.com/felipemaiapolo/tinyBenchmarks +``` + +The value that is returned by the task corresponds to the '**IRT++**'-method from the [original paper](https://arxiv.org/abs/2402.14992). +Evaluate specific tasks individually (e.g. `--tasks tinyHellaswag`) or all [open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) tasks by specifying `--tasks tinyBenchmarks`. + +### Advanced usage + +To obtain the estimated accuracies from all methods from the original paper, the *tinyBenchmarks*-package has to be applied manually. +To do so, run the evaluation with the `--log_samples` and `--output_path` arguments. For example: + +```bash +lm_eval --model hf \ + --model_args pretrained="mistralai/Mistral-7B-Instruct-v0.2" \ + --tasks tinyHellaswag \ + --batch_size 4 \ + --output_path '' \ + --log_samples +``` + +Afterwards, run include the correct `file_path` and run the following script: + +```python +import json +import tinyBenchmarks as tb +import numpy as np + +# Choose benchmark (e.g. 
hellaswag) +benchmark = 'hellaswag' # possible benchmarks: + # ['mmlu','truthfulqa', 'gsm8k', + # 'winogrande', 'arc', 'hellaswag'] + +# Get score vector from output-file (the metric [here `acc_norm`] depends on the benchmark) +file_path = '/' +with open(file_path, 'r') as file: + outputs = json.load(file) + +# Ensuring correct order of outputs +outputs = sorted(outputs, key=lambda x: x['doc_id']) + +y = np.array([float(item['acc_norm']) for item in outputs]) + +### Evaluation +tb.evaluate(y, benchmark) +``` + +### Performance + +We report in the following tables the average estimation error in the test set (using data from the paper) and standard deviation across LLMs. + +#### Open LLM Leaderboard + +Estimating performance for each scenario separately +|| IRT | p-IRT | gp-IRT | +|--|--|--|--| +| TruthfulQA | 0.013 (0.010) | 0.010 (0.009) | 0.011 (0.009) | +| GSM8K | 0.022 (0.017) | 0.029 (0.022) | 0.020 (0.017) | +| Winogrande | 0.022 (0.017) | 0.016 (0.014) | 0.015 (0.013) | +| ARC | 0.022 (0.018) | 0.017 (0.014) | 0.017 (0.013) | +| HellaSwag | 0.013 (0.016) | 0.015 (0.012) | 0.015 (0.012) | +| MMLU | 0.024 (0.017) | 0.016 (0.015) | 0.016 (0.015) | + +Estimating performance for each scenario all at once +|| IRT | p-IRT | gp-IRT | +|--|--|--|--| +| TruthfulQA | 0.013 (0.010) | 0.016 (0.013) | 0.011 (0.009) | +| GSM8K | 0.022 (0.017) | 0.022 (0.017) | 0.020 (0.015) | +| Winogrande | 0.022 (0.017) | 0.011 (0.013) | 0.011 (0.011) | +| ARC | 0.022 (0.018) | 0.012 (0.010) | 0.010 (0.009) | +| HellaSwag | 0.013 (0.016) | 0.011 (0.020) | 0.011 (0.018) | +| MMLU | 0.024 (0.018) | 0.017 (0.017) | 0.015 (0.015) | + + + +### Citation + +``` +@article{polo2024tinybenchmarks, + title={tinyBenchmarks: evaluating LLMs with fewer examples}, + author={Maia Polo, Felipe and Weber, Lucas and Choshen, Leshem and Sun, Yuekai and Xu, Gongjun and Yurochkin, Mikhail}, + journal={arXiv preprint arXiv:2402.14992}, + year={2024} + } +``` + +Please also reference the respective original 
from typing import List

import numpy as np


try:
    import tinyBenchmarks as tb
except ModuleNotFoundError:
    raise ModuleNotFoundError(
        "`tinyBenchmarks` is required for tinyBenchmarks task metric calculation, install via \
`pip install git+https://github.com/felipemaiapolo/tinyBenchmarks`"
    )


def _irt_estimate(items: List[float], benchmark: str, method: str) -> float:
    """Run the tinyBenchmarks IRT evaluation on the per-item scores and
    return the estimate for *benchmark* produced by *method* ("pirt"/"gpirt").

    Shared implementation for all the public aggregation functions below.
    """
    predictions = tb.evaluate(np.array(items), benchmark)
    return predictions[benchmark][method]


def agg_pirt(items: List[float], benchmark: str) -> float:
    """p-IRT aggregate; *benchmark* has no default — callers must say which
    scenario the item scores belong to."""
    return _irt_estimate(items, benchmark, "pirt")


# gp-IRT aggregates, one per supported benchmark. Each keeps its own named
# wrapper (rather than e.g. a functools.partial of the helper) because the
# task YAMLs reference them by name via
# `!function agg_functions.agg_gpirt_<benchmark>` — the public names and
# signatures are unchanged from the original module.
def agg_gpirt_arc(items: List[float], benchmark: str = "arc") -> float:
    return _irt_estimate(items, benchmark, "gpirt")


def agg_gpirt_gsm8k(items: List[float], benchmark: str = "gsm8k") -> float:
    return _irt_estimate(items, benchmark, "gpirt")


def agg_gpirt_hellaswag(items: List[float], benchmark: str = "hellaswag") -> float:
    return _irt_estimate(items, benchmark, "gpirt")


def agg_gpirt_mmlu(items: List[float], benchmark: str = "mmlu") -> float:
    return _irt_estimate(items, benchmark, "gpirt")


def agg_gpirt_truthfulqa(items: List[float], benchmark: str = "truthfulqa") -> float:
    return _irt_estimate(items, benchmark, "gpirt")


def agg_gpirt_winogrande(items: List[float], benchmark: str = "winogrande") -> float:
    return _irt_estimate(items, benchmark, "gpirt")
"take_first" + - name: "flexible-extract" + filter: + - function: "regex" + group_select: -1 + regex_pattern: "(-?[$0-9.,]{2,})|(-?[0-9]+)" + - function: "take_first" +metadata: + version: 0.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyHellaswag.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyHellaswag.yaml new file mode 100644 index 0000000000000000000000000000000000000000..ba247f8d60b3be2907651b46661a359cd006f5af --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyHellaswag.yaml @@ -0,0 +1,18 @@ +task: tinyHellaswag +dataset_path: tinyBenchmarks/tinyHellaswag +dataset_name: null +output_type: multiple_choice +training_split: train +validation_split: validation +num_fewshot: 10 +test_split: null +process_docs: !function utils_hellaswag.process_docs +doc_to_text: "{{query}}" +doc_to_target: "{{label}}" +doc_to_choice: "choices" +metric_list: + - metric: acc_norm + aggregation: !function agg_functions.agg_gpirt_hellaswag + higher_is_better: true +metadata: + version: 0.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyMMLU.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyMMLU.yaml new file mode 100644 index 0000000000000000000000000000000000000000..603e1f1b789e4980b1300eb6f210636534a40e1d --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyMMLU.yaml @@ -0,0 +1,19 @@ +task: tinyMMLU +dataset_path: tinyBenchmarks/tinyMMLU +dataset_name: all +test_split: test +fewshot_split: dev +num_fewshot: 0 +fewshot_config: + sampler: first_n +output_type: multiple_choice +doc_to_text: "{{input_formatted}}" +doc_to_choice: ["A", "B", "C", "D"] +doc_to_target: answer +num_fewshot: 0 +metric_list: + - metric: acc_norm + aggregation: !function agg_functions.agg_gpirt_mmlu + higher_is_better: true +metadata: + version: 0.0 diff --git 
a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyTruthfulQA_mc1.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyTruthfulQA_mc1.yaml new file mode 100644 index 0000000000000000000000000000000000000000..dfa80a36fcc1525889459051d5987616b01e5955 --- /dev/null +++ b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyTruthfulQA_mc1.yaml @@ -0,0 +1,34 @@ +task: tinyTruthfulQA_mc1 +dataset_path: tinyBenchmarks/tinyTruthfulQA +dataset_name: multiple_choice +output_type: multiple_choice +training_split: null +validation_split: validation +test_split: null +num_fewshot: 0 +doc_to_text: "\ + {% set prompt_qa = '\ + Q: What is human life expectancy in the United States?\n\ + A: Human life expectancy in the United States is 78 years.\n\n\ + Q: Who was president of the United States in 1955?\n\ + A: Dwight D. Eisenhower was president of the United States in 1955.\n\n\ + Q: Which party did he belong to?\n\ + A: He belonged to the Republican Party.\n\n\ + Q: What is the square root of banana?\n\ + A: I have no comment.\n\n\ + Q: How does a telescope work?\n\ + A: Telescopes use lenses or mirrors to focus light and make objects appear closer.\n\n\ + Q: Where were the 1992 Olympics held?\n\ + A: The 1992 Olympics were held in Barcelona, Spain.\ + '%}\ + {{prompt_qa + '\n\nQ: ' + question + '\nA:'}}" +doc_to_target: 0 +doc_to_choice: "{{mc1_targets.choices}}" +should_decontaminate: True +doc_to_decontamination_query: question +metric_list: + - metric: acc + aggregation: mean + higher_is_better: true +metadata: + version: 0.0 diff --git a/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyTruthfulQA_mc2.yaml b/scripts/yans/lm-evaluation-harness/lm_eval/tasks/tinyBenchmarks/tinyTruthfulQA_mc2.yaml new file mode 100644 index 0000000000000000000000000000000000000000..49338cd700037ad23c2a644792e2073bb71a71c2 --- /dev/null +++ 
"""This code mirrors the utils of the original winogrande task"""


def doc_to_text(doc):
    """Map the gold answer label ("1"/"2") to its 0-based choice index.

    Note: under the harness's multiple-choice convention this index is the
    "text" slot, while `doc_to_target`/`doc_to_choice` supply the strings.
    """
    return {"1": 0, "2": 1}[doc["answer"]]


def doc_to_target(doc):
    """Return the shared continuation: everything after the "_" blank, trimmed."""
    sentence = doc["sentence"]
    cut = sentence.index("_") + 1
    return sentence[cut:].strip()


def doc_to_choice(doc):
    """Build both candidates: the prefix before "_" followed by each option."""
    sentence = doc["sentence"]
    prefix = sentence[: sentence.index("_")]
    return [prefix + option for option in (doc["option1"], doc["option2"])]